Source-centric Genealogy

Thursday, June 19, 2014

FamilySearch Record Hinting and Linking: Source-centric debut

Today is a great day, because FamilySearch has finally released its "record hinting" feature. When you go to a person page in Family Tree, there is now a little section on the right that shows possibly matching records. Or, more specifically, possibly matching personas in indexed historical records.

When you click on one, you see the persona from the historical record on the left, along with their one-hop relatives (parents above, spouse just below, children below that, and siblings below that). Anyone who isn't one of those kinds of relatives (as far as the record specifies) is listed as "other" at the bottom.

On the right is the corresponding person from Family Tree, along with their one-hop relatives, all lined up with the corresponding people from the historical record (as far as the matching algorithms and data structure can determine).

When you decide that these really are the same person, click "Attach" and the two corresponding people turn green, indicating that they are linked up. At that point, corresponding relatives now have an "Attach" link between them, indicating that they can be attached, too.

If some of the relatives aren't aligned right, you can drag the record persons up and down to align them with the right person in the tree. Misalignment can happen if the match algorithm makes a mistake, or if the data in the record or the tree is a bit off, or, most commonly, if the original record didn't contain relationship information. For example, the 1850-1870 U.S. Census collections don't have a "relationship to head" column, so everyone except the focus person is listed down in "Other on Record" until you drag them up to where you believe they go.

For each relative, you can click "attach", click to copy any new data from the record into the person, click the blue "Attach" button to confirm, and then both people turn green to indicate that they are linked up.

If a person in the record does not appear in Family Tree, then you can click "add" to create a corresponding person in Family Tree, including relationships to the "main" person being dealt with. (You will then have to click "Attach" to finish attaching the persona in the record to the newly-created person in the tree. That should probably happen automatically, so maybe that will get fixed.)

When everyone in the record is green, you know that all of the people in the record are accounted for in Family Tree. And because you have linked them up, the system knows this, too.

(One gal I was helping with family history was an avid board game player, and her eyes lit up when I told her, "For each person you turn green, you get one victory point!")

First full-blown source-centric feature

It may not be obvious to everyone what a big deal these features are. They represent the first major foray of Family Tree into the world of source-centric genealogy. Now, not only can users know what the sources say about their ancestors, but the system can understand that, too. Because persons in Family Tree are linked to personas in the records, the system knows who is who, and thus what information each historical record contains about each person in the Family Tree. It also knows which personas in the record are not yet linked into the tree, and it often knows how those personas are related to people who are in the tree.

Family history seems to have at least 3 main stages:

Living memory. Research your first four generations (up to your great-grandparents) by talking to living people and asking what they remember; looking at personal artifacts lying around your house or the home of a living relative, and so on. You and your living relatives (and the stuff you have in your homes) are the main sources of information.
Recent records: direct evidence. Beyond that, you start into the world of recent records often back to around 1800 or so (depending on where you're researching). There are often quite a few records available, including juicy ones like census records that list entire households. These often provide direct evidence of people and their relationships to other people.
Old records: indirect evidence. Beyond that, things get murkier. There are fewer records, often with less information, and it is less common to find any single record that lists both people in a relationship. You end up having to make larger leaps of logic ("This was the only Mr. Turner in the area, so he must have been the father of Henry...") based on incomplete records.

These new features take great advantage of the recent records. They help users find sources that mention people they already know about, link them up so the system now understands what those sources say about those people, and then add any new information or relatives to the tree. This is the primary way in which many users will grow the tree and discover new relatives. These features have unlocked solid genealogical research for the masses.

Monday, May 05, 2014

Preserving your old family photos at FamilySearch

FamilySearch.org now allows you to upload photos of your ancestors, tag the faces with their names, type a description of the photo, and then link the tagged faces to individuals in the Family Tree. This, in turn, makes those tagged faces show up in pedigree charts and elsewhere in the Family Tree, and all of the pictures that someone is tagged in can appear in the "memories" tab of a FamilySearch person page.

The thing that makes me especially happy about this is that there is finally somewhere I can put my family photos where (1) I feel like they will be safe long-term; and (2) others who are interested in them are likely to find them.

Longevity. The first part is especially important to me. Storing photos on your hard drive just doesn't cut it, as hard drives crash, and when you die, it is likely your computer will get wiped clean eventually. Any service that requires a subscription can't work long-term because, again, once you die, you stop paying, and your photos disappear.

I was somewhat interested in 1000memories.com a couple of years ago, and they were claiming to be able to preserve your photos forever. But how can any company do that? They can't guarantee that they'll be in business 5 years from now, because it takes money to keep the doors open, and the market is fickle. I e-mailed and was given great assurances of their longevity.

A year later, Ancestry.com bought them, and I got an e-mail saying I had 30 days to start paying or my photos would all be purged. So much for long-term.

FamilySearch, on the other hand, doesn't depend on subscriptions or profitability to stay in business. It is sponsored by The Church of Jesus Christ of Latter-day Saints, which, if you ask them, will be around at least through the "millennium". And they have so many images scanned from historical documents (over a billion so far) that a few million old family photos shouldn't be a noticeable burden.

They even have people to screen the photos to make sure that only family-friendly photos end up on the site.

So although nothing is fool-proof, this seems like the best solution I have heard so far. Still, though, keep a copy for yourself, and share one with all your descendants.

Sharing. The other thing I love about putting old photos on FamilySearch.org is that it is a great place for others to find them. Once the face of an ancestor is linked to their entry in the free, collaborative Family Tree, anyone who is related to that person and comes across their entry in the tree will immediately notice that there is at least one photo of them. By going to the "Memories" tab of the person page, they will see all the rest of them. So those who are interested in that person will naturally come across all of the photos of them that others have shared.

Similarly, I keep finding more and more of my own relatives that someone else has linked to photos. I have been able to see photos that I never would otherwise have known existed.

Suggestions. There are a few things I would like to see FamilySearch do to move this further along towards an ideal solution, of course. These include the following.

Support standard metadata. Currently you have to tag faces and enter a title and description in the UI. But if you have used any other software to enter captions, date, place, and even tag faces, it would be great if FamilySearch would recognize those, and at least give the user the option to use them.

Similarly, if you download a photo, it would be good if the face tags, title and description would be included in the XMP data for the file, including Metadata Working Group face tagging standards (with an extension to also include Family Tree long-lived URIs for the persons involved).
If FamilySearch were to lead out in this, perhaps other genealogical software would start taking advantage of these same standards as well. Then you could tag faces in any number of software packages and take advantage of it in others.

Simple fixes. It would be nice to at least support rotate, and perhaps crop, contrast/auto-levels, etc.
Face identification would let users pick a pre-calculated face box instead of having to draw one by hand, which would speed up tagging a little. Face recognition would be cool, too, but might not be worth the trouble, especially since they don't necessarily encourage having large numbers of photographs for the same person, like you might for more recent digital pictures of living people.

Overall, though, I thought this set of features was really well done.

Tuesday, August 30, 2011

Storing stuff both locally and online

Storing stuff locally has the advantage of instant access. You can browse thousands of photos quickly in a way you can't do when they're stored online (though Deep Zoom or SeaDragon techniques can help, as shown in my example). You can navigate or search a family tree instantly when it's stored locally in a way you can't do when the data is stored online.

On the other hand, when something is stored online, it can be accessed from other computers and, more importantly, can be shared with others. It is also likely backed up better than your hard drive typically is.

But trying to store a collection of resources (such as a photo collection, or family tree data) both locally and online often results in a nightmare of trying to keep them synchronized, as you edit, tag, reorganize and even delete things.

One solution is to take an approach similar to Google Docs, where a collection is always modified through a series of deltas. Whether a collection is being modified locally or online, deltas get sent (or stored for later retrieval) so that the different copies are eventually consistent. A mechanism allows for resolution of occasional conflicts.

As an example, let's say I wanted to organize and archive a collection of photos on my hard drive, but wanted to make it available online so that others could help tag faces, and so that they could enjoy seeing them as well.

I could scan the photos into folders. Then I could launch a local "photo archive" app. As I add photos to my archive, it assigns each photo a globally unique permanent ID. It also adds tags to the photo's XMP metadata indicating what its original physical arrangement was (based on folder structure). I could also add information to each folder indicating what physical container it represents.
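
To make the import step a little more concrete, here is a minimal Python sketch of what such an app might do when it first scans the folders. Everything in it is an assumption for illustration: the function name build_catalog, the field names, and the list of file extensions are made up, and the sketch only builds an in-memory catalog rather than writing anything into XMP.

```python
import uuid
from pathlib import Path


def build_catalog(root, extensions=(".jpg", ".jpeg", ".png", ".tif", ".tiff")):
    """Give every photo under `root` a permanent ID and remember its original folder."""
    root = Path(root)
    catalog = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            relative = path.relative_to(root)
            catalog.append({
                # Random UUID: globally unique without needing a central registry.
                "id": str(uuid.uuid4()),
                "file_name": path.name,
                # Folder path relative to the archive root stands in for the
                # physical container (box, envelope, album) the photo came from.
                "original_container": relative.parent.as_posix(),
            })
    return catalog
```

A real app would also have to store the ID inside each file, so that the ID stays with the photo when it is copied somewhere else.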

I could then have my desktop app push the photos and metadata up to an online repository. From then on, any changes I make using a web interface get queued up on the server, and the next time my desktop app connects, it applies the same changes locally. A change log is part of the database, so that changes can be viewed or rolled back. Deletions flag the photos as deleted without actually purging them, so that this, too, can be undone. An actual purge can also be done, in order to reclaim hard drive space, but a user would be prompted before allowing this on their local hard drive.
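
Here is a rough sketch, in Python, of the kind of change log I have in mind. The class and method names (Change, Archive, set_field, roll_back and so on) are hypothetical, and a real system would persist the log and handle conflicts. The point is only that every edit, including a deletion, is a small recorded delta that can be replayed on another copy or reversed later.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    seq: int          # position of this change in the log
    photo_id: str
    field_name: str   # e.g. "caption" or "deleted"
    old: object       # value before the change (None if the field was unset)
    new: object       # value after the change


@dataclass
class Archive:
    photos: dict = field(default_factory=dict)  # photo_id -> dict of fields
    log: list = field(default_factory=list)     # append-only list of Change

    def set_field(self, photo_id, field_name, value):
        record = self.photos.setdefault(photo_id, {})
        change = Change(len(self.log), photo_id, field_name,
                        record.get(field_name), value)
        record[field_name] = value
        self.log.append(change)
        return change

    def delete(self, photo_id):
        # Soft delete: flag the photo instead of purging it, so it can be undone.
        return self.set_field(photo_id, "deleted", True)

    def roll_back(self, change):
        # Rolling back is itself logged as a new change restoring the old value.
        return self.set_field(change.photo_id, change.field_name, change.old)

    def changes_since(self, seq):
        return self.log[seq:]

    def apply_remote(self, changes):
        # Replay changes queued on the other copy.
        for change in changes:
            self.set_field(change.photo_id, change.field_name, change.new)


# Edits made through the web interface are queued on the server...
server, desktop = Archive(), Archive()
server.set_field("photo-1", "caption", "Grandpa, 1932")
server.delete("photo-1")
# ...and applied locally the next time the desktop app connects.
desktop.apply_remote(server.changes_since(0))
```

Because the log is append-only, a rollback never erases history. It just adds one more entry.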

I could then invite family members to go view the archive of photos online and help tag faces, or estimate when and where the photos were taken.

I could also take the "default primary logical arrangement", which mirrors the hard drive structure, and rearrange photos, reordering within a folder; creating new folders and rearranging those; moving photos from one to the other, etc. As logical arrangements are made, metadata is embedded in the XMP metadata again, so that the database can be reconstructed from the raw files if needed. Names of folders and photos could physically contain numeric prefixes to get them to sort properly in a typical OS; but a UI could hide that (or at least automatically update it) as resources are moved around.

If I modify photos outside of the organizer, the organizer can re-scan the metadata of the photos to figure out what changes have happened. If I copy the whole folder of images (or a subset thereof) to a new computer, I could run the same app (or some future derivative), and it could rebuild the database from the metadata within the images. If I lose interest or kick the bucket, my family members still have access to my photos online, and can grab copies of the ones they're interested in.

And, of course, it would be necessary to be able to restrict public access to certain photos for the sake of privacy.

The same approach could be taken for a family tree. A local family tree could be imported from a GEDCOM file and added to a local database. An online database could be created with a copy of the local one. Any changes made to either one would be added to the change log and sent to the other when the local database can connect to the Internet. That way, lightning-fast display and editing can happen locally, while global (though privacy-controlled) access and backup happen online. Synchronization is almost completely automatic, except in the rare case where two desktop apps edit the same person between synchronizations, at which point the system could do its best to arbitrate and let the user override its defaults if they want to.
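
The arbitration step might look something like the following Python sketch. It assumes each change is a simple tuple of (person ID, field, new value, timestamp), and it picks "most recent edit wins" as the default. Both of those are my own assumptions for illustration, not a description of any existing system. The person IDs in the example are made up, too.

```python
def merge_changes(local_changes, remote_changes):
    """Merge two sets of edits made since the last synchronization.

    Returns (merged, conflicts). `merged` maps (person_id, field) to the value
    chosen by default; `conflicts` lists the cases a user may want to review.
    """
    local = {(p, f): (v, t) for p, f, v, t in local_changes}
    remote = {(p, f): (v, t) for p, f, v, t in remote_changes}
    merged, conflicts = {}, []
    for key in sorted(local.keys() | remote.keys()):
        mine, theirs = local.get(key), remote.get(key)
        if mine is not None and theirs is not None and mine[0] != theirs[0]:
            # Both copies changed the same field to different values.
            winner = mine if mine[1] >= theirs[1] else theirs  # default: latest edit
            conflicts.append((key, mine[0], theirs[0]))
            merged[key] = winner[0]
        else:
            merged[key] = (mine if mine is not None else theirs)[0]
    return merged, conflicts


local_edits = [("P1", "birth_date", "1850", 10), ("P1", "name", "Henry Turner", 11)]
remote_edits = [("P1", "birth_date", "1851", 12), ("P2", "name", "Mary Turner", 9)]
merged, conflicts = merge_changes(local_edits, remote_edits)
# conflicts holds the one disputed field: P1's birth_date, "1850" vs. "1851"
```

Edits to different people, or to different fields of the same person, merge without any fuss. Only the disputed birth date ends up in the list for the user to look at.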

Sunday, February 20, 2011

Archiving that box of stuff for posterity

Everyone has that box of stuff--old family photos, certificates, a family bible with genealogical data in the front, an ancestor's journal, and so on. It may be spread throughout the house and attic rather than gathered in one box, but most people have some of this stuff. If properly preserved, it can be a priceless treasure trove to current and future generations. If mismanaged, it can get destroyed, thrown out, or get so disorganized that it loses much of its value.

For example, a shoe box full of photos without labels on them can, within one generation, go from being precious family history to being a worthless bunch of photos that nobody can identify. Labeling photos on the back is a great first step, but I'm hoping we can all figure out a reasonable way to digitally archive photos, documents and other things for posterity.

There are a few principles that should be kept in mind as we figure out how to do this.

Arrangement

Physical arrangement. The physical arrangement is the physical grouping and ordering of items within a group, where an "item" can be a single item such as a photo, or a sub-group (such as a box of slides) within a larger group (such as a cardboard box or an attic). Physical arrangement is important because it provides important context that can help us make sense of resources. If we know which photos are part of one box of slides, for example, and the other photos in the box are all of one side of the family, then a face we are having trouble identifying can more reliably be placed on that side of the family. Or if one box is chronologically out of order, the entire box of photos can be moved before another one instead of having to figure that out for each photo, which might otherwise be impossible.
Logical arrangement(s). A "logical arrangement" seeks to organize resources according to some useful scheme, such as chronologically; or in groups like "family" and "travel". Even when arranged chronologically, it may be arranged by "event" such as "trip to Hawaii", or by strict year, month and day. It is even possible to have multiple logical arrangements, though it might be helpful for one of them to be the "primary" one, especially if the files are physically stored according to one of them. (Non-primary logical arrangements would be free to include only a subset of the resources, while the primary one would include them all once and only once).
Digital arrangement. By "digital arrangement", I mean the folder structure on a hard drive. We could choose to have the digital arrangement mirror the physical arrangement. Often this requires prepending a zero-padded number to the beginning of a folder or file name in order to get the files or folders to sort properly in typical operating systems. We could, however, choose to have the digital arrangement mirror a "logical arrangement" (i.e., the "primary" one).
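
As a small illustration of the zero-padded prefix trick, here is a Python sketch. The function name prefixed_names and the underscore separator are just my choices for the example.

```python
def prefixed_names(names, width=None):
    """Prepend a zero-padded position to each name so plain sorting keeps the order."""
    if width is None:
        # At least three digits, and more if the list is long enough to need them.
        width = max(3, len(str(len(names))))
    return [f"{i:0{width}d}_{name}" for i, name in enumerate(names, start=1)]


folders = ["Trip to Hawaii", "Wedding", "Attic box"]
print(prefixed_names(folders))
# ['001_Trip to Hawaii', '002_Wedding', '003_Attic box']

# Without padding, a plain text sort puts "10" before "2":
print(sorted(["2_Wedding", "10_Reunion"]))
# ['10_Reunion', '2_Wedding']
```

An organizer could strip the prefix for display and regenerate it whenever items are moved around.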

Digital preservation. In addition to initially digitizing photos, audio, movies, documents and other resources, it is important that the resources be protected against being lost or corrupted.

Backup. Hard drives fail. DVDs and CDs degrade. It is important that data be stored in more than one place, and organized well enough that we know when one resource is a duplicate or backup of another one. Ideally, things would be backed up online in more than one place.
Format shift. Still have a 5.25" floppy drive? Me, neither. Media formats change, so digital data needs to be migrated from one format to another as that happens. File formats go obsolete, too, so data needs to be migrated from WordPerfect to MS Word .docx; or from JPEG to whatever the next thing is. We usually don't think of doing this very often, so ideally, an online preservation service would do this automatically for all of its resources.
Apathy. Your grandfather passes away, and you only have 2 days off of work to go through all of his stuff. You don't have time to make sense of all those files on his ancient PC, so you wipe the hard drive and drop it off at a local charity for resale. So much for his lifelong efforts to digitize, tag and preserve precious family photos. A lady I know went to her grandfather's house after he passed away, and before she arrived, her sister had thrown out the journals that he had kept for his whole life. You never know how ignorant people are going to be when it comes to precious resources like these, so they need to be kept somewhere that posterity can still access and use them, regardless of who gets entrusted with the original resources temporarily.

Sharing. Resources can and should be shared with others, but often only a piece or subset of the collection is shared. Those who end up with one photo from a collection should have a way of reconnecting with the original collection. Again, this could be addressed by an online archive with long-lived URLs for resources and collections of resources (and collections of collections, and so on). A photo could then point to one or more online copies of where information about its collection structure can be found.

I was intrigued by the Saturday morning keynote talk given at RootsTech 2011 about the Internet Archive. Their goal is to archive everything forever for free, as far as I could tell. As they are probably well aware, there is a big difference between just backing stuff up and "archiving" it, just as there is a big difference between a photo album and a shoe box. Knowing what you have, how it is arranged, and therefore what it "means" is almost as important as keeping it stored at all. The Internet Archive is one organization that might be able to store people's "box of stuff". I can imagine FamilySearch or other organizations offering "Preservation as a service", too. Ideally, a single user's archive would be stored in more than one of these, in case one organization goes under, has a disaster, or has a shift in priorities that puts their archives in danger.

Privacy is another tricky issue with archives. On the one hand, we want to preserve photos for posterity, which means that we want our posterity (as well as current living relatives) to have access to them. On the other hand, privacy laws in many countries (especially in Europe) make it illegal for one person to reveal information about another living person without their express permission. One option is to allow users to flag resources as public or not, and to allow other users to flag resources as non-public if they don't want the pictures or information out there (i.e., "opt out"). It's also possible to have a timeout on resources (e.g., 110 years later, any living people mentioned in the resource can be assumed to be dead). Or access could be restricted from the countries that have the stricter laws? Not sure. Comment if you have any good ideas on how to approach this part.
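
Just to show how the flag, opt-out and timeout ideas could fit together, here is a tiny Python sketch. The way the three rules are combined is purely my guess at one workable policy (the timeout overrides everything else), and the field names "year", "public" and "opted_out" are invented for the example.

```python
from datetime import date


def is_publicly_visible(resource, today=None, timeout_years=110):
    """Decide whether a resource may be shown to the public."""
    today = today or date.today()
    # After the timeout, anyone mentioned is assumed to be dead.
    if today.year - resource["year"] >= timeout_years:
        return True
    # Before that, the owner must mark it public and nobody may have opted out.
    return bool(resource.get("public")) and not resource.get("opted_out")


now = date(2011, 2, 20)
print(is_publicly_visible({"year": 1895, "public": False}, today=now))   # True
print(is_publicly_visible({"year": 1990, "public": True}, today=now))    # True
print(is_publicly_visible({"year": 1990, "public": True, "opted_out": True}, today=now))  # False
```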

Assuming that the privacy part can be figured out, though, we still need a way to archive things in a way that has a good chance of preserving resources and their context long-term.

Tuesday, December 07, 2010

Tagging people in photos, for posterity

I have long wished that there was a standard format for identifying people in digital photos. Facebook has brought face-tagging to the masses, and I believe that they have a pretty good model: You can tag a region of the image with either free-form text, or you can pick from a list. The beauty of the latter option is that not only do you attach a name to a face, but you attach an identity.

Once the computer knows not just the name that goes with that face, but the identity, it can do things like notify that user that they've been tagged in an image, and so on.

The thing that is missing, however, is that the tag exists only at Facebook. I have a collection of old photos, and in order to identify the individuals, I have had to resort to text files next to the photos in the computer; or cramming info into a file name; or adding captions to the photo metadata. But all of these leave some room for ambiguity, and none of them go beyond providing a simple name. And no software is going to know how to identify these people.

What we need is a standard metadata tagging scheme that can be used to identify a region in an image, and attach a name, and, optionally, URLs or typed identifiers to identify who this person is in various other systems. For example, I could tag my great-grandfather as "James Kay Polk Gray", and then attach a "new FamilySearch" person ID; and maybe a URL to an entry on "biographicalwiki.org"; and maybe another URL to my personal online family tree. By using multiple links and IDs, it is more likely that at least one of them will survive until someone wants to look the ancestor up.
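
To give a feel for what such a tag might carry, here is a sketch in Python. This is not any existing standard. The class name FaceTag, the field names, the JSON layout, the identifier value and the example.org URL are all placeholders I made up for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class FaceTag:
    name: str
    # Region as fractions of the image width and height (0.0 to 1.0), so the
    # tag still points at the right face after the image is resized.
    x: float
    y: float
    width: float
    height: float
    identifiers: dict = field(default_factory=dict)  # system name -> ID in that system
    urls: list = field(default_factory=list)         # pages about this person


def tags_to_json(tags):
    return json.dumps([asdict(tag) for tag in tags], indent=2, sort_keys=True)


def tags_from_json(text):
    return [FaceTag(**item) for item in json.loads(text)]


tag = FaceTag(
    name="James Kay Polk Gray",
    x=0.42, y=0.18, width=0.10, height=0.15,
    identifiers={"new-familysearch": "PLACEHOLDER-ID"},  # not a real person ID
    urls=["https://example.org/tree/james-kay-polk-gray"],  # placeholder URL
)
text = tags_to_json([tag])
restored = tags_from_json(text)
```

The text produced by tags_to_json is the part that would need a standard home inside the image file.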

I was hoping that Adobe's XMP format would provide a place for this kind of metadata, but someone from Adobe said that while this was an interesting idea, they hadn't intended XMP to include sub-image metadata. The president of IPTC (the organization that defines the metadata tags used in many digital photos) said that something like this was on their roadmap, but I don't know how that has progressed.

Facebook, iPhoto, Picasa and Photoshop Elements all allow you to tag faces in photos. However, it takes a while to do this, and if the tags aren't portable, then I'm not ready to spend the hours necessary to do it. I'm hopeful that the industry can come up with a standard for doing this sort of thing.

The same standard might even work for tagging words in a text image (e.g., for OCR); or for tagging words in a handwritten image (like in a genealogical document).

Wednesday, March 08, 2006

Source-centric genealogy overview

Welcome to the source-centric genealogy blog.

Source-centric genealogy is an approach to doing genealogy in which information is viewed as flowing from sources to evidence and finally to conclusions. This is done in such a way that conclusions can be traced back to the evidence they are based on, and the evidence knows what part of which source it came from. Another important aspect of source-centric genealogy is that it makes it possible to take a source and see what evidence has been extracted from it, and to see what conclusions have been drawn from a particular piece of evidence. This makes it possible to avoid unending duplication of work.

The essential elements of a source-centric genealogical system include:
1. A source authority, which tracks all known sources of genealogical data.
2. An artifact archive, which holds images of records and other digital artifacts for convenience in accessing original records.
3. A structured data archive, which holds structured genealogical data that has been extracted from individual sources. Its purpose is to accurately represent what a source says.
4. A family tree, which holds conclusions about what real people have lived and how they are related. Each person in the family tree has links to the various entries in the structured data archive that are believed to refer to the same real person.
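
As a minimal sketch of how the last two elements fit together, the Python below models entries in the structured data archive (called "personas" here) and persons in the family tree, with links that can be followed in both directions. The class and method names are hypothetical, the source authority is reduced to a bare source ID, and the artifact archive is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    """What one source says about one individual (structured data archive)."""
    persona_id: str
    source_id: str
    name: str


@dataclass
class Person:
    """A conclusion about a real person (family tree)."""
    person_id: str
    name: str
    persona_ids: set = field(default_factory=set)


class SourceCentricTree:
    def __init__(self):
        self.personas = {}
        self.persons = {}
        self._persons_by_persona = {}  # reverse index: persona_id -> person_ids

    def add_persona(self, persona):
        self.personas[persona.persona_id] = persona

    def add_person(self, person):
        self.persons[person.person_id] = person

    def attach(self, person_id, persona_id):
        # Record the link in both directions.
        self.persons[person_id].persona_ids.add(persona_id)
        self._persons_by_persona.setdefault(persona_id, set()).add(person_id)

    def sources_for_person(self, person_id):
        # From a conclusion back to the sources it rests on.
        return {self.personas[p].source_id for p in self.persons[person_id].persona_ids}

    def persons_for_persona(self, persona_id):
        # From a piece of evidence forward to the conclusions drawn from it.
        return sorted(self._persons_by_persona.get(persona_id, set()))

    def unattached_personas(self, source_id):
        # Entries in a source that nobody has linked into the tree yet.
        return sorted(pid for pid, p in self.personas.items()
                      if p.source_id == source_id and pid not in self._persons_by_persona)


tree = SourceCentricTree()
tree.add_persona(Persona("rec1-a", "record-1", "Henry Turner"))
tree.add_persona(Persona("rec1-b", "record-1", "Mary Turner"))
tree.add_person(Person("P1", "Henry Turner"))
tree.attach("P1", "rec1-a")
print(tree.unattached_personas("record-1"))  # ['rec1-b']
```

The reverse index is what lets someone start from a source and see what work has already been done on it.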

It is also important that verification work be tracked so that it, too, can be done "once" and "for all" instead of having to be repeated by everyone.

There were several papers on this topic by Randy Wilson at the 2002, 2003 and 2006 Family History Technology workshops presented at BYU. Below are links to each paper:

Bidirectional Source Linking: Doing Genealogy 'once' and 'for all'--explains the importance of being able to follow links from a source to conclusions as well as the other way around.
Efficient Genealogy through Personal Extraction and Automated Verification--explains the importance of extracting structured evidence from sources separately from making conclusions using that evidence. Also explains why it is important to track verification work.
High-level View of a Source-Centric Genealogical Model: "The Model with Four Boxes"--summarizes the most crucial elements of a source-centric genealogical model and explains why each is critical.