Tuesday, August 30, 2011

Storing stuff both locally and online

Storing stuff locally has the advantage of instant access. You can browse thousands of photos quickly in a way you can't do when they're stored online (though Deep Zoom or SeaDragon techniques can help, as shown in my example). You can navigate or search a family tree instantly when it's stored locally in a way you can't do when the data is stored online.

On the other hand, when something is stored online, it can be accessed from other computers and, more importantly, can be shared with others. It is also likely backed up better than your hard drive typically is.

But trying to store a collection of resources (such as a photo collection, or family tree data) both locally and online often results in a nightmare of trying to keep them synchronized, as you edit, tag, reorganize and even delete things.

One solution is to take an approach similar to Google Docs, where a collection is always modified through a series of deltas. Whether a collection is being modified locally or online, deltas get sent (or stored for later retrieval) so that the different copies are eventually consistent. A mechanism allows for resolution of occasional conflicts.
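Here is a rough Python sketch of the delta idea, just to make it concrete. Everything in it (the `Delta` and `Replica` names, the `sync` function, the per-field layout) is my own assumption for illustration, not how Google Docs or any real product works, and it leaves conflict resolution out entirely.

```python
import uuid
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Delta:
    """One small change to one item (hypothetical structure)."""
    delta_id: str   # globally unique, so a copy can tell what it has already applied
    item_id: str
    key: str
    value: Any


class Replica:
    """One copy of the collection, either local or online."""

    def __init__(self):
        self.items = {}       # item_id -> {key: value}
        self.applied = set()  # ids of deltas already applied here
        self.outbox = []      # deltas made here, stored for later delivery

    def edit(self, item_id, key, value):
        delta = Delta(str(uuid.uuid4()), item_id, key, value)
        self.apply(delta)
        self.outbox.append(delta)
        return delta

    def apply(self, delta):
        if delta.delta_id in self.applied:
            return False  # already seen, so applying twice is harmless
        self.items.setdefault(delta.item_id, {})[delta.key] = delta.value
        self.applied.add(delta.delta_id)
        return True


def sync(a, b):
    """Exchange queued deltas so both copies end up with the same changes."""
    for delta in a.outbox:
        b.apply(delta)
    for delta in b.outbox:
        a.apply(delta)
    a.outbox.clear()
    b.outbox.clear()


local, online = Replica(), Replica()
local.edit("photo-1", "caption", "Grandpa at the lake")  # edited offline
online.edit("photo-1", "tagged_face", "Aunt May")        # edited on the web
sync(local, online)
print(local.items == online.items)
```

Because each delta carries its own ID, delivering the same delta twice does no harm, which is what lets the copies drift apart for a while and still converge.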

As an example, let's say I wanted to organize and archive a collection of photos on my hard drive, but wanted to make it available online so that others could help tag faces, and so that they could enjoy seeing them as well.

I could scan the photos into folders. Then I could launch a local "photo archive" app. As I add photos to my archive, it assigns each photo a globally unique permanent ID. It also adds tags to the photo's XMP metadata indicating what its original physical arrangement was (based on folder structure). I could also add information to each folder indicating what physical container it represents.
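As a sketch of that import step, the function below assigns each scanned file a permanent ID and records which folders it sat in. The `catalog` function and the record layout are hypothetical, and a real app would write these values into each file's XMP metadata rather than keeping them in a dictionary.

```python
import uuid
from pathlib import PurePosixPath


def catalog(paths, root):
    """Give every photo a permanent ID and remember its original folders."""
    records = {}
    for path in sorted(paths):
        relative = PurePosixPath(path).relative_to(root)
        photo_id = str(uuid.uuid4())  # globally unique, never reused
        records[photo_id] = {
            "file": relative.name,
            # Folder chain from outermost container to innermost.
            "physical_container": list(relative.parts[:-1]),
        }
    return records


scans = [
    "/scans/attic_box/slides_1962/img001.jpg",
    "/scans/attic_box/slides_1962/img002.jpg",
    "/scans/attic_box/loose/img003.jpg",
]
for photo_id, record in catalog(scans, "/scans").items():
    print(photo_id, record)
```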

I could then have my desktop app push the photos and metadata up to an online repository. From then on, any changes I make using a web interface get queued up on the server, and the next time my desktop app connects, it applies the same changes locally. A change log is part of the database, so that changes can be viewed or rolled back. Deletions flag the photos as deleted without actually purging them, so that this, too, can be undone. An actual purge can also be done in order to reclaim hard drive space, but the user would be asked to confirm before anything is purged from their local hard drive.
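A minimal sketch of the change log, the undoable delete and the confirmed purge might look like this. The `Archive` class and its method names are made up for illustration, and it only undoes the most recent change rather than rolling back to an arbitrary point.

```python
class Archive:
    """Tiny in-memory archive with a change log (hypothetical design)."""

    def __init__(self):
        self.photos = {}  # photo_id -> {"tags": set, "deleted": bool}
        self.log = []     # (action, photo_id, detail), oldest first

    def add(self, photo_id):
        self.photos[photo_id] = {"tags": set(), "deleted": False}
        self.log.append(("add", photo_id, None))

    def tag(self, photo_id, tag):
        tags = self.photos[photo_id]["tags"]
        if tag in tags:
            return  # nothing changed, so nothing to log
        tags.add(tag)
        self.log.append(("tag", photo_id, tag))

    def delete(self, photo_id):
        # Only flag the photo; the file itself stays put.
        self.photos[photo_id]["deleted"] = True
        self.log.append(("delete", photo_id, None))

    def undo_last(self):
        action, photo_id, detail = self.log.pop()
        if action == "tag":
            self.photos[photo_id]["tags"].discard(detail)
        elif action == "delete":
            self.photos[photo_id]["deleted"] = False
        elif action == "add":
            del self.photos[photo_id]
        return action

    def purge(self, confirmed):
        """Really remove flagged photos, but only if the user said yes."""
        if not confirmed:
            return []
        purged = [pid for pid, p in self.photos.items() if p["deleted"]]
        for pid in purged:
            del self.photos[pid]
        # Purged photos can no longer be restored, so drop their log entries.
        self.log = [entry for entry in self.log if entry[1] not in purged]
        return purged


archive = Archive()
archive.add("p1")
archive.tag("p1", "Grandma")
archive.delete("p1")
archive.undo_last()  # the delete is rolled back
print(archive.photos["p1"]["deleted"])
```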

I could then invite family members to go view the archive of photos online and help tag faces, or estimate when and where the photos were taken.

I could also take the "default primary logical arrangement", which mirrors the hard drive structure, and rearrange it: reordering photos within a folder, creating new folders and rearranging those, moving photos from one folder to another, and so on. As logical arrangements are made, they are again written into each photo's XMP metadata, so that the database can be reconstructed from the raw files if needed. Names of folders and photos could physically contain numeric prefixes to get them to sort properly in a typical OS, but a UI could hide that (or at least automatically update it) as resources are moved around.
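The numeric-prefix trick is easy to sketch. The helper names below are my own, and the sketch assumes that any leading run of digits followed by an underscore is a sort prefix, which would be wrong for a file that is genuinely named something like "1950_reunion.jpg".

```python
import re

_PREFIX = re.compile(r"^\d+_")


def strip_prefix(name):
    """What the UI would show: the name without its sort prefix."""
    return _PREFIX.sub("", name)


def with_prefixes(names_in_order):
    """Rewrite names so that plain alphabetical sorting matches the given order."""
    width = max(2, len(str(len(names_in_order))))  # zero-pad to at least 2 digits
    return [
        f"{position:0{width}d}_{strip_prefix(name)}"
        for position, name in enumerate(names_in_order, start=1)
    ]


# The user dragged "beach.jpg" to the front of the folder.
new_order = ["beach.jpg", "02_cake.jpg", "01_arrival.jpg"]
renamed = with_prefixes(new_order)
print(renamed)
print(sorted(renamed) == renamed)
```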

If I modify photos outside of the organizer, the organizer can re-scan the metadata of the photos to figure out what changes have happened. If I copy the whole folder of images (or a subset thereof) to a new computer, I could run the same app (or some future derivative), and it could rebuild the database from the metadata within the images. If I lose interest or kick the bucket, my family members still have access to my photos online, and can grab copies of the ones they're interested in.

And, of course, it would be necessary to be able to restrict public access to certain photos for the sake of privacy.

The same approach could be taken for a family tree. A local family tree could be imported from a GEDCOM file and added to a local database. An online database could be created with a copy of the local one. Any change made to either one would be added to the change log and sent to the other the next time the local database can connect to the Internet. That way, lightning-fast display and editing can happen locally, while global (though privacy-controlled) access and backup can happen online. Synchronization would be almost completely automatic, except in the rare case where two desktop apps edit the same person between synchronizations, at which point the system could do its best to arbitrate and let the user override its defaults if they want to.
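One way the arbitration could work is a field-by-field three-way merge against the last synchronized version of a person. This is only a sketch under my own assumptions: the `merge_person` function is hypothetical, a person is a flat dictionary of fields, and "keep the local value" is an arbitrary default that the user could override.

```python
def merge_person(base, local, remote):
    """Merge two edited copies of one person.

    base is the version both sides had at the last sync.
    Returns (merged, conflicts); each conflict is (field, local_value, remote_value).
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(local) | set(remote)):
        b, l, r = base.get(key), local.get(key), remote.get(key)
        if l == r or r == b:
            value = l        # both agree, or only the local side changed
        elif l == b:
            value = r        # only the remote side changed
        else:
            value = l        # both changed differently: default to local
            conflicts.append((key, l, r))
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {"name": "John Smith", "birth": "1901"}
local = {"name": "John A. Smith", "birth": "1900"}
remote = {"name": "John Smith", "birth": "1902", "death": "1975"}
merged, conflicts = merge_person(base, local, remote)
print(merged)
print(conflicts)
```

In this example the name and death date merge automatically, and only the birth year, which both sides changed, is reported for the user to settle.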




Sunday, February 20, 2011

Archiving that box of stuff for posterity

Everyone has that box of stuff--old family photos, certificates, a family bible with genealogical data in the front, an ancestor's journal, and so on. It may be spread throughout the house and attic rather than gathered in one box, but most people have some of this stuff. If properly preserved, it can be a priceless treasure trove to current and future generations. If mismanaged, it can get destroyed, thrown out, or get so disorganized that it loses much of its value.

For example, a shoe box full of photos without labels on them can, within one generation, go from being precious family history to being a worthless bunch of photos that nobody can identify. Labeling photos on the back is a great first step, but I'm hoping we can all figure out a reasonable way to digitally archive photos, documents and other things for posterity.

There are a few principles that should be kept in mind as we figure out how to do this.
  • Arrangement
    • Physical arrangement. The physical arrangement is the physical grouping and ordering of items within a group, where an "item" can be a single item such as a photo, or a sub-group (such as a box of slides) within a larger group (such as a cardboard box or an attic). Physical arrangement is important because it provides important context that can help us make sense of resources. If we know which photos are part of one box of slides, for example, and the other photos in the box are all of one side of the family, then a face we are having trouble identifying can more reliably be placed on that side of the family. Or if one box is chronologically out of order, the entire box of photos can be moved before another one instead of having to figure that out for each photo, which might otherwise be impossible.
    • Logical arrangement(s). A "logical arrangement" seeks to organize resources according to some useful scheme, such as chronologically, or in groups like "family" and "travel". Even a chronological arrangement may be grouped by "event", such as "trip to Hawaii", or by strict year, month and day. It is even possible to have multiple logical arrangements, though it might be helpful for one of them to be the "primary" one, especially if the files are physically stored according to one of them. (Non-primary logical arrangements would be free to include only a subset of the resources, while the primary one would include them all once and only once).
    • Digital arrangement. By "digital arrangement", I mean the folder structure on a hard drive. We could choose to have the digital arrangement mirror the physical arrangement. Often this requires prepending a zero-padded number to the beginning of a folder or file name in order to get the files or folders to sort properly in typical operating systems. We could, however, choose to have the digital arrangement mirror a "logical arrangement" (i.e., the "primary" one).
  • Digital preservation. In addition to initially digitizing photos, audio, movies, documents and other resources, it is important that the resources be protected against being lost or corrupted.
    • Backup. Hard drives fail. DVDs and CDs degrade. It is important that data be stored in more than one place, and organized well enough that we know when one resource is a duplicate or backup of another one. Ideally, things would be backed up online in more than one place.
    • Format shift. Still have a 5.25" floppy drive? Me, neither. Media formats change, so digital data needs to be migrated from one format to another as that happens. File formats go obsolete, too, so data needs to be migrated from WordPerfect to MS Word .docx; or from JPEG to whatever the next thing is. We usually don't think of doing this very often, so ideally, an online preservation service would do this automatically for all of its resources.
    • Apathy. Your grandfather passes away, and you only have 2 days off of work to go through all of his stuff. You don't have time to make sense of all those files on his ancient PC, so you wipe the hard drive and drop it off at a local charity for resale. So much for his lifelong efforts to digitize, tag and preserve precious family photos. A lady I know went to her grandfather's house after he passed away, and before she arrived, her sister had thrown out the journals that he had kept for his whole life. You never know how ignorant people are going to be when it comes to precious resources like these, so they need to be kept somewhere that posterity can still access and use them, regardless of who gets entrusted with the original resources temporarily.
  • Sharing. Resources can and should be shared with others, but often only a piece or subset of the collection is shared. Those who end up with one photo from a collection should have a way of reconnecting with the original collection. Again, this could be addressed by an online archive with long-lived URLs for resources and collections of resources (and collections of collections, and so on). A photo could then point to one or more online copies of where information about its collection structure can be found.
I was intrigued by the Saturday morning keynote talk given at RootsTech 2011 about the Internet Archive. Their goal is to archive everything forever for free, as far as I could tell. As they are probably well aware, there is a big difference between just backing stuff up and "archiving" it, just as there is a big difference between a photo album and a shoe box. Knowing what you have, how it is arranged, and therefore what it "means" is almost as important as keeping it stored at all.

The Internet Archive is one organization that might be able to store people's "box of stuff". I can imagine FamilySearch or other organizations offering "Preservation as a service", too. Ideally, a single user's archive would be stored in more than one of these, in case one organization goes under, has a disaster, or has a shift in priorities that puts their archives in danger.

Privacy is another tricky issue with archives. On the one hand, we want to preserve photos for posterity, which means that we want our posterity (as well as current living relatives) to have access to them. On the other hand, privacy laws in many countries (especially in Europe) make it illegal for one person to reveal information about another living person without their express permission. One option is to allow users to flag resources as public or not, and to allow other users to flag resources as non-public if they don't want the pictures or information out there (i.e., "opt out"). And it's possible to have a timeout on resources (e.g., 110 years later, any living people mentioned in the resource can be assumed to be dead). Or access could be restricted in the countries that have the stricter laws? Not sure. Comment if you have any good ideas on how to approach this part.
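To make the flag-plus-timeout idea concrete, here is one possible rule in Python. It is just one guess at how the pieces could combine: the field names are invented, dates are reduced to a year, and letting the 110-year timeout override the flags is my assumption, not a legal opinion.

```python
PRIVACY_TIMEOUT_YEARS = 110  # after this long, people shown are assumed dead


def is_publicly_visible(resource, current_year):
    """Decide whether a resource may be shown to the general public."""
    if current_year - resource["year"] >= PRIVACY_TIMEOUT_YEARS:
        return True  # old enough that the timeout applies
    # Otherwise the owner must have marked it public and nobody has opted out.
    return resource["public"] and not resource["opt_outs"]


photos = [
    {"name": "wedding_1895.jpg", "year": 1895, "public": False, "opt_outs": []},
    {"name": "reunion_1998.jpg", "year": 1998, "public": True, "opt_outs": []},
    {"name": "picnic_2005.jpg", "year": 2005, "public": True, "opt_outs": ["cousin_bob"]},
]
for photo in photos:
    print(photo["name"], is_publicly_visible(photo, current_year=2011))
```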

Assuming that the privacy part can be figured out, though, we still need a way to archive things in a way that has a good chance of preserving resources and their context long-term.