Nothing like a little hubris first thing Monday morning...
After various experiments, such as a triple store for ants (documented on the Semant blog) and bioGUID (documented on the bioGUID blog), I'm starting from scratch and working on a "database of everything". Put another way, I'm working on a database that aggregates metadata about specimens, sequences, literature, images, taxonomic names, etc. But beyond "merely" aggregating data I'm really interested in linking the data. Here are some design decisions I've made so far.
The first is that metadata is stored using the Entity–Attribute–Value (EAV) model (for some background see the papers I've bookmarked on Connotea with the tag entity–attribute–value). This tends to send traditional database managers into fits, but for my purposes (storing objects of different kinds with frequently sparse metadata) it seems a natural choice. It's a little like a triple store, but I find triple stores frustrating to work with when building a database. Given the often low quality of the incoming data, a lot of post-processing may be needed, and having the data easily accessible makes this task easier -- I view triple stores as "read only" databases. The data I store will still be made available as RDF, so users can use Semantic Web tools to analyse it.
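To make this concrete, here's a rough sketch of what an EAV-style store might look like in SQLite (the table and column names are purely illustrative, not the actual schema):

```python
# Illustrative EAV layout: one row per (object, attribute, value) triple,
# so objects with sparse or heterogeneous metadata fit naturally.
import sqlite3

conn = sqlite3.connect("everything.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS objects (
    id   TEXT PRIMARY KEY,  -- md5-based identifier (see next paragraph)
    type TEXT               -- e.g. 'publication', 'sequence', 'specimen'
);
CREATE TABLE IF NOT EXISTS eav (
    object_id TEXT REFERENCES objects(id),  -- entity
    attribute TEXT,                         -- e.g. 'title', 'latitude', 'doi'
    value     TEXT                          -- stored as text, parsed as needed
);
""")

conn.execute("INSERT OR IGNORE INTO objects VALUES (?, ?)",
             ("d00cf429c001c3c7ae4f2d730718dcc8", "publication"))
conn.execute("INSERT INTO eav VALUES (?, ?, ?)",
             ("d00cf429c001c3c7ae4f2d730718dcc8", "doi", "10.2196/jmir.5.4.e27"))
conn.commit()
```

Because everything sits in ordinary tables, cleaning up messy incoming data is just a matter of updating rows, which is where a triple store feels more awkward.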
Each object stored in the database gets an identifier based on an md5 hash of one of its GUIDs (e.g., PubMed identifier, DOI, URL), a technique used by del.icio.us and Connotea (for example, the md5 hash of "http://dx.doi.org/10.2196/jmir.5.4.e27" is d00cf429c001c3c7ae4f2d730718dcc8, hence the Connotea URI for this article is http://www.connotea.org/article/d00cf429c001c3c7ae4f2d730718dcc8). The URIs for the objects in my database will also use these md5 hashes, and the original GUIDs (DOIs, etc.) are stored within the database.
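In Python this is just a couple of lines (using the DOI example above):

```python
# Derive the object identifier by md5-hashing one of its GUIDs, here a DOI URL.
import hashlib

guid = "http://dx.doi.org/10.2196/jmir.5.4.e27"
object_id = hashlib.md5(guid.encode("utf-8")).hexdigest()
print(object_id)  # d00cf429c001c3c7ae4f2d730718dcc8, matching the Connotea URI above
```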
Much of the programming for this project involves retrieving metadata associated with an identifier. Taking the code I wrote for bioGUID as a starting point, I've added a lot of code to try and clean up the metadata (fixing dates, extracting latitudes and longitudes, and the like -- see Metacrap), but also to extract additional identifiers from the text. The database is fundamentally about these links between objects. Typically I write a client for a service that returns metadata for an identifier (usually as XML), transform it to JSON, then post-process it.
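As a rough sketch of what one of these clients looks like (not the actual bioGUID code), here's how metadata for a PubMed identifier could be fetched from NCBI's EUtils ESummary service, parsed from XML, and dumped as JSON ready for post-processing:

```python
# Hedged sketch of a metadata client: fetch XML for a PubMed id, flatten the
# named Item elements into a dict, and emit JSON for later cleaning.
import json
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def pubmed_summary(pmid):
    url = "{0}?db=pubmed&id={1}".format(EUTILS, pmid)
    with urllib.request.urlopen(url) as response:
        root = ET.fromstring(response.read())
    record = {"pmid": pmid}
    # ESummary wraps each field in an <Item Name="..."> element
    for item in root.findall(".//Item"):
        if item.text and item.text.strip():
            record[item.get("Name")] = item.text.strip()
    return record

if __name__ == "__main__":
    # an arbitrary PubMed identifier, purely for illustration
    print(json.dumps(pubmed_summary("17079492"), indent=2))
```

The real clients do considerably more work, particularly the cleaning and link-extraction steps.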
Another decision is that names of genes and taxa are treated as tags, postponing any judgement about what the names actually refer to, and about whether two or more tags refer to the same thing. The expectation is that much of this can be worked out after the fact (in much the same way as I did for the TbMap project).
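Purely as a hypothetical illustration, the idea is that names start life as plain tag strings attached to objects, and working out which tags refer to the same taxon is a separate, later step:

```python
# Hypothetical example: taxon names are stored as tags, with no commitment
# to what they denote; reconciling them comes later (much as TbMap mapped
# TreeBASE names after the fact).
records = {
    "seq-1": {"tags": ["Drosophila melanogaster"]},
    "seq-2": {"tags": ["D. melanogaster"]},  # same taxon? decide later
}

# A later reconciliation pass might fold variant spellings together:
synonyms = {"D. melanogaster": "Drosophila melanogaster"}
canonical = {obj: [synonyms.get(tag, tag) for tag in rec["tags"]]
             for obj, rec in records.items()}
print(canonical)
```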
The database is being populated in two ways: spidering and bulk upload. Spidering involves taking an identifier, finding any linked identifiers, and resolving those in turn. For example, a PubMed identifier may link to a set of nucleotide sequences, many of which may link to a specimen code. Bulk upload involves harvesting large numbers of records and adding them to the database.
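Here's a sketch of the spidering loop; the resolve() function and the toy identifiers are stand-ins for the real per-service clients:

```python
# Sketch of spidering: start from a seed identifier, resolve it, queue any
# linked identifiers found in its metadata, and keep going breadth-first.
from collections import deque

# Toy stand-in for the per-service clients: identifier -> metadata with links.
FAKE_METADATA = {
    "pmid:123":       {"links": ["genbank:AY0001", "genbank:AY0002"]},
    "genbank:AY0001": {"links": ["specimen:MVZ-1"]},
    "genbank:AY0002": {"links": []},
    "specimen:MVZ-1": {"links": []},
}

def resolve(identifier):
    # In practice this dispatches to a client for PubMed, GenBank, etc.
    return FAKE_METADATA.get(identifier, {"links": []})

def spider(seed, max_records=1000):
    seen, queue, records = set(), deque([seed]), {}
    while queue and len(records) < max_records:
        identifier = queue.popleft()
        if identifier in seen:
            continue
        seen.add(identifier)
        record = resolve(identifier)
        records[identifier] = record
        queue.extend(record.get("links", []))  # follow links to other objects
    return records

print(spider("pmid:123"))
```

Bulk upload skips the link-following and simply loads harvested records directly.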
The initial focus is on adding records related to TreeBASE, with a view to creating an annotated version of that database such that a user could ask questions like "find me all phylogenies in TreeBASE that include taxa from Australia."