Monday, February 04, 2008

A database of everything

Nothing like a little hubris first thing Monday morning...

After various experiments, such as a triple store for ants (documented on the Semant blog) and bioGUID (documented on the bioGUID blog), I'm starting from scratch and working on a "database of everything". Put another way, I'm working on a database that aggregates metadata about specimens, sequences, literature, images, taxonomic names, etc. But beyond "merely" aggregating data I'm really interested in linking the data. Here are some design decisions I've made so far.

The first is that the database uses the Entity–Attribute–Value (EAV) model to store metadata (for some background see the papers I've bookmarked on Connotea with the tag entity–attribute–value). This tends to send traditional database managers into fits, but for my purposes (storing objects of different kinds with frequently sparse metadata) it seems a natural choice. It's a little bit like a triple store, but I find triple stores frustrating when creating a database. Given the often low quality of the incoming data, a lot of post-processing may be needed, and having the data easily accessible makes this task easier -- I view triple stores as "read only" databases. The data I store will be made available as RDF, so users could still use Semantic Web tools to analyse it.
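To give a flavour of the idea (this is just a sketch, not the actual schema -- the table and column names here are made up), an EAV store can be as simple as a single three-column table, one row per object/attribute/value triple:

import sqlite3

# Minimal EAV sketch: one row per (object, attribute, value) triple, so
# objects of very different kinds, each with sparse metadata, share one table.
# Table and attribute names are hypothetical, not the actual schema.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE eav (
        object_id TEXT,   -- identifier of the object (an md5 hash, see below)
        attribute TEXT,   -- e.g. 'type', 'doi', 'latitude'
        value     TEXT
    )
""")

rows = [
    # an article...
    ("d00cf429c001c3c7ae4f2d730718dcc8", "type", "article"),
    ("d00cf429c001c3c7ae4f2d730718dcc8", "doi", "10.2196/jmir.5.4.e27"),
    # ...and a specimen, with quite different metadata, in the same table
    ("<md5 of a specimen GUID>", "type", "specimen"),
    ("<md5 of a specimen GUID>", "latitude", "-27.5"),
]
db.executemany("INSERT INTO eav VALUES (?, ?, ?)", rows)

# All the metadata for one object, however sparse
for attribute, value in db.execute(
        "SELECT attribute, value FROM eav WHERE object_id = ?",
        ("d00cf429c001c3c7ae4f2d730718dcc8",)):
    print(attribute, value)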

Each object stored in the database gets an identifier based on an md5 hash of one of its GUIDs (e.g., a PubMed identifier, DOI, or URL), a technique used by del.icio.us and Connotea (for example, the md5 hash of "http://dx.doi.org/10.2196/jmir.5.4.e27" is d00cf429c001c3c7ae4f2d730718dcc8, hence the Connotea URI for this article is http://www.connotea.org/article/d00cf429c001c3c7ae4f2d730718dcc8). The URIs for the objects in my database will also use these md5 hashes. The original GUIDs (such as DOIs) are stored within the database.
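For example (a sketch only -- exactly how the GUID string is normalised before hashing is glossed over here):

import hashlib

# md5 of a GUID string gives a fixed-length, URL-friendly identifier.
# (How the GUID is normalised before hashing is glossed over here.)
guid = "http://dx.doi.org/10.2196/jmir.5.4.e27"
object_id = hashlib.md5(guid.encode("utf-8")).hexdigest()
print(object_id)  # d00cf429c001c3c7ae4f2d730718dcc8, as in the Connotea example above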

Much of the programming for this project involves retrieving the metadata associated with an identifier. Taking the code I wrote for bioGUID as a starting point, I've added a lot of code to try and clean up the metadata (fixing dates, extracting latitudes and longitudes, and such like; see Metacrap), but also to extract additional identifiers from text. The database is fundamentally about these links between objects. Typically I write a client for a service that returns metadata for an identifier (usually as XML), transform that into JSON, then post-process it.
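The pattern looks roughly like this (a sketch only -- the URL and field mapping below are purely illustrative, and each real service needs its own client and its own clean-up code):

import json
import urllib.request
import xml.etree.ElementTree as ET

def fetch_metadata(url, mapping):
    # Fetch an XML record and flatten a few fields into a JSON-ready dict.
    # url and mapping (XPath -> key) are illustrative; a real client is
    # service-specific and does a lot more post-processing.
    with urllib.request.urlopen(url) as response:
        root = ET.fromstring(response.read())
    record = {}
    for xpath, key in mapping.items():
        element = root.find(xpath)
        if element is not None and element.text:
            record[key] = element.text.strip()
    return record

# Hypothetical usage, mapping a couple of fields from some metadata service:
# metadata = fetch_metadata("http://example.org/record.xml",
#                           {".//title": "title", ".//date": "date"})
# print(json.dumps(metadata, indent=2))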

Another design decision is that names of genes and taxa are treated as tags, postponing any judgement about what the names actually refer to, as well as whether two or more tags refer to the same thing. The expectation is that much of this can be worked out after the fact (in much the same way as I did for the TbMap project).
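In other words, a name is just a string attached to an object, something like this (again, just a sketch of the idea, with made-up example records):

from collections import defaultdict

# Names are simply tags: map each name string to the set of objects that
# carry it, and defer deciding whether two names refer to the same thing.
tags = defaultdict(set)

def tag(object_id, name):
    tags[name.strip()].add(object_id)

# Hypothetical examples: an article and a sequence both tagged with a name
tag("d00cf429c001c3c7ae4f2d730718dcc8", "Drosophila melanogaster")
tag("<md5 of a sequence GUID>", "Drosophila melanogaster")

# Working out what the name refers to (and merging synonyms) can be
# done after the fact, in much the same spirit as TbMap.
print(tags["Drosophila melanogaster"])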

The database is being populated in two ways, spidering and bulk upload. Spidering involves taking an identifier, finding any linked identifiers, and resolving those. For example, a PubMed identifier may link to a set of nucleotide sequences, many of which may link to a specimen code. Bulk upload involves harvesting large numbers of records and adding them to the database.
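Spidering is essentially a breadth-first crawl over identifiers. A sketch (resolve_links here stands in for the service-specific clients mentioned above, and is assumed to return both the metadata and any linked identifiers):

from collections import deque

def spider(seed_identifier, resolve_links, store, limit=1000):
    # Breadth-first crawl: resolve an identifier, store its metadata,
    # then queue up any identifiers it links to.
    queue = deque([seed_identifier])
    seen = set()
    while queue and len(seen) < limit:
        identifier = queue.popleft()
        if identifier in seen:
            continue
        seen.add(identifier)
        metadata, linked_ids = resolve_links(identifier)
        store(identifier, metadata)
        for linked in linked_ids:
            if linked not in seen:
                queue.append(linked)

# e.g. start from a PubMed identifier, follow links to nucleotide sequences,
# and from those to specimen codes.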

The initial focus is on adding records related to TreeBASE, with a view to creating an annotated version of that database such that a user could ask questions like "find me all phylogenies in TreeBASE that include taxa from Australia."

6 comments:

Unknown said...

I would be glad to participate if you need some help.

Cheers
Paulo

Roderic Page said...

Thanks. I think what I need to do first is get the code working properly, and documented sufficiently to make it public. I'll keep you posted.

Drycafe said...

Hi Rod - very exciting to hear you're pursuing this! I have been hatching similar ideas towards a generic, semantics-enabled deployable trait database.

The NCBO and affiliated groups have done some significant work in this direction which you might want to check out.

More specifically, they have conceived an EAV-based model (now called the EQ model, as the A and V are subsumed into a Quality) for formalizing phenotype descriptions, which we have adopted within PhenoScape (http://phenoscape.org) to formalize evolutionary characters in a compatible way. More relevant to your project, Chris Mungall has done some significant prototype work towards OBD, a database for any kind of ontology-annotated biological data (http://www.bioontology.org/wiki/index.php/OBD:Main_Page has an overview).

Roderic Page said...

Ironically, after having given TreeBASE such a hard time for not having an ontology (i.e., a taxonomy), I find myself increasingly sceptical about large-scale ontologies. Perhaps it's the result of reading too many books like Everything Is Miscellaneous, but things like automatic clustering of tags (e.g., here) look promising. Your recent post Integrating ontologies is a mess doesn't fill me with confidence that big ontologies are the way to go. I think global identifiers and linking (perhaps with some very basic semantics, such as x cites y) are the better bet. There's much to be said for keeping things simple...

Unknown said...

Yep, let me know. It would be a pleasure to help.

Cheers

Drycafe said...

For the ideas I was mentioning, the way I like to think about (e.g., trait) ontologies in this context is more along the lines of user-driven, agile, mass development of controlled vocabularies, not what I imagine you mean by 'big' ontologies. You may have come across some interesting work in this regard by Mark Wilkinson (UBC, not NHM), for example
http://psb.stanford.edu/psb-online/proceedings/psb06/good.pdf and http://hdl.nature.com/10101/npre.2007.945.2

But indeed what you're aiming for may not need any ontologies. BTW have you thought about building this on top of Freebase? I just wish the software were available ...