Wednesday, December 15, 2010

TreeBASE, again

My views on TreeBASE are pretty well known. Lately I've been thinking a lot about how to "fix" TreeBASE, or indeed, move beyond it. I've made a couple of baby steps in this direction.

The first step is that I've created a group for TreeBASE papers on Mendeley. I've uploaded all the studies in TreeBASE as of December 13 (2010). Having these in Mendeley makes it easier to tidy up the bibliographic metadata, add missing identifiers (such as DOIs and PubMed ids), and correct citations to non-existent papers (which can occur if at the time the authors uploaded their data the planned to submit their paper to one journal, but it ending up being accepted in another). If you've a Mendeley account, feel free to join the group. If you've contributed to TreeBASE, you should find your papers already there.

The second step is playing with CouchDB (this years new hotness), exploring ways to build a database of phylogenies that has nothing much to do with either a relational database or a triple store. CouchDB is a document store, and I'm playing with taking NeXML files from TreeBASE, converting them to something vaguely usable (i.e., JSON), and adding them to CouchDB. For fun, I'm using my NCBI to Wikipedia mapping to get images for taxa, so if TreeBASE has mapped a taxon to the NCBI taxonomy, and that taxon has a page in Wikipedia with an image, we get an image for that taxon. The reason for this is I'd really like a phylogeny database that was visually interesting. To give you some examples, here are trees from TreeBASE (displayed using SVG), together with thumbnails of images from Wikipedia:

myzo.png


troidini.png


protea.png


Snapshot 2010-12-15 10-38-02.png


Everything (tree and images) is stored within a single document in CouchDB, making the display pretty trivial to construct. Obviously this isn't a proper interface, and there's things I'd need to do, such as order the images in such a way that they matched the placement of the taxa on the tree, but at a glance you can see what the tree is about. We could then envisage making the images clickable so you could find out more about that taxon (e.g., text from Wikipedia, lists of other trees in the database, etc.).

We could expand this further by extracting geographical information (say, from the sequences included in the study) and make a map, or eventually a phylogeny on Google Earth) (see David Kidd's recent "Geophylogenies and the Map of Life" for a manifesto doi:10.1093/sysbio/syq043).

One of the big things missing from databases like TreeBASE is a sense of "fun", or serendipity. It's hard to find stuff, hard to discover new things, make new connections, or put things in context. And that's tragic. Try a Google image search for treebase+phylogeny:

treebasephylogeny.png

Call me crazy, but I looked at that and thought "Wow! This phylogeny stuff is cool!" Wouldn't it be great if that's the reaction people had when they looked at a database of evolutionary trees?