Sunday, May 21, 2006

Towards the ToL database - some visions

So, when I started this blog I promised to write something about phyloinformatics and the goal of a phylogenetic database. I've been playing around with various ideas, some of which have made it online, but most remain buried on various hard drives until they get written up into a usable state.

There are also numerous distractions and detours along the way, such as MyPHPBib, Taxonomic Search Engine, and LSIDs. Oh, and iSpecies, which got me into trouble with Google. Then there is a certain journal, and a certain person (but let's not go there...).

My point (and I do have one) is that maybe it's time to rethink some cherished ideas. Basically, my original goal of creating a phylogenetic database involved massive annotation, disambiguation of taxonomic names, and linking to global identifiers for taxonomic names, sequences, images, and publications. This is the project outlined at the start of this blog.

I still believe this would be worthwhile, and I've done a lot of the work for TreeBASE (e.g., mapping TreeBASE names to external databases, BLASTing sequences in TreeBASE to get accession numbers, etc.). This is a lot of work, and I wonder about scalability and involvement. In other words, can it cope with the amount of data and trees we generate, and how do we get people to contribute? So, here are a few different (not necessarily exclusive) approaches.

Use TreeBASE as a seed and continue to grow that database, adding extensive annotations and cross links. Time consuming, but potentially very powerful, especially if the data is dumped into a triple store and cool ways to query it are developed.

Googolise everything
Use Google to crawl for NEXUS files (e.g., "#nexus" "begin data" format dna), extract them and put them into a database. Use string matching and BLAST to figure out what the files are about.
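
As a rough idea of what the string matching step might look like, here is a minimal Python sketch that checks whether a downloaded file looks like NEXUS and pulls out the block names and the declared data type. The function name sniff_nexus and the shape of the result are my own inventions for illustration, and it only looks at the "#NEXUS" header, "BEGIN ...;" lines, and "DATATYPE=..."; the BLAST step is not shown.

```python
import re


def sniff_nexus(text):
    """Return a small summary of a candidate NEXUS file, or None if it
    does not start with the #NEXUS header.

    Hypothetical helper: it does not parse the file, it only pattern
    matches on the header, the block names, and the DATATYPE setting.
    """
    if not text.lstrip().lower().startswith("#nexus"):
        return None
    # Block names, e.g. "BEGIN DATA;" or "begin trees;"
    blocks = re.findall(r"begin\s+(\w+)\s*;", text, flags=re.IGNORECASE)
    # First declared data type, e.g. "FORMAT DATATYPE=DNA"
    match = re.search(r"datatype\s*=\s*(\w+)", text, flags=re.IGNORECASE)
    return {
        "blocks": [b.lower() for b in blocks],
        "datatype": match.group(1).lower() if match else None,
    }
```

Anything that comes back with a "data" or "characters" block and a datatype of "dna" would be a candidate for BLASTing; files with only a "trees" block would need the taxon names matched instead.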

Phylogeny news
Monitor NCBI and journal RSS feeds. When new sequences or papers appear, extract popsets, use or build alignments, compute trees quickly, and whack them into a database. The interface is something like Postgenomic (maybe using the same clustering algorithms to link related stories), or even cooler, newsmap.
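
The monitoring part could start out as something like the Python sketch below, which takes the XML of a feed and returns only the items not seen before. It assumes a plain RSS 2.0 feed with un-namespaced item, title, and link elements (RSS 1.0/RDF feeds use namespaces and would need different handling), and the name new_feed_items is made up for this example. Fetching the feed, extracting popsets, and building trees are all left out.

```python
import xml.etree.ElementTree as ET


def new_feed_items(rss_xml, seen_links):
    """Return feed items whose link is not already in seen_links.

    rss_xml is the feed as a string; seen_links is a set that is
    updated in place so that repeated polling only reports new items.
    Assumes RSS 2.0 style <item><title/><link/></item> elements.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append({"title": title, "link": link})
    return fresh
```

Each new item would then be handed to whatever does the real work (looking up the sequences, aligning, and computing a quick tree).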

Connotea style

Inspired by projects like Connotea, perhaps the trick is to mobilise the community by lowering the barrier to entry. Instead of aiming for a carefully curated database, what if people could upload just the basics: some sort of identifier for the paper, such as a DOI or a PubMed id, and one or more trees in Newick format? I think this is what Wayne Maddison was suggesting when we chatted at the CIPRES meeting (see my earlier post about it) -- if Wayne didn't suggest this, then my apologies. The idea would be that people could upload the bare minimum, but be able to annotate, comment, link, etc. Behind the scenes we have scripts to look up whatever identifiers we have and extract relevant metadata.
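
To make "the bare minimum" concrete, here is a hedged Python sketch of what accepting such an upload might involve: decide whether the identifier looks like a DOI or a PubMed id, do a cheap sanity check on each Newick string, and pull out the leaf labels so they can be matched against taxonomic names later. All of the names (identifier_type, leaf_labels, make_submission) and the record layout are hypothetical. The label extraction is a regular expression, not a real Newick parser, so it ignores quoted labels and comments and finds nothing in a single-taxon tree.

```python
import re

# Loose patterns: a DOI starts with "10.", a PubMed id is all digits.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_RE = re.compile(r"^\d+$")
# A leaf label follows "(" or "," directly; internal node labels follow ")".
LEAF_RE = re.compile(r"[(,]\s*([^(),:;\s][^(),:;]*)")


def identifier_type(identifier):
    """Classify an identifier as 'doi' or 'pmid', or raise ValueError."""
    if DOI_RE.match(identifier):
        return "doi"
    if PMID_RE.match(identifier):
        return "pmid"
    raise ValueError("not a DOI or PubMed id: %r" % identifier)


def leaf_labels(newick):
    """Return leaf labels from an unquoted Newick string, in order."""
    newick = newick.strip()
    if not newick.endswith(";") or newick.count("(") != newick.count(")"):
        raise ValueError("does not look like a Newick tree: %r" % newick)
    return [label.strip() for label in LEAF_RE.findall(newick)]


def make_submission(identifier, newick_trees):
    """Build the minimal record: identifier, trees, and the taxa they mention."""
    identifier = identifier.strip()
    taxa = sorted({label for tree in newick_trees for label in leaf_labels(tree)})
    return {
        "id": identifier,
        "id_type": identifier_type(identifier),
        "trees": list(newick_trees),
        "taxa": taxa,
    }
```

For example, make_submission("10.1000/example", ["((A:0.1,B:0.2):0.3,C:0.4);"]) (a made-up DOI) gives a record with id_type "doi" and taxa A, B, and C. The metadata lookup and the annotation layer would sit on top of records like this.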


Simon Rycroft said...

Have you tried searching for NEXUS files using Google's filetype option?

Search for:

'filetype:nex begin'
'filetype:nex format dna'

Rod Page said...

Now why didn't I think of that (doh!). I guess because in the advanced search page for Google they list some formats, such as PDF and DOC, and I'd assumed the filetype search was limited to those files. As they say, "assumption is the mother of all f***ups".
Now all we need is a script to grab the files and make something. Tree of Life database, here we come ;-)