Sunday, August 28, 2011

Tree of Life 0.1 - annotating the NCBI taxonomy

Last week I was at the NSF "Assembling, Visualising and Analysing the Tree of Life" Ideas Lab, run by KnowInnovation.com. It was an interesting experience, essentially a structured week of brainstorming ideas.

One thing I came away with is the feeling that our notions of the "tree of life" are fuzzy, contradictory, and probably often unobtainable. It's tempting to imagine all sorts of wonderful visualisations, and lose sight of building something that is useful. Perhaps it's time instead to think of "Tree of Life version 0.1".

Imagine taking the NCBI taxonomy as a starting point. Yes, it's incomplete, and it has almost no fossils, but it's freely available and linked to a lot of data. Let's use a Google Maps-like viewer along the lines I explored earlier this year.

Then add annotation "tracks" to the tips. As a first pass these could be taken from the NCBI LinkOut service, such as the NCBI-Wikipedia mapping (http://iphylo.org/linkout).

[Image: Ncbi 1]

The NCBI tree is a classification rather than a phylogeny, so we could add greater phylogenetic content by linking to phylogenetic databases, such as TreeBASE and PhyLoTA. Imagine clicking on a node in the NCBI taxonomy and seeing a display of all the phylogenies centred on that node:

[Image: Ncbi 02]
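One way to decide which node a phylogeny is "centred" on is to take the deepest node in the classification that contains every taxon in that phylogeny. The sketch below is only an illustration of that idea: the parent map, the tax_ids, and the tree names are toy values I've made up, and the function names are hypothetical. (In the real NCBI dump the root is its own parent; here the root simply has no entry.)

```python
def lineage(tax_id, parent):
    """Return the path from tax_id up to the root, inclusive."""
    path = [tax_id]
    while parent.get(tax_id) is not None:
        tax_id = parent[tax_id]
        path.append(tax_id)
    return path


def centre_node(tax_ids, parent):
    """Deepest node that is an ancestor of (or equal to) every tax_id.

    Assumes tax_ids is a non-empty list and all ids share one root.
    """
    common = None
    for t in tax_ids:
        ancestors = set(lineage(t, parent))
        common = ancestors if common is None else common & ancestors
    # Walk up from the first taxon; the first shared node is the deepest.
    for node in lineage(tax_ids[0], parent):
        if node in common:
            return node


# Toy classification: child tax_id -> parent tax_id (made-up ids).
parent = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3}

# Toy phylogenies, each reduced to the tax_ids of its tips.
trees = {"tree_A": [4, 5], "tree_B": [4, 6], "tree_C": [6]}

# Index the phylogenies by the node they are centred on.
centred_on = {}
for name, tips in trees.items():
    centred_on.setdefault(centre_node(tips, parent), []).append(name)

print(centred_on.get(2, []))  # phylogenies to show when node 2 is clicked
```

Clicking on a node in the viewer then becomes a single lookup in `centred_on`, keyed by that node's tax_id.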

Now we have a way to navigate a large tree, view annotations, and display phylogenetic trees. All of this could be done fairly easily. The key is to have services keyed by the NCBI tax_id used to identify nodes on the tree.
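To make "services keyed by the NCBI tax_id" a little more concrete, here is a minimal sketch of a set of annotation tracks held in memory. The class and method names are hypothetical, and the track names and values are purely illustrative; a real version would sit behind a web service rather than a dictionary.

```python
from collections import defaultdict


class TrackRegistry:
    """Annotation tracks, each mapping an NCBI tax_id to a value."""

    def __init__(self):
        # track name -> {tax_id -> annotation}
        self._tracks = defaultdict(dict)

    def annotate(self, track, tax_id, value):
        """Attach a value to a node in the named track."""
        self._tracks[track][tax_id] = value

    def lookup(self, tax_id):
        """Return every track's annotation for one node."""
        return {
            name: values[tax_id]
            for name, values in self._tracks.items()
            if tax_id in values
        }


registry = TrackRegistry()
# 9606 is the NCBI tax_id for Homo sapiens; the annotations are examples.
registry.annotate("wikipedia", 9606, "https://en.wikipedia.org/wiki/Human")
registry.annotate("habitat", 9606, "terrestrial")

print(registry.lookup(9606))
```

The point of the design is that a new track needs nothing from the viewer except the tax_id, so anyone can add one without touching the tree itself.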

The next steps would include adding further "tracks", perhaps based on curated links analogous to the wiki-based NCBI-Wikipedia mapping. For example, we could add very basic habitat data (marine or terrestrial), geography, or host relationships (which could be based in part on the data already in GenBank).

Given that the NCBI tree continues to grow, subsequent versions could be released as the tree changes. Or we could "fork" the NCBI tree and start to refine it based on phylogenetic information, and add taxa that aren't in the genome databases (these taxa will need consistent identifiers so we can map annotations on to them as well). Perhaps we could use something like Git to manage this tree, and to handle the necessary merging of updated versions of the NCBI tree. People could edit the tree, or indeed fork it and come up with their own.
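Whatever tool ends up managing the versions, merging an updated NCBI tree into a fork starts with knowing what changed. As a rough sketch (this is not how Git works internally, and the function name and tax_ids are made up for illustration), two versions of the tree stored as child-to-parent maps can be compared like this:

```python
def diff_taxonomy(old, new):
    """Compare two child -> parent maps keyed by tax_id."""
    old_ids, new_ids = set(old), set(new)
    return {
        # Nodes present only in the new version.
        "added": sorted(new_ids - old_ids),
        # Nodes that have disappeared from the new version.
        "removed": sorted(old_ids - new_ids),
        # Nodes in both versions whose parent has changed.
        "moved": sorted(t for t in old_ids & new_ids if old[t] != new[t]),
    }


# Toy versions of the tree (made-up tax_ids).
old_version = {2: 1, 3: 1, 4: 2, 5: 2}
new_version = {2: 1, 3: 1, 4: 3, 6: 2}

print(diff_taxonomy(old_version, new_version))
```

Nodes in "moved" are the interesting ones for a fork: if we have also re-parented the same node on phylogenetic grounds, that is a conflict someone needs to resolve.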

There are lots of ways to visualise trees (see TreeVis.net for some great examples), but what I'm after is a tool that is useful, that gives us a sense of what we know and what we don't. I suspect that one of the reasons we've struggled with visualising the tree of life is that there are lots of different notions about what it's for. In this case, I want a tool to navigate data about organisms, one that we can easily add annotations to.