Tuesday, January 16, 2007

A manifesto

The funding of pPOD mentioned earlier today motivates me to write some notes on what I think "core database technologies for enabling the integration of AToL data" could, or indeed, should be about. Much of what follows I've mentioned elsewhere on the iPhylo blog (for example here and related blogs SemAnt and iSpecies) but it seems useful to bring this together here.

It's not about algorithms
CIPRES seems to be to be basically about putting "go faster" stripes on phylogenetic algorithms. I don't mean this to be as dismissive as it sounds, this is intellectually challenging stuff. It's just that projects such as TreeBASE seem adrift in that environment. See my CIPRES talk for more.

It's about integration
Integrating biodiversity information is a hot topic, and I think this is were most of easy stuff is. Easy in the sense that much of the intellectual work has been done, most of it elsewhere. We have GUIDs (e.g., DOIs and LSIDs), RDF, triple stores, and query languages (e.g., SPARQL). Providing we avoid getting embroiled and/or bogged down in some of the details, I think this stuff is technically pretty straightforward. I've been experimenting on some of these ideas in the context of a project integrating information on ants, see my SemAnt blog for a record of this work. There are also related posts on iSpecies.


It's about new queries
My own view is that much of the work on tree searching in TreeBASE, for example by Jason Wang and collaborators — while interesting — is misplaced. I don't get the sense that biologists are really interested in asking the question "find me trees like this". Rather, I think biologists are really interested in questions such as "find me trees that have x more closely related to y than to z", or "find me trees in which group x is/is not monophyletic". I think these are pattern matching queries, or more fundamentally, I think they are all in essence least common ancestor (LCA) queries. Indeed, once stripped of all the rhetoric about bringing classification into the 21st century (and the nonsense about renaming species), the phylocode boils down to named LCA queries.

What we also need are queries that deal with geography and time. Work on interval queries seems relevant here. Ideally we'd move beyond GIS queries to pattern matching geographically-labelled trees (finally providing tools for cladistic biogeography).

It's about branch lengths
Brian O'Meara's comments on thhis blog reminded me that I'd forgotten about edge (=branch) lengths. Although these are implicit in my discussion of chronograms (see below), as Brian notes:
While systematists may be most interested in relatedness, many biologists will use trees for investigating trait evolution or as ways to control for phylogenetic relatedness (contrasts), and for this they need branch lengths.

Brian also mentions the potential storage issue for Bayesian trees, i..e storing the results of MCMC runs. If we want to store only topologies, it also seemed to me that there might be some clever ways to store only the difference between successive trees, given that each tree is a perturbation of the previous one (e.g., a NNI). Storing edge lengths complicates this, although they too are related to those in the previous tree. Is there a smart way to store these things, or do we just gzip the tree file and stick it on a server?

It's about new visualisations
Bill Piel's work on putting phylogenies in Google Earth, and related work by Daniel Janies et al. (coming out soon in Systematic Biology) show the potential of geographic visualisation.

Earlier on this blog, I noted the parallel between genome browsers and chronograms.


Continuing the theme of visualising phylogenies, one thing which strikes me is the parallel between genome browsers that display annotation "tracks" (such as the UCSC Genome Browser) and illustrations of "chronograms" with geological periods and accompanying data, such as sea levels, isotope levels, etc. In my haste I couldn't find an example with a sea-level track, but I know they exist … In both cases there is a natural co-ordinate system (genome location and time, respectively) going from left to right, and annotations that can be added using the same frame of reference.


Dating phylogenetic trees is currently "hot", but phylogeny databases don't support dated trees.

It's about collaboration
I think there are lots of tools being developed elsewhere, such as Connotea, Flickr, and EditGrid that can be utilised (or used as sources of inspiration). These provide tools for managing bibliographic data, images, and spreadsheets. Let's not reinvent this stuff. For example, Connotea can be integrated with TreeBASE, Flickr can be used to store images with metadata, and EditGrid can be used to create collaborative data matrices, as well as simple annotations. And, speaking of annotations, blogs seem to provide ideal tools for this.
My point here is developing domain-specific tools for this stuff seems to me to be a huge mistake.

In summary, my own perspective is that one way to tackle this problem is to take advantage of the swarm of community-driven, open API, folksonomy-based tools that are flooding the web.