Tuesday, January 30, 2007

Phylocode

Comments by David Marjanović elsewhere on this blog (here and here) about TreeBASE, classification and Phylocode have prompted me to write a little bit about why I'm underwhelmed by the Phylocode. Suppose I have the question:
"find me all studies in TreeBASE that contain birds?"

How do I answer this? Well, my approach is to do the following. First, I attempt to map every name in TreeBASE onto a name in an external database, such as NCBI Taxonomy, uBio, etc. Once I've done this, I download the NCBI taxonomic classification, and use it to query the mapped names.

Querying a classification
The basic idea for querying trees is nicely explained by Aaron Mackey in his article Relational Modeling of Biological Data: Trees and Graphs, and I've used this approach elsewhere in the Glasgow Taxonomic Name Server. You take the tree, compute left and right visitation numbers, and then use those numbers to query the database. For more details take a look at my notes here. For example, given the tree below (taken from Aaron's article) let's imagine that node 4 (shown in blue in the diagram) is "Aves".

Then, if I search for nodes with a left_id > 3, and a right_id < 10, I get nodes 10, 11, and 12. These are the birds. So, I find each name in TreeBASE that maps onto these nodes, and those are the birds in TreeBASE.
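The left/right numbering idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the toy classification, the function names number_tree and descendants, and the resulting numbers are all made up for the example and do not correspond to the tree in Aaron's article or to the real NCBI data.

```python
def number_tree(children, root):
    """Assign left/right visitation numbers by a depth-first traversal."""
    left, right = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left[node] = counter          # number given on the way down
        for child in children.get(node, []):
            visit(child)
        counter += 1
        right[node] = counter         # number given on the way back up

    visit(root)
    return left, right


def descendants(left, right, node):
    """Nodes whose (left, right) interval nests strictly inside the node's interval."""
    return sorted(
        n for n in left
        if left[n] > left[node] and right[n] < right[node]
    )


# Hypothetical toy classification (parent -> list of children).
children = {
    "Amniota": ["Mammalia", "Aves"],
    "Mammalia": ["Homo"],
    "Aves": ["Passer", "Gallus", "Struthio"],
}

left, right = number_tree(children, "Amniota")
birds = descendants(left, right, "Aves")
print(left["Aves"], right["Aves"], birds)
```

In a relational database the same test becomes a single WHERE clause on the two numeric columns, which is why the numbering is worth precomputing.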

Phylogenetic nomenclature
How would I do this using phylogenetic nomenclature? Well, I can't just use the specifiers of a name, because there is no guarantee that those specifiers will be in the tree. For example, Paul Sereno's Taxon Search site lists the definition of "Aves" as:
The least inclusive clade containing Archaeopteryx lithographica Meyer 1861 and Passer domesticus (Linnaeus 1758).

Neither taxon occurs in TreeBASE to date! So, how would I search for birds (and be confident that I can retrieve all the studies on birds)?
Well, if I had a larger tree (such as a supertree or, say, the NCBI classification) that had the specifiers (i.e., in this case it had Archaeopteryx and the sparrow), I could find the least common ancestor (LCA) of those two taxa on the tree. That node would be "Aves", and then I could use the technique described above to find all studies on birds.
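To make the LCA step concrete, here is a minimal sketch, again in Python. The reference tree, the internal node labels (node_A, node_B, root) and the helper names path_to_root, lca and in_clade are hypothetical; a real version would run over a supertree or the NCBI classification, neither of which is reproduced here.

```python
def path_to_root(parent, node):
    """Return the node followed by all of its ancestors, ending at the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(parent, a, b):
    """Least common ancestor: first node on b's path that is also on a's path."""
    ancestors_of_a = set(path_to_root(parent, a))
    for node in path_to_root(parent, b):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa do not share a root")


def in_clade(parent, taxon, clade_node):
    """A taxon belongs to the clade if the clade node lies on its path to the root."""
    return clade_node in path_to_root(parent, taxon)


# Hypothetical reference tree stored as child -> parent.
parent = {
    "Archaeopteryx lithographica": "node_A",
    "node_B": "node_A",
    "Passer domesticus": "node_B",
    "Gallus gallus": "node_B",
    "node_A": "root",
    "Crocodylus niloticus": "root",
}

# Node-based definition: least inclusive clade containing both specifiers.
aves = lca(parent, "Archaeopteryx lithographica", "Passer domesticus")
print(aves, in_clade(parent, "Gallus gallus", aves))
```

Once the LCA node is known, it plays exactly the role that the labelled "Aves" node plays in the classification query.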

Conclusion
So, I'm underwhelmed. In practice the approach is the same: use a large tree, locate the node from which all your taxa descend, and find studies with those taxa. Using a classification such as NCBI, I have a large tree complete with labelled internal nodes. Hence, to find all studies containing birds I find the node labelled "Aves" and do the query. To use phylogenetic names, I also need a large tree, and I need to look up the LCA of the specifiers, then do the query in the same way. So, in practice the difference is minor, although for phylogenetic names there is the issue of which tree to use. One could argue that the classification approach is ready to go: just grab the NCBI tree.

Now, there are problems of course. For palaeontologists, the nearly complete lack of extinct taxa in the NCBI tree is a problem, because unless a TreeBASE study has at least one extant taxon that has also been sequenced, the approach I've outlined above won't work. But, bottom line, I don't see how in practice we get away from needing a large tree in order to sensibly query TreeBASE. In which case, the Phylocode makes little substantive difference, contra some of David's comments.