Tuesday, January 30, 2007


Comments by David Marjanović elsewhere on this blog (here and here) about TreeBASE, classification and Phylocode have prompted me to write a little bit about why I'm underwhelmed by the Phylocode. Suppose I have the question:
"find me all studies in TreeBASE that contain birds"?

How do I answer this? Well, my approach is to do the following. Firstly, I attempt to map every name in TreeBASE onto a name in an external database, such as NCBI Taxonomy, uBio, etc. Once I've done this, I download the NCBI taxonomic classification, and use it to query the mapped names.

Querying a classification
The basic idea for querying trees is nicely explained by Aaron Mackey in his article Relational Modeling of Biological Data: Trees and Graphs, and I've used this approach elsewhere in the Glasgow Taxonomic Name Server. You take the tree, compute left and right visitation numbers, and then use those numbers to query the database. For more details take a look at my notes here. For example, given the tree below (taken from Aaron's article) let's imagine that node 4 (shown in blue in the diagram) is "Aves".

Then, if I search for nodes with a left_id > 3, and a right_id < 10, I get nodes 10, 11, and 12. These are the birds. So, I find each name in TreeBASE that maps onto these nodes, and those are the birds in TreeBASE.
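As a rough illustration of the left/right numbering idea, here is a minimal Python sketch. It does not reproduce the tree from Aaron's article; the toy classification, the study identifiers (study_A and so on) and the function names are all made up for the example.

```python
def number_tree(children, root):
    """Assign (left, right) visitation numbers by depth-first traversal."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter              # number given on the way down
        for child in children.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)  # number given on the way back up

    visit(root)
    return bounds


def descendants(bounds, name):
    """Nodes whose interval lies strictly inside the interval of `name`."""
    left, right = bounds[name]
    return sorted(n for n, (l, r) in bounds.items() if l > left and r < right)


# Hypothetical toy classification (parent -> children), not the NCBI tree.
children = {
    "Amniota": ["Mammalia", "Aves"],
    "Mammalia": ["Homo sapiens"],
    "Aves": ["Gallus gallus", "Passer domesticus", "Struthio camelus"],
}

# Hypothetical studies, each listing names already mapped onto the tree.
studies = {
    "study_A": {"Homo sapiens"},
    "study_B": {"Gallus gallus", "Homo sapiens"},
    "study_C": {"Struthio camelus"},
}

bounds = number_tree(children, "Amniota")
birds = set(descendants(bounds, "Aves"))
bird_studies = sorted(s for s, taxa in studies.items() if taxa & birds)
print(bounds["Aves"], sorted(birds), bird_studies)
```

In a relational database the same test becomes a WHERE clause on the two numeric columns, which is what makes the approach attractive: finding a whole subtree needs no recursion.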

Phylogenetic nomenclature
How would I do this using phylogenetic nomenclature? Well, I can't just use the specifiers of a name, because there is no guarantee that those specifiers will be in the tree. For example, Paul Sereno's Taxon Search site lists the definition of "Aves" as:
The least inclusive clade containing Archaeopteryx lithographica Meyer 1861 and Passer domesticus (Linnaeus 1758).

Neither taxon occurs in TreeBASE to date! So, how would I search for birds (and be confident that I can retrieve all the studies on birds)?
Well, if I had a larger tree (such as a supertree or, say, the NCBI classification) that contained the specifiers (in this case, Archaeopteryx and the sparrow), I could find the least common ancestor (LCA) of those two taxa on the tree. That node would be "Aves", and I could then use the technique described above to find all studies on birds.
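The LCA step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is a made-up toy stored as child-to-parent pointers, the internal node names (node_A, node_B, root) are placeholders, and the topology is not meant as a statement about real relationships.

```python
def path_to_root(parent, node):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(parent, a, b):
    """Least common ancestor: first node on b's path that is also above a."""
    ancestors_of_a = set(path_to_root(parent, a))
    for node in path_to_root(parent, b):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa are not in the same tree")


def clade_leaves(parent, node):
    """All leaves that descend from `node`."""
    leaves = set(parent) - set(parent.values())
    return sorted(leaf for leaf in leaves if node in path_to_root(parent, leaf))


# Hypothetical toy tree (child -> parent); internal node names are placeholders.
parent = {
    "Archaeopteryx lithographica": "node_A",
    "node_B": "node_A",
    "Passer domesticus": "node_B",
    "Gallus gallus": "node_B",
    "node_A": "root",
    "Alligator mississippiensis": "root",
}

# Apply the definition: least inclusive clade containing both specifiers.
aves_node = lca(parent, "Archaeopteryx lithographica", "Passer domesticus")
aves_members = clade_leaves(parent, aves_node)
print(aves_node, aves_members)
```

Once the node is found, everything downstream is the same subtree query as before; the definition only replaces the step of looking up a node by its label.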

So, I'm underwhelmed. In practice the approach is the same: use a large tree, locate the node from which all your taxa descend, and find studies with those taxa. Using a classification such as NCBI, I have a large tree complete with labelled internal nodes. Hence, to find all studies containing birds I find the node labelled "Aves" and do the query. To use phylogenetic names, I also need a large tree, and I need to look up the LCA of the specifiers, then do the query in the same way. So, in practice the difference is minor, although for phylogenetic names there is the issue of what tree to use. One could argue that the classification approach is ready to go - just grab the NCBI tree.

Now, there are problems of course. For palaeontologists, the nearly complete lack of extinct taxa in the NCBI tree is a problem, because unless a TreeBASE study has at least one extant taxon that has also been sequenced, the approach I've outlined above won't work. But, bottom line, I don't see how in practice we get away from needing a large tree in order to sensibly query TreeBASE. In which case, the Phylocode makes little substantive difference, contra some of David's comments.


David Marjanović said...


let's imagine that node 4 (shown in blue in the diagram) is "Aves".

But why? How do we know that this node is Aves?

Maybe by applying a phylogenetic definition to a supertree yet again?

At least, under the PhyloCode, there will be one node or branch that will be Aves. Without it, everyone can use that name as they damn well please, and indeed several different usages of Aves exist in the current literature. Do I take Aves sensu Chiappe (the definition Sereno cites)? Do I take Aves sensu Gauthier and friends (the crown-group -- with Vultur gryphus, the Andean condor, as a specifier, which probably isn't in TreeBASE either)? Do I use an apomorphy, like Benton does (flight, wing feathers, whatever), or a traditional concept (an unspecified number of a list of diagnostic character states is to be present)? Sure, if we restrict ourselves to the living, all those concepts certainly describe the exact same contents. But not everyone is a neontologist. Without the PhyloCode, the proper answer to "find me all birds" is "what do you mean by 'bird'". With it, we will still get answers like "it's unclear if Rahonavis is a bird", but at least such uncertainties will reflect uncertainties about the phylogeny, and nothing but the phylogeny, 1 : 1. -- If we take another example than the birds, it becomes clear that the division is not between palaeo- and neontologists, but between clades with certain vs uncertain contents. As an example I can offer Lophotrochozoa. Do the flatworms belong or not?

One could argue that the classification approach is ready to go - just grab the NCBI tree.

You will be hard-pressed to find one ornithologist who uses exactly the NCBI classification. Every systematist who uses Linnaean nomenclature has their own classification and modifies it whenever they see fit. I have never seen two publications that use the exact same classification for the same organisms, no matter which ones, and no matter if they agree on the phylogeny. If you, in effect, want to impose the NCBI classification on everyone, most will stage a rebellion against this infringement of their taxonomic freedom (a term that is among the principles of the ICZN), just like how so many people dislike DNA barcoding -- regardless of its feasibility -- for imposing a phenetic species concept on everyone. That's different under phylogenetic nomenclature -- if applied to a phylogeny, phylogenetic definitions tell you what belongs to which taxon under the assumption of that particular phylogeny. Half of the confusion is taken away; splitting and lumping are abolished.

In sum, I seem to agree with you: to answer "find me all TreeBASE studies that include birds", we need a supertree and some kind of definition, and the algorithm does not differ. It's just that the PhyloCode will provide such a definition for (eventually) every clade name, while currently we must choose between different classifications, even if they are all derived from the same phylogeny.

The problem stays as it is. Phylogenetic nomenclature just makes it smaller.

Rod Page said...

Regarding "If you, in effect, want to impose the NCBI classification on everyone, most will stage a rebellion against this infringement of their taxonomic freedom", I'm not too worried because we can use another classification if we prefer, and I doubt there will be a rebellion -- does anybody not use GenBank because their "taxonomic freedom" is infringed? We might grumble about it, but we use it.

Yes, I know NCBI taxonomy has major problems, and indeed I've looked at ways to accommodate alternatives in a paper (doi:10.1186/1471-2105-6-208) I blogged here.

I don't really want to get into a squabble about the merits of Phylocode. It's just that right now, it doesn't offer me a great deal. I should also point out that I view classification as purely a tool for navigating and finding things. The whole "classification reflects phylogeny" debate doesn't do much for me.

However, I hope I'm still being open minded. What would be very cool is if there was a way to bypass the supertree bit. Can we have a set of trees like those in TreeBASE, perhaps linked in phylogenetic groves (see work by Cécile Ané here and here), perhaps draw on something like Bender et al.'s work on the problem of LCAs in directed acyclic graphs (doi:10.1016/j.jalgor.2005.08.001), and use phylogenetic names to find trees? That would be an interesting question, perhaps more in the spirit of tree surfing.

David Marjanović said...

Ignorant question: The NCBI classification isn't comprehensive, is it? Does it include all 10,000 species of extant birds? If not, we can't escape the supertree problem by just using that classification.

Thanks for the many links, I'll read them ASAP.

Rod Page said...

No, it's not comprehensive. As of today it has 8,238 avian taxa, a good number of which are higher taxa (take a look here). The leaves of the NCBI tree are just those taxa that have been sequenced. However, for querying TreeBASE, all we need is that at least one of the birds in any given study has been sequenced, and we can find it.

The real "problem" studies would be those that do not include any sequenced taxa, for example, a study of entirely extinct taxa. Trevor Cotton's study of blind trilobites (TreeBASE study S723, doi:10.1111/1475-4983.00176) is a case in point.