iPhylo: What would people want/expect from taxon searching on TreeBASE?

Roderic D. M. Page

Sunday, March 01, 2009

What would people want/expect from taxon searching on TreeBASE?

Rutger Vos asked on Twitter "What would people want/expect from taxon searching on TreeBASE?". This is a good question, and one which motivated the work I did on TBMap (see doi:10.1186/1471-2105-8-158), which developed a mapping between TreeBASE taxa and other databases. In that paper I published a table showing the effectiveness of string and hierarchical queries of TreeBASE. A string query returned just the studies that included the query term, a hierarchical query returned all studies that contained taxa that descend from a given node in the NCBI taxonomy.

A hierarchical query can be visualised as a range query. For example, the diagram below shows a classification where the descendants of Node A correspond to the range 1-8 (this is a simplification of visitation numbers, see my earlier post, and also Chen et al. doi:10.1186/1471-2148-8-90). The three trees can be represented as the ranges 3-8, 6-7, and 2-4, respectively. To find trees for the taxon corresponding to Node A we look for trees whose range intersects 1-8, to find trees corresponding to Node B we look for trees whose range intersects 1-5.

This approach retrieves a list of all trees that include a given taxon, but potentially this list could be very large (for example, a query for all plant trees could return 1000's of trees). So the question becomes how to order these trees? Some ideas:

order by the number of taxa in the tree.
order by the size of the range ("span") of the tree.
order by set inclusion.

Ordering by size seems attractive, but will favour trees with more taxa over those with a greater taxonomic spread (which is favoured by ordering by the span of the tree). I used span in my challenge entry to display papers with related taxa.

What I have in mind for ordering by set inclusion is constructing a directed graph where each node represents a tree, and a pair of nodes (x, y) is connected by an edge x → y if the taxa in tree y are nested inside the taxa in tree x. We could also introduce some additional nodes corresponding to nodes in the classification. If we then topologically sort the graph we have a linear order for the trees. Given that this order could be pre-computed independently of any queries (in much the same way that PageRank is), it could make for faster queries.

It would be useful to explore these and other ordering criteria. Perhaps the best approach would be some measure which combines one or more criteria, in which case we might want to use some form of rank aggregation (see iSpecies clones, and taxonomic intelligence for some links to the relevant literature).