A hierarchical query can be visualised as a range query. For example, the diagram below shows a classification where the descendants of Node A correspond to the range 1-8 (this is a simplification of visitation numbers, see my earlier post, and also Chen et al. doi:10.1186/1471-2148-8-90). The three trees can be represented as the ranges 3-8, 6-7, and 2-4, respectively. To find trees for the taxon corresponding to Node A we look for trees whose range intersects 1-8, to find trees corresponding to Node B we look for trees whose range intersects 1-5.
This approach retrieves a list of all trees that include a given taxon, but potentially this list could be very large (for example, a query for all plant trees could return 1000's of trees). So the question becomes how to order these trees? Some ideas:
- order by the number of taxa in the tree.
- order by the size of the range ("span") of the tree.
- order by set inclusion.
Ordering by size seems attractive, but will favour trees with more taxa over those with a greater taxonomic spread (which is favoured by ordering by the span of the tree). I used span in my challenge entry to display papers with related taxa.
What I have in mind for ordering by set inclusion is constructing a directed graph where each node represents a tree, and a pair of nodes (x, y) is connected by an edge x → y if the taxa in tree y are nested inside the taxa in tree x. We could also introduce some additional nodes corresponding to nodes in the classification. If we then topologically sort the graph we have a linear order for the trees. Given that this order could be pre-computed independently of any queries (in much the same way that PageRank is), it could make for faster queries.
It would be useful to explore these and other ordering criteria. Perhaps the best approach would be some measure which combines one or more criteria, in which case we might want to use some form of rank aggregation (see iSpecies clones, and taxonomic intelligence for some links to the relevant literature).