Wednesday, March 18, 2009

London Calling

Busy day yesterday, giving two talks, one at The Natural History Museum, one at the British Library. Slides for the NHM talk are below. Karen James pointed out the irony that a talk where I gave the NHM a hard time for being backward about embracing digitisation can't be viewed on most PCs at the NHM because SlideShare requires a recent version of Flash (which users can't install without IT's permission), and the downloaded presentation won't open because the NHM uses an older version of MS Office. So much for my attempts to share the slides. There will also be a video available at some point.

The second presentation was in the British Library's "Talk Science" series; for some background see the forum on Nature Network. There will be a podcast available of this presentation. In her introduction to my talk, Sarah Kemmitt quoted from a recent paper by Antonio G. Valdecasas ([JACC]1175-5326:1820@41), where he described Vagabundia sci:
Vagabundia comes from the Spanish word 'vagabundo' that means 'wanderer'. It is a feminine substantive; sci refers to Science Citation Index. We pointed out some time ago (Valdecasas et al. 2000) that the popularity of the Science Citation Index (SCI) as a measure of ‘good’ science has been damaging to basic taxonomic work. Despite statements to the contrary that SCI is not adequate to evaluate taxonomic production (Krell 2000), it is used routinely to evaluate taxonomists and prioritize research grant proposals. As with everything in life, SCI had a beginning and will have an end. Before it becomes history, I dedicate this species to this sociological tool that has done more harm than good to taxonomic work and the basic study of biodiversity. Young biologists avoid the 'taxonomic trap' or becoming taxonomic specialists (Agnarsson & Kuntner 2007) due to the low citation rate of strictly discovery-oriented and interpretative taxonomic publications. Lack of recognition of the value of these publications, makes it difficult for authors to obtain grants or stable professional positions.

My own feeling is that SCI probably does a reasonable job of ranking the impact of taxonomic publication; the real task is to broaden our notion of what gets cited.

Friday, March 06, 2009

Phylogenies in a wiki

I'm slowly trying to get phylogenies into the wiki that I'm playing with. Here's an early example, the TreeBASE tree T6002, from the study A Phylogenomic Study of Birds Reveals Their Evolutionary History. The tree is displayed using my tvwidget. Below the tree, the OTUs are listed in a crude table. The idea is that this table will contain a mapping between OTU labels and taxa. For example, one OTU is labelled "Diomedea nigripes". This links to the page for Diomedea nigripes, which provides some basic information on this name, including the statement that ITIS regards the correct name for this bird to be Phoebastria nigripes (Audubon, 1839). The table on the page showing the tree displays this information as well.

This is all terribly incomplete and crude, but it gives a sense of where this is going. The plan is to import in bulk the trees and the mappings (from, say, TBMap), as well as the names themselves, and associated literature (including the TreeBASE studies) and then the trees will be embedded in richer data about the taxa.

Photosynth for trees - supertrees revisited

It's Friday, so time for some random, half-baked ideas. Imagine that we have a database of evolutionary trees, and these overlap for a set of taxa that we are interested in. How do we summarise these trees? One approach is to make a supertree. It would be useful to display the subtrees that went into making this supertree, if only to give an idea of how much they agree with the supertree. How to do this?

One idea I've been toying with is inspired by Photosynth, from Microsoft labs (it only runs on Windows, sigh). Photosynth takes a series of pictures taken from different angles and stitches them together into a 3D model of the object being photographed:

One thing I like about Photosynth is that you can see the original pictures, so when you move around the view you get a sense of how they have contributed to the overall view. This is easier to see than explain:

Now, imagine if we did this with trees. We could create a supertree as a summary of the individual trees, then have the original trees layered on top. Perhaps we could do this in 3D, so that each individual tree is in a plane that is tilted with respect to the supertree in proportion to how much it disagrees with the supertree:

I think this could be a fun way to explore a set of trees, and it would give one the ability to quickly grasp how well the source trees agree with the supertree. Note that I'm not (necessarily) arguing that the supertree represents the true phylogeny. Think of it as a convenient way to summarise the individual trees.

Part of what attracts me to this approach is that I think most, if not all, 3D phylogeny viewers (such as Paloverde and the Wellcome tree of life) don't make any real use of 3D, beyond the rather gimmicky (and I find ultimately confusing) ability to fly around a 2D tree. Is there a better way to exploit the possibilities of 3D?

Sunday, March 01, 2009

What would people want/expect from taxon searching on TreeBASE?

Rutger Vos asked on Twitter "What would people want/expect from taxon searching on TreeBASE?". This is a good question, and one which motivated the work I did on TBMap (see doi:10.1186/1471-2105-8-158), which developed a mapping between TreeBASE taxa and other databases. In that paper I published a table showing the effectiveness of string and hierarchical queries of TreeBASE. A string query returned just the studies that included the query term, whereas a hierarchical query returned all studies that contained taxa that descend from a given node in the NCBI taxonomy.

A hierarchical query can be visualised as a range query. For example, the diagram below shows a classification where the descendants of Node A correspond to the range 1-8 (this is a simplification of visitation numbers, see my earlier post, and also Chen et al. doi:10.1186/1471-2148-8-90). The three trees can be represented as the ranges 3-8, 6-7, and 2-4, respectively. To find trees for the taxon corresponding to Node A we look for trees whose range intersects 1-8; to find trees corresponding to Node B we look for trees whose range intersects 1-5.
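To make the range query concrete, here is a minimal sketch in Python using the numbers from the example above. The names (`Span`, `intersects`, `trees_for`) and the tree labels are made up for illustration; this is not the code behind TBMap or TreeBASE, just the intersection test written out.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A closed range of visitation numbers (hypothetical helper type)."""
    lo: int
    hi: int


def intersects(a: Span, b: Span) -> bool:
    # Two closed ranges overlap when each one starts before the other ends.
    return a.lo <= b.hi and b.lo <= a.hi


def trees_for(node: Span, trees: dict[str, Span]) -> list[str]:
    # Return the labels of all trees whose range intersects the node's range.
    return sorted(name for name, span in trees.items() if intersects(span, node))


# The three trees and two nodes from the example in the text.
trees = {"tree1": Span(3, 8), "tree2": Span(6, 7), "tree3": Span(2, 4)}
node_a = Span(1, 8)
node_b = Span(1, 5)

print(trees_for(node_a, trees))  # all three trees fall within 1-8
print(trees_for(node_b, trees))  # tree2 (6-7) lies outside 1-5
```

In a database the same test would presumably be a pair of comparisons on two indexed integer columns, which is what makes the range representation attractive.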

This approach retrieves a list of all trees that include a given taxon, but potentially this list could be very large (for example, a query for all plant trees could return thousands of trees). So the question becomes: how do we order these trees? Some ideas:
  1. order by the number of taxa in the tree.
  2. order by the size of the range ("span") of the tree.
  3. order by set inclusion.

Ordering by size seems attractive, but will favour trees with more taxa over those with a greater taxonomic spread (which is favoured by ordering by the span of the tree). I used span in my challenge entry to display papers with related taxa.
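The difference between the first two orderings is easy to see with a toy example. The sketch below assumes each tree is stored as a set of visitation numbers and that "span" means the width of the range from the smallest to the largest number; the tree labels and data are invented for illustration.

```python
# Hypothetical trees, each a set of visitation numbers for its taxa.
trees = {
    "dense": {3, 4, 5, 6},  # many taxa, narrow taxonomic spread
    "wide": {1, 5, 8},      # fewer taxa, broad taxonomic spread
    "small": {6, 7},
}


def span(taxa: set[int]) -> int:
    # Width of the closed range covered by the tree (an assumed definition).
    return max(taxa) - min(taxa) + 1


# Option 1: order by number of taxa, largest first.
by_size = sorted(trees, key=lambda name: len(trees[name]), reverse=True)

# Option 2: order by span, widest first.
by_span = sorted(trees, key=lambda name: span(trees[name]), reverse=True)

print(by_size)  # "dense" comes first: it has the most taxa
print(by_span)  # "wide" comes first: it covers the widest range
```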

What I have in mind for ordering by set inclusion is constructing a directed graph where each node represents a tree, and a pair of nodes (x, y) is connected by an edge xy if the taxa in tree y are nested inside the taxa in tree x. We could also introduce some additional nodes corresponding to nodes in the classification. If we then topologically sort the graph we have a linear order for the trees. Given that this order could be pre-computed independently of any queries (in much the same way that PageRank is), it could make for faster queries.
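Here is a rough sketch of the set-inclusion idea, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). I'm assuming "nested inside" means a proper subset of the taxon set, and the tree labels and data are hypothetical. It leaves out the extra nodes for the classification.

```python
from graphlib import TopologicalSorter

# Hypothetical trees, each a set of visitation numbers for its taxa.
trees = {
    "narrow": {3, 4},
    "broad": {1, 2, 3, 4, 5, 6, 7, 8},
    "middle": {2, 3, 4},
    "other": {6, 7},
}


def inclusion_order(trees: dict[str, set[int]]) -> list[str]:
    # For each tree y, collect the trees x whose taxa strictly contain y's.
    # Each such x is a predecessor of y, i.e. there is an edge x -> y.
    predecessors = {
        y: {x for x in trees if trees[y] < trees[x]}
        for y in trees
    }
    # static_order() yields every node after all of its predecessors.
    return list(TopologicalSorter(predecessors).static_order())


order = inclusion_order(trees)
print(order)  # "broad" first; "middle" always precedes "narrow"
```

A topological sort is not unique in general (here "other" and "middle" are not nested in one another, so either may come first), so some tie-breaking rule would still be needed to get a stable ranking.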

It would be useful to explore these and other ordering criteria. Perhaps the best approach would be some measure which combines one or more criteria, in which case we might want to use some form of rank aggregation (see iSpecies clones, and taxonomic intelligence for some links to the relevant literature).