Tuesday, February 24, 2009

Clustering taxonomic names

As part of my Quixotic attempt to construct a wiki of taxonomic names, I'm building a database of names and links. My current plan is to seed this with the NCBI taxonomy. What I want to do is flesh out the NCBI taxonomy with authorities and links to the original literature. At the moment the NCBI taxonomy is almost "nude", lacking links to the literature behind the names. As the magnificently bearded Geoffrey Bilder notes in an interview with Martin Fenner:
One way in which researchers assess the trustworthiness of content is by determining how it sits within the scholarly record. Does it provide evidence for its assertions in citations? Do other people cite it?

Given how important the NCBI taxonomy is, I think it would be a great improvement if each name could be linked to the original taxonomic publication. A first step towards this is to find the taxonomic authority, that is, the author (or authors) of the name.

One potential source is uBio, which provides web services for retrieving information on names. Hence, an obvious approach is to map NCBI names to uBio names. However, if I use uBio's SOAP service, I typically get multiple records for the same name. Some of these are due to homonyms (e.g., the same name used for a plant and an animal), but many are the same name with variations on the taxonomic authority. Much of this variation arises because uBio aggregates information from a wide range of databases, and each database differs in how it records the taxonomic authority.

For example, for the name "Diplura" (which I've discussed earlier) we get these names and authorities:
  • Diplura (Greene MS.) Allman 1864
  • Diplura Borner, 1904
  • Diplura C. L. Koch 1850
  • Diplura G. J. Hollenberg, 1969
  • Diplura Hollenb.
  • Diplura Jerdon 1864
  • Diplura Koch 1850
  • Diplura Koch 1851
  • Diplura Rambur 1866
  • Diplura Simon 1892
Before asking which of these names corresponds to "Diplura" in NCBI, I'd like to cluster these names into sets by merging names that are "the same." This resembles the problem of equivalent author names. The approach I'm using is to build a graph linking taxonomic authorities that are more similar than some threshold, then find the connected components of that graph. For example, here is the graph for "Diplura":

The nodes in the graph are the taxonomic authorities, "cleaned" by making all the text lower case and stripping any punctuation. Each edge is labelled with the length of the longest common substring shared by the two nodes it links (I ignore substrings fewer than four characters long). This graph groups the variations on Diplura Koch (a spider), and Diplura Hollenberg (a brown alga, see doi:10.1111/j.1529-8817.1969.tb02617.x).
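The clustering steps just described can be sketched roughly as follows. This is a minimal Python illustration, not the code that produced the graph above: the function names (clean, longest_common_substring, cluster_authorities) and the min_length parameter are hypothetical, and I am assuming the genus name has already been removed from each string so that only the authority is compared.

```python
import string
from itertools import combinations

# Authority strings for "Diplura" from the list above, with the genus
# name removed (an assumption: only the authority part is compared).
AUTHORITIES = [
    "(Greene MS.) Allman 1864",
    "Borner, 1904",
    "C. L. Koch 1850",
    "G. J. Hollenberg, 1969",
    "Hollenb.",
    "Jerdon 1864",
    "Koch 1850",
    "Koch 1851",
    "Rambur 1866",
    "Simon 1892",
]


def clean(authority):
    """Lower-case the string, strip punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(authority.lower().translate(table).split())


def longest_common_substring(a, b):
    """Return the length of the longest substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def cluster_authorities(authorities, min_length=4):
    """Link authorities sharing a substring of at least min_length
    characters, then return the connected components as a list of sets."""
    cleaned = {a: clean(a) for a in authorities}
    graph = {a: set() for a in authorities}
    for a, b in combinations(authorities, 2):
        if longest_common_substring(cleaned[a], cleaned[b]) >= min_length:
            graph[a].add(b)
            graph[b].add(a)

    seen = set()
    clusters = []
    for start in authorities:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from start.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


if __name__ == "__main__":
    for cluster in cluster_authorities(AUTHORITIES):
        print(sorted(cluster))
```

One caveat with a sketch this simple: the year counts towards the common substring, so two authorities that merely share a year (both Allman and Jerdon end in " 1864") will be linked even though the authors differ. A stricter similarity measure, or comparing author names and years separately, may be needed in practice.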

Not surprisingly, perhaps, the linkouts from the NCBI taxonomy for Diplura are a mess, with the algal genus (taxon:371965) linking to both plant and animal databases, and the insect class (taxon:29997) linking to a mix of plants and animals, not all of which are insects.

I'm still playing with the underlying code, but I might try and build a web service that returns name clusters (and perhaps the graph as well).


Paul said...

Hi Rod,

I'm sure you've seen the WikiProject Tree of Life. Have you thought about getting involved here? It looks like they've made a reasonable start.
E.g., here is the Diplura article, which links to a disambiguation page. They seem to have reasonable references for each taxonomic name - at least for the few I've looked at. None from the 1800s though. ;-) It might be worth getting involved - I'm sure they'd love to have a guru like yourself.

Best wishes,

Rod Page said...


Wikipedia has some great stuff, but I prefer a wiki based on, say, Semantic MediaWiki. This supports some basic queries, so much of the content of a wiki page can be generated automatically, rather than having to be entered by hand. I also want pages for each publication and author, which isn't what Wikipedia has in mind.

kevin said...

have you contacted NCBI on what help they can render?
seems like a worthwhile project ... esp cos with more genomes being published the taxonomic info becomes much more important!