iPhylo: Clustering taxonomic names

Roderic D. M. Page

Tuesday, February 24, 2009

Clustering taxonomic names

As part of my Quixotic attempt to construct a wiki of taxonomic names, I'm building a database of names and links. My current plan is to seed this with the NCBI taxonomy. What I want to do is flesh out the NCBI taxonomy with authorities and links to the original literature. At the moment the NCBI taxonomy is almost "nude", lacking links to the literature behind the names. As the magnificently bearded Geoffrey Bilder notes in an interview with Martin Fenner:

One way in which researchers assess the trustworthiness of content is by determining how it sits within the scholarly record. Does it provide evidence for its assertions in citations? Do other people cite it?

Given how important the NCBI taxonomy is, I think it would be a great improvement if each name could be linked to the original taxonomic publication. A first step toward this is to find the taxonomic authority, that is, the author (or authors) who published the name.

One potential source is uBio, which provides web services for retrieving information on names. Hence, an obvious approach is to map NCBI names to uBio names. However, if I use uBio's SOAP service, I typically get multiple records for the same name. Some of these are due to homonyms (e.g., the same name used for a plant and an animal), but many are the same name with variations on the taxonomic authority. Much of this variation arises because uBio aggregates information from a wide range of databases, and each database differs in how it records the taxonomic authority.

For example, for the name "Diplura" (which I've discussed earlier) we get these names and authorities:

Diplura (Greene MS.) Allman 1864
Diplura Borner, 1904
Diplura C. L. Koch 1850
Diplura G. J. Hollenberg, 1969
Diplura Hollenb.
Diplura Jerdon 1864
Diplura Koch 1850
Diplura Koch 1851
Diplura Rambur 1866
Diplura Simon 1892

Before asking which of these names corresponds to "Diplura" in NCBI, I'd like to cluster these names into sets by merging names that are "the same." This resembles the problem of equivalent author names. The approach I'm using is to build a graph linking taxonomic authorities that are more similar than some threshold, then find the connected components of that graph. For example, here is the graph for "Diplura":

The nodes in the graph are the taxonomic authorities, "cleaned" by making all the text lower case and stripping any punctuation. Each edge is labelled with the length of the longest common substring shared by the two nodes it links (I ignore substrings shorter than four characters). This graph groups the variations on Diplura Koch (a spider), and Diplura Hollenberg (a brown alga, see doi:10.1111/j.1529-8817.1969.tb02617.x).
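To make the steps concrete, here is a minimal Python sketch of the clustering as described above: clean each authority, link any two whose longest common substring has at least four characters, and read off the connected components. This is not the code behind the graph; the function names (clean, longest_common_substring, cluster), the min_length parameter, and the use of a union-find structure to find the components are all my own assumptions for illustration. The input is the list of authorities given earlier, with the leading "Diplura" removed.

```python
import string
from itertools import combinations


def clean(authority):
    """Lower-case the text, strip punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(authority.lower().translate(table).split())


def longest_common_substring(a, b):
    """Return the length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def cluster(authorities, min_length=4):
    """Group cleaned authorities into connected components.

    Two authorities are linked when their longest common substring
    has at least min_length characters (an assumed threshold of 4).
    """
    nodes = sorted({clean(a) for a in authorities})
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(nodes, 2):
        if longest_common_substring(a, b) >= min_length:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


# Authorities for "Diplura" from the list above (genus name removed).
authorities = [
    "(Greene MS.) Allman 1864",
    "Borner, 1904",
    "C. L. Koch 1850",
    "G. J. Hollenberg, 1969",
    "Hollenb.",
    "Jerdon 1864",
    "Koch 1850",
    "Koch 1851",
    "Rambur 1866",
    "Simon 1892",
]

for group in cluster(authorities):
    print(sorted(group))
```

In this sketch the Koch variants end up in one cluster and the Hollenberg variants in another. Note that the year counts toward the shared substring here, so unrelated authorities published in the same year or decade can also be linked; the cleaning step or the threshold may need tuning to avoid that.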

Not surprisingly, perhaps, the linkouts from the NCBI taxonomy for Diplura are a mess, with the algal genus (taxon:371965) linking to both plant and animal databases, and the insect class (taxon:29997) linking to a mix of plants and animals, where not all of the animals are insects.

I'm still playing with the underlying code, but I might try to build a web service that returns name clusters (and perhaps the graph as well).