One way in which researchers assess the trustworthiness of content is by determining how it sits within the scholarly record. Does it provide evidence for its assertions in citations? Do other people cite it?
Given how important the NCBI taxonomy is, I think it would be a great improvement if each name could be linked to the original taxonomic publication. A first step to this is to find the taxonomic authority, the name of the author (or authors) of the name.
One potential source is uBio, which provides web services for retrieving information on names. Hence, an obvious approach is to map NCBI names to uBio names. However, if I use uBio's SOAP service typically I get multiple records for the same name. Some of these are due to homomyms (e.g., the same name used for a plant and an animal), but many are the same name with variations on the taxonomic authority. Much of this variation arises because uBio aggregates information from a wide range of databases, and each database differs in who it records the taxonomic authority.
For example, for the name "Diplura" (which I've discussed earlier) we get these names and authorities:
- Diplura (Greene MS.) Allman 1864
- Diplura Borner, 1904
- Diplura C. L. Koch 1850
- Diplura G. J. Hollenberg, 1969
- Diplura Hollenb.
- Diplura Jerdon 1864
- Diplura Koch 1850
- Diplura Koch 1851
- Diplura Rambur 1866
- Diplura Simon 1892
The nodes in the graph are the taxonomic authorities, "cleaned" by making all the text lower case, and stripping any punctuation. The edges are labelled by the length of the longest common substring shared by the nodes that edge links (I ignore substrings less than four characters long). This graph groups the variations on Diplura Koch (a spider), and Diplura Hollenberg (a brown alga, see doi:10.1111/j.1529-8817.1969.tb02617.x).
Not surprisingly, perhaps, the linkouts from the NCBI taxonomy for Diplura are a mess, with the algal genus (taxon:371965 linking to both plant and animal databases, and the insect class (taxon:29997) linking to a mix of plants and animals, not all of the animals are insects.
I'm still playing with the underlying code, but I might try and build a web service that returns name clusters (and perhaps the graph as well).