Thursday, January 18, 2007

The joys of mapping names in TreeBASE

Here's a fun example of how databases get out of sync, making them harder to link up. TreeBASE taxon T4628 is labelled Bolitoglossa sombra, a name that doesn't exist in NCBI's taxonomy database. This is odd, because the study by Mueller et al. (S1139) is a molecular phylogeny (doi:10.1073/pnas.0405785101), and the taxon concerned has had its whole mitochondrial genome sequenced. In the paper this taxon is listed as "Bolitoglossa sp. nov.", and in NCBI's database it is Bolitoglossa n. sp. RLM-2004 (taxid 291262).

So why Bolitoglossa sombra in TreeBASE?
Well, Googling finds Darrel Frost's Amphibian Species of the World page on this species, which lists the name "Bolitoglossa sombra Hanken, Wake, and Savage, 2005, Copeia, 2005: 234." Googling again finds a PDF of this paper linked from David Wake's web site, and by Googling on "Copeia" and "BioOne" I get a DOI for the paper (doi:10.1643/CH-04-083R1).

Reading the paper doesn't make me any the wiser¹, until I get the supplementary information for Mueller et al. and discover that Bolitoglossa sp. nov. is specimen MVZ 225875. Searching the PDF of Hanken et al., I find (p. 236):
Three juveniles (MVZ 225875–76, 225878) were generally black but had some obscure whitish patches, which were most evident near the tail base.

So, MVZ 225875 is Bolitoglossa sombra. I confirm this by doing a DiGIR lookup on MVZ 225875 using a script I wrote (doing this is an absolute pain because of the way DiGIR is constructed, and because it doesn't provide resolvable identifiers for specimens). You can view the specimen record directly at MVZ.

¹Doh! If I'd read the paper properly, I'd have seen that MVZ 225875 is listed as one of the paratypes of Bolitoglossa sombra.

What's your point?
My point is this is a lot of tedium to go through to link up the following items:
  1. TreeBASE taxon
  2. NCBI taxonomic record
  3. NCBI genomic record
  4. Publication of scientific name
  5. Specimen sequenced
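
To make the five items above concrete, here is a minimal sketch of what a single linked record might look like, using only the identifiers quoted in this post. The class name `LinkedTaxon`, its field names, and the helper `missing_links` are hypothetical, invented for illustration; the genome accession is not given in the post, so it is left unset rather than guessed.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class LinkedTaxon:
    """Hypothetical record tying together identifiers from separate databases."""
    treebase_taxon: str            # TreeBASE taxon identifier
    treebase_study: str            # TreeBASE study identifier
    ncbi_taxid: int                # NCBI taxonomy identifier
    ncbi_genome: Optional[str]     # NCBI genomic record (accession not quoted in the post)
    name_publication_doi: str      # DOI of the paper publishing the scientific name
    specimen: str                  # museum catalogue number of the sequenced specimen


def missing_links(record: LinkedTaxon) -> List[str]:
    """Return the names of fields that have not been filled in yet."""
    return [name for name, value in asdict(record).items() if value is None]


# Identifiers taken from the text of this post.
sombra = LinkedTaxon(
    treebase_taxon="T4628",
    treebase_study="S1139",
    ncbi_taxid=291262,
    ncbi_genome=None,  # assumption: left unset because the post does not state it
    name_publication_doi="10.1643/CH-04-083R1",
    specimen="MVZ 225875",
)

print(missing_links(sombra))
```

Nothing here is clever; the point is that once the identifiers are sitting in one record, spotting the gaps becomes trivial.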

Each of these records exists and has an identifier (of varying utility), and in the case of all but the TreeBASE record, there are ways to retrieve metadata about the record in XML format. Yet these records exist in isolation and haven't been linked, which means I cannot easily connect them just by looking at each database. For example, looking at NCBI's record for Bolitoglossa n. sp. RLM-2004, I have no idea that this amphibian has a phylogeny in TreeBASE, or has been described in the scientific literature and is now called Bolitoglossa sombra.
This is just crazy, and is not that hard to fix if we have globally unique identifiers for digital records that are resolvable, and ways to harvest metadata about those records. For more rants and examples on this theme see SemAnt.
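
As a rough illustration of what "resolvable" means in practice, the sketch below turns two of the identifiers from this post into URLs: a DOI via the doi.org resolver, and an NCBI taxonomy ID via the E-utilities `efetch` endpoint, which returns XML. The function names `doi_url` and `ncbi_taxonomy_url` are made up for this example, and the code only builds the URLs without making any network request.

```python
from urllib.parse import quote, urlencode


def doi_url(doi: str) -> str:
    """Build a resolver URL for a DOI (the slash in the DOI is kept as-is)."""
    return "https://doi.org/" + quote(doi, safe="/")


def ncbi_taxonomy_url(taxid: int) -> str:
    """Build an E-utilities efetch URL asking for a taxonomy record as XML."""
    query = urlencode({"db": "taxonomy", "id": taxid, "retmode": "xml"})
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + query


# Identifiers taken from the text of this post.
print(doi_url("10.1643/CH-04-083R1"))
print(ncbi_taxonomy_url(291262))
```

The specimen and the TreeBASE taxon are the awkward cases: this post gives no comparably simple URL pattern for either, which is exactly the complaint.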
