Wednesday, November 29, 2006

Homonyms and uBio's data model (yet more on names)

As part of the TreeBASE name mapping exercise, I've come across some interesting names, such as "Diplura". This is a homonym, meaning that more than one taxon has this name. This can complicate life somewhat.

In TreeBASE, the taxon Diplura is a spider genus (TreeBASE taxon T4182), part of the study by Fredrick Coyle (hdl:2246/1665).

NCBI has "Diplura" (Taxonomy ID 29997), but this is the insect class (or order, depending on which classification you use). NCBI mistakenly links "Diplura" in NCBI to "Diplura" in TreeBASE, but links correctly to the insect record in ITIS (Taxonomic Serial No:99228).

To make matters worse, there is also an algal genus Diplura, which ITIS also has (Taxonomic Serial No:10873).

The problem comes when we look up this name in uBio. The name Diplura is listed as appearing in several classifications, including NCBI, ITIS, etc., as well as its occurrence as a butterfly name (Diplura Ranbur, 1866). However, in the metadata for this name there is the tag <ubio:taxonomicGroup>Phaeophyceae</ubio:taxonomicGroup> (the Phaeophyceae are algae). Clearly, a name that is used by a spider, an insect, and an alga (never mind a butterfly) can't be assigned to a single taxonomic group. Perhaps one solution would be to have multiple instances of the <ubio:taxonomicGroup> tag, one for each major taxonomic group the name came from.
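To make the multiple-tag idea concrete, here is a minimal Python sketch of how a client might read such a record. It is not uBio's actual format: the namespace URI, the <record> wrapper, the <ubio:nameString> element, and the group labels other than Phaeophyceae are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real uBio namespace may differ.
UBIO_NS = "http://example.org/ubio#"

# Illustrative record with one taxonomicGroup element per major group.
# Only "Phaeophyceae" comes from the actual uBio metadata; the other
# two labels are assumed stand-ins for the spider and insect usages.
SAMPLE = f"""<record xmlns:ubio="{UBIO_NS}">
  <ubio:nameString>Diplura</ubio:nameString>
  <ubio:taxonomicGroup>Phaeophyceae</ubio:taxonomicGroup>
  <ubio:taxonomicGroup>Arachnida</ubio:taxonomicGroup>
  <ubio:taxonomicGroup>Insecta</ubio:taxonomicGroup>
</record>"""


def taxonomic_groups(xml_text):
    """Return every taxonomicGroup value in the record, in document order."""
    root = ET.fromstring(xml_text)
    tag = f"{{{UBIO_NS}}}taxonomicGroup"
    return [el.text.strip() for el in root.findall(tag) if el.text]


def spans_multiple_groups(xml_text):
    """True if the name string is used in more than one taxonomic group."""
    return len(set(taxonomic_groups(xml_text))) > 1


print(taxonomic_groups(SAMPLE))
print(spans_multiple_groups(SAMPLE))
```

A consumer that expects exactly one group would need to change, but one that only asks "is this name used in group X?" would keep working.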

My motivation in all this is to start thinking about taxonomic names as simple "tags", with a view to using some of the vocabularies for taxonomies and "folksonomies" being developed elsewhere, such as SKOS Core. Under this approach, I'd need GUIDs for name strings, independent of their usage. uBio pretty much does this, but for the <ubio:taxonomicGroup> tag.
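As a rough sketch of what "name string as tag" might look like, the Python below gives the string "Diplura" one identifier and hangs each usage off it. The class names and the `urn:example:` GUID are hypothetical; the TreeBASE, NCBI, and ITIS identifiers are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameTag:
    """A bare name string with its own identifier, independent of usage."""
    guid: str   # hypothetical GUID scheme, for illustration only
    label: str


@dataclass(frozen=True)
class Usage:
    """One database's use of a name tag for a particular taxon."""
    tag: NameTag
    source: str
    local_id: str
    note: str


DIPLURA = NameTag(guid="urn:example:name:diplura", label="Diplura")

USAGES = [
    Usage(DIPLURA, "TreeBASE", "T4182", "spider genus"),
    Usage(DIPLURA, "NCBI", "29997", "insect class or order"),
    Usage(DIPLURA, "ITIS", "99228", "insect"),
    Usage(DIPLURA, "ITIS", "10873", "algal genus"),
]


def usages_of(guid, usages):
    """Return every usage attached to the name tag with this GUID."""
    return [u for u in usages if u.tag.guid == guid]


for u in usages_of(DIPLURA.guid, USAGES):
    print(u.source, u.local_id, u.note)
```

The point of the split is that the tag carries no taxonomic group at all; any group information lives on the individual usages.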


David Marjanović said...

Clearly, the spider genus and the butterfly genus can't keep the same name; one of them must change according to the ICZN. Without the PhyloCode, however, we'll probably never get rid of the homonymy with the brown alga genus and the insect order.

What is a GUID?

Rod Page said...

The butterfly name is a junior homonym; the currently accepted genus name is Psilogaster, according to the NHM LepIndex, so the problem has been addressed.

GUID stands for Globally Unique Identifier. Examples include the DOIs you see on publications, handles (such as the one in the original post), and LSIDs. For some background see the TDWG-GUID page, and my paper Taxonomic names, metadata, and the Semantic Web.

David Marjanović said...

Many thanks for your paper!

In that paper you write:

"The ultimate goal is to be able to query the TreeBASE database with a taxonomic name, and recover all the studies that contain that taxon."

I wonder... why bother with TreeBASE?

I mean, it's a good thing that TreeBASE exists, and it's also good that Systematic Biology requires that authors submit their trees and matrices to it, but still, it's unfortunately a completely marginal phenomenon, compared to, say, GenBank, or even to the Nomenclator Zoologicus (with its patchy coverage of fossils). For example, most palaeontologists seem never to have heard of it; you will notice a general lack of Mesozoic dinosaurs in it (a field where ever larger cladograms have been produced since 1984, two years ago reaching, IIRC, 75 taxa and 738 characters for an analysis of just the carnivorous ones, and where all workers except maybe one are today cladists).

BTW, is it programmers' custom to use "schema" as the plural of "scheme" (maybe like how sinologists distinguish "rime" and "rhyme")? In Greek, "schema" is the singular, and its plural is "schemata"... ~:-|

Rod Page said...

Why bother with TreeBASE? This is a good question, and I've made my feelings about TreeBASE clear elsewhere (TreeBASE rocks and TreeBASE talk at CIPRES).

Yes, to some extent TreeBASE is marginal, but the comparisons aren't terribly meaningful. GenBank is a massively funded database (US federal funds, with replication in Japan and Europe), and Nomenclator Zoologicus is essentially an OCR'ed list of names - a valuable resource, but nowhere near as complicated as TreeBASE.

I want a database of phylogenies that can be queried from multiple perspectives (i.e., by taxon, author, publication, geographic locality, time slice, etc.). Much of my current research can be summarised as asking the question "what would it take to make this possible?" The fact that "most palaeontologists seem never to have heard of it" suggests two things: 1) palaeontologists aren't following phylogenetic database efforts, and 2) people interested in phylogenetic database efforts haven't succeeded in engaging palaeontologists -- neither situation is satisfactory. In the same way, it is striking that palaeontological names are largely missing from databases of taxonomic names (e.g., uBio, see also Connotea and TreeBASE). There seems to be a fundamental disconnect between neontologists and palaeontologists in this area.

Lastly, regarding "schema" and "schemata", it's too late in the day for me to offer much on this, except to say that I think "schema" is indeed used in the singular -- it's a model or plan of a database.

David Marjanović said...

I see, thanks!

On the disconnect I can only offer the hypothesis that the neontologists don't know the palaeontologists exist... that was worse in the past (remember the times when some people said fossils shouldn't even be used in phylogenetic analyses because they introduce so many question marks to the data matrix?), but still.

David Marjanović said...

Sorry, not 738 characters, only 438 (and I hear many of those are correlated).