Wednesday, May 30, 2012

The GBIF classification is broken — how do we fix it?

This post arose from an ongoing email conversation with Tony Rees about extracting and annotating taxonomic names. In BioStor I use the GBIF classification to display the taxonomic names found in the OCR text in the form of a tree. The idea is to give the reader a sense of "what the paper is about". I also use the classification to help link to GBIF occurrence records.

The GBIF backbone classification ("nub") is probably the single largest classification of life that has been assembled, and provides GBIF users with a way to navigate through GBIF's collection of specimen and observation records. Given the scale of the undertaking it is inevitable that there will be issues with the classification, and this post provides one example.

On the page for the article "Further additions to the known marine Molluscan fauna of St. Helena", part of the classification looks like this:

Tony points out that "Hipponyx" is a mollusc, yet in the GBIF classification appears in the annelid worms.

Like a fool I started to investigate further. First off, what is "Hipponyx"? Browsing the GBIF classification there are species of Hipponyx and Hipponix under the genus Hipponix, so it looks like we have two alternative spellings of this genus name. Nomenclator Zoologicus has both spellings, Hipponix credited to DeFrance 1819 Journ. de Physique, 88, 217, and Hipponyx credited to Defrance 1819 Bull. Sci. Soc. philom. Paris, 8. Gotta love those cryptic citations. After some digging around in BHL I found Journ. de Physique, 88, 217 (Mémoire sur un nouveau genre de mollusque) and Bull. Sci. Soc. philom. Paris, 8. (Sur un nouveau genre de coquilles (Hipponix)). Both papers are by Jacques Louis Marin DeFrance, and both use the spelling Hipponix (no 'y'). I'm guessing the second paper is actually the original description of the genus, but my French is abysmal (Google Translate to the rescue).

OK, so we have two spellings of what is probably the same thing (and I've no idea why we have two spellings). Both spellings seem in use (see Google NGrams chart below).
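Spotting variant spellings like this by eye doesn't scale, so here is a rough sketch of how near-identical genus names could be flagged automatically. It assumes nothing about how GBIF does this: the function name, the 0.85 similarity cutoff, and the short list of genus names are my own choices for illustration, and string similarity alone will also throw up pairs that are genuinely different genera.

```python
from difflib import SequenceMatcher
from itertools import combinations


def variant_pairs(genera, cutoff=0.85):
    """Return pairs of genus names that are similar enough to be possible
    spelling variants. The cutoff is an arbitrary choice for this sketch."""
    pairs = []
    # Sort and de-duplicate so each pair is reported once, in alphabetical order
    for a, b in combinations(sorted(set(genera)), 2):
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity >= cutoff:
            pairs.append((a, b))
    return pairs


# A handful of genus names mentioned in this post
names = ["Hipponyx", "Hipponix", "Temporaria", "Serpula", "Zoophycos"]
candidates = variant_pairs(names)
for a, b in candidates:
    print(f"Possible spelling variants: {a} / {b}")
```

Anything flagged this way still needs a person to check the nomenclators and the original literature, as above.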


So, bit of a mess, but this still doesn't deal with Hipponyx being a worm in GBIF. After a bit of Googling on "Serpulidae" and "Hipponyx" I came across a specimen record from Te Papa labelled "Worm, Temporaria inexpectata (Mestayer, 1929); holotype; holotype of Hipponyx inexpectata Mestayer, 1929". I then came across this paper:

Fleming, C. A. (1971). A preliminary list of New Zealand fossil polychaetes. New Zealand Journal of Geology and Geophysics, 14(4), 742–756. doi:10.1080/00288306.1971.10426332

with the following abstract:
An annotated list of fossil “worm tubes” from New Zealand includes both published and new records from Mesozoic and Cenozoic deposits.

The binomen Zoophycos plicatus (Hutton) is proposed for the trace fossil long known as the Amuri fucoid, of unknown zoological affinity.

The following living species are recorded as New Zealand fossils for the first time: Protula bispiralis (Savigny), Salmacina dysteri (Huxley), Hydroides norvegicus Gunnerus, Pomatoceras cariniferus (Gray), P. aff. terranovae (Benham), Galeolaria hystrix (Moerch), Boccardia ? polybranchia (Haswell); new records of fossil species are Ditrupa cf. plana (Sowerby), Dorsoserpula lumbricalis (Schlotheim), and Neomicrorbis crenatostriatus (Münster). The name Hipponyx inexpectata Mestayer 1929, applied to a serpulid operculum, is used in the combination Temporaria inexpectata for a tubeworm common in deep water off New Zealand that has also been identified, with associated operculum, from the bathyal Waitotaran (Pliocene) sediments of Palliser Bay. Serpula wharjensis Wilkens and S. ougenensis Chapman are placed in Sclerostyla Moerch. Two species of Vermiliopsis and two of Spirorbis are figured but not named specifically.

The author of the paper (Charles Fleming) argues that Hipponyx inexpectata, regarded as a mollusc by its describer (Marjorie K. Mestayer, see Notes on New Zealand Mollusca. No. 4.), is actually a worm, and he moves it to the genus Temporaria.

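So it seems that Hipponyx has ended up being a worm in the GBIF classification because of this synonymy: one species described in Hipponyx is now treated as a serpulid worm, and the genus name appears to have followed it.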
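This kind of problem ought to be detectable: if the species carrying a genus name resolve (after following synonymy) to more than one phylum, something deserves a second look. Below is a minimal sketch of such a check. The records are toy data I've typed in by hand for illustration, not an export from GBIF, and the function names are made up.

```python
from collections import defaultdict

# Toy records, NOT taken from GBIF: each pairs a name with the phylum of the
# accepted name it resolves to once synonymy is followed.
RECORDS = [
    ("Hipponyx inexpectata", "Annelida"),    # moved to Temporaria by Fleming (1971)
    ("Hipponyx antiquatus", "Mollusca"),     # illustrative mollusc species
    ("Temporaria inexpectata", "Annelida"),
]


def phyla_by_genus(records):
    """Map each genus (first word of the binomial) to the set of phyla
    that its species resolve to."""
    result = defaultdict(set)
    for name, phylum in records:
        genus = name.split()[0]
        result[genus].add(phylum)
    return dict(result)


def conflicting_genera(records):
    """Return only those genera whose species end up in more than one phylum."""
    return {
        genus: phyla
        for genus, phyla in phyla_by_genus(records).items()
        if len(phyla) > 1
    }


conflicts = conflicting_genera(RECORDS)
for genus, phyla in conflicts.items():
    print(f"{genus} spans several phyla: {', '.join(sorted(phyla))}")
```

A conflict doesn't say which placement is wrong, only that the genus needs checking. Working out why still means going back to the literature.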

Now, this little investigation was "fun", but took a couple of hours. Much of that was spent tracking down the literature and adding it to BioStor, which is a one-time cost. Not every issue with the GBIF classification will take this long to resolve, but some may take longer. So there's a problem of scalability. Then there's the issue of how this information gets into the GBIF classification so we fix it (and so that people don't think Hipponyx is a worm). As has been said several times before, most eloquently by David Shorthouse, isn't it time we started using software development tools such as version control to help build, annotate, and correct classifications such as the one that underpins GBIF? That way when somebody spots an error it can be flagged, and someone with the time (and curiosity) can fix it.

Tuesday, May 15, 2012

EOL challenge draft proposal

In the spirit of the "Would you give me a grant experiment?" [1], here's the draft of a proposal I'm working on for the Computable Data Challenge. It's an attempt to merge taxonomic names, the primary literature, and phylogenetics into one all-singing, all-dancing website that makes it easy to browse names, see the publications relevant to those names, and see what, if anything, we know about the phylogeny of those taxa. It builds on a number of other projects I've been working on, most recently my efforts to link names to the primary literature. Comments welcome (the proposal deadline is next week).

The proposal is embedded below using Google's PDF viewer. If you can't see it, try logging into your Google account, or click here.

1. The answer from NERC was a resounding "no".