Wednesday, June 12, 2013

Gibbons and GBIF: good grief what a mess

One reason I built BioNames (and the related digital archive BioStor) was to create tools to help make sense of taxonomic names. In exploring databases such as GBIF and the NCBI taxonomy, every so often you come across cases where things have gone horribly wrong, and to make sense of them you have to drill down into the taxonomic literature.

It's becoming increasingly clear to me that large parts of the GBIF classification that underpins their data portal are, well, a mess. There are duplicate taxa, homonyms, orphan genera, and so on. Now, building a global taxonomy on the scale of GBIF is a tough problem. They are merging a lot of individual classifications into an overall synthesis. That would be challenging in itself, but it's compounded by inconsistent use of names for the same taxon. In other words, synonymy. This is the greatest self-inflicted wound in taxonomy: the desire to have names be meaningful in terms of relationships (i.e., species in the same genus should be related). If you require that, then every change in our understanding of relationships changes the names, and the consequence is a mess (unless you have a really good taxonomic database in place to track name changes, and we don't).

As an example, consider the White-browed Gibbon (shown here in an image from EOL). This taxon occurs in at least three different places in the GBIF classification, and each name has occurrence data associated with it:

| GBIF id | Name | Source | Occurrences |
|---|---|---|---|
| 5219549 | Hylobates hoolock (Harlan, 1834) | The Catalogue of Life, 3rd January 2011 | 141 |
| 4267262 | Bunopithecus hoolock Harlan, 1834 | Mammal Species of the World, 3rd edition | 2 |
| 5786121 | Hoolock hoolock (Harlan, 1834) | IUCN Red List of Threatened Species | 3 |

To keep things simple I've omitted the subspecies (such as Bunopithecus hoolock hoolock). Note that three key resources for names (the Catalogue of Life, Mammal Species of the World, and the IUCN) can't agree on what to call this ape. The names are also not entirely consistent. For example, as written, Bunopithecus hoolock Harlan, 1834 (from Mammal Species of the World, 3rd edition) would imply that this was the original name for this gibbon, because the authority [Harlan, 1834] is not in parentheses. This is incorrect. The original name of the White-browed Gibbon is Simia hoolock, and you can see the original description in BioStor:

Harlan R (1834) Description of a Species of Orang, from the north-eastern province of British East India, lately the kingdom of Assam. Transactions of the American Philosophical Society 4: 52–59.
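The parentheses convention is mechanical enough to check in code. Here is a minimal Python sketch, not anything GBIF or BioNames actually runs: the helpers `parse_name` and `authority_is_consistent` and the regular expression are hypothetical, and they only handle simple "Genus species Author, Year" strings like the ones in the table above. The original genus (Simia) has to be supplied from the literature.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    genus: str
    epithet: str
    author: str
    year: int
    in_parentheses: bool  # True if the authority is written "(Author, Year)"


# Simplified pattern (an assumption): a binomial followed by "Author, Year",
# optionally wrapped in parentheses. Real name strings are far messier.
NAME_RE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+) (?P<epithet>[a-z]+) "
    r"(?P<open>\()?(?P<author>[^,()]+), (?P<year>\d{4})(?P<close>\))?$"
)


def parse_name(name: str) -> Optional[ParsedName]:
    """Split a simple name string into its parts, or return None if it doesn't fit."""
    m = NAME_RE.match(name.strip())
    # Reject strings with unbalanced parentheses around the authority.
    if m is None or bool(m.group("open")) != bool(m.group("close")):
        return None
    return ParsedName(
        genus=m.group("genus"),
        epithet=m.group("epithet"),
        author=m.group("author").strip(),
        year=int(m.group("year")),
        in_parentheses=m.group("open") is not None,
    )


def authority_is_consistent(name: str, original_genus: str) -> bool:
    """Parentheses should appear exactly when the species has left its original genus."""
    parsed = parse_name(name)
    if parsed is None:
        raise ValueError(f"cannot parse name: {name!r}")
    moved = parsed.genus != original_genus
    return moved == parsed.in_parentheses


if __name__ == "__main__":
    for name in (
        "Hylobates hoolock (Harlan, 1834)",
        "Bunopithecus hoolock Harlan, 1834",
        "Hoolock hoolock (Harlan, 1834)",
    ):
        status = "ok" if authority_is_consistent(name, "Simia") else "INCONSISTENT"
        print(f"{name}: {status}")
```

Run against the three names in the table, only the Bunopithecus string is reported as inconsistent, because it lacks parentheses even though the genus is not Simia.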
Since then it has been shuffled around various genera, including a genus (Hoolock) for which it is the type species:

Mootnick A, Groves C (2005) A new generic name for the hoolock gibbon (Hylobatidae). International Journal of Primatology 26(4): 971–976. doi: 10.1007/s10764-005-5332-4.

GBIF regards all three names as different taxa, despite all of them being names for the same gibbon. The practical consequence is that anyone seeking a comprehensive summary of what GBIF knows about the White-browed Gibbon is going to get different data depending on which name they use. In my experience this is not an uncommon occurrence (bats are another case where the GBIF classification is a terrible hodgepodge).

My goal here is not to berate GBIF; they are trying to aggregate messy, inconsistent data on a massive scale. But we need tools to flag cases like this poor gibbon, and ways to ensure that once we've found a problem it is fixed once and for all.
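As a rough idea of what a flagging tool could look like, the sketch below groups names that share a species epithet, author, and year but sit in different genera, and reports them as candidate synonyms. This is a heuristic assumed here for illustration, not GBIF's method: the function names are hypothetical, the records are the three rows from the table above, and a match only means "worth a human look", since the same author can describe unrelated species with the same epithet in the same year.

```python
import re
from collections import defaultdict

# (GBIF id, name, occurrences), copied from the table in this post.
RECORDS = [
    (5219549, "Hylobates hoolock (Harlan, 1834)", 141),
    (4267262, "Bunopithecus hoolock Harlan, 1834", 2),
    (5786121, "Hoolock hoolock (Harlan, 1834)", 3),
]

# Simplified pattern (an assumption): "Genus epithet Author, Year",
# with or without parentheses around the authority.
NAME_RE = re.compile(r"^([A-Z][a-z]+) ([a-z]+) \(?([^,()]+), (\d{4})\)?$")


def synonym_key(name):
    """Return (epithet, author, year), ignoring the genus and any parentheses."""
    m = NAME_RE.match(name.strip())
    if m is None:
        raise ValueError(f"cannot parse name: {name!r}")
    return (m.group(2), m.group(3).strip(), int(m.group(4)))


def flag_candidate_synonyms(records):
    """Group records by synonym_key and keep groups that span more than one genus."""
    groups = defaultdict(list)
    for gbif_id, name, occurrences in records:
        groups[synonym_key(name)].append((gbif_id, name, occurrences))
    return {
        key: members
        for key, members in groups.items()
        if len({name.split()[0] for _, name, _ in members}) > 1
    }


if __name__ == "__main__":
    for key, members in flag_candidate_synonyms(RECORDS).items():
        total = sum(occurrences for _, _, occurrences in members)
        print(f"Candidate synonyms for {key}: {total} occurrences in total")
        for gbif_id, name, occurrences in members:
            print(f"  {gbif_id}  {name}  ({occurrences})")
```

For the gibbon, all three GBIF ids land in one group with 146 occurrences between them, compared with the 141, 2, or 3 a user would see by searching on a single name.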