Thursday, August 14, 2014

Seven percent of GBIF data is usable - quick thoughts on Hjarding et al. 2014

Update: Angelique Hjarding and her co-authors have responded in a guest post on iPhylo.

The quality and fitness for use of GBIF-mobilised data is a topic of interest to anyone that uses GBIF data. As an example, a recent paper on African chameleons comes to some rather alarming conclusions concerning the utility of GBIF data:

Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427

Here's the abstract (unfortunately the paper is behind a paywall):

The IUCN Red List of Threatened Species uses geographical distribution as a key criterion in assessing the conservation status of species. Accurate knowledge of a species’ distribution is therefore essential to ensure the correct categorization is applied. Here we compare the geographical distribution of 35 species of chameleons endemic to East Africa, using data from the Global Biodiversity Information Facility (GBIF) and data compiled by a taxonomic expert. Data screening showed 99.9%of GBIF records used outdated taxonomy and 20% had no locality coordinates. Conversely the expert dataset used 100%up-to-date taxonomy and only seven records (3%) had no coordinates. Both datasets were used to generate range maps for each species, which were then used in preliminary Red List categorization. There was disparity in the categories of 10 species, with eight being assigned a lower threat category based on GBIF data compared with expert data, and the other two assigned a higher category. Our results suggest that before conducting desktop assessments of the threatened status of species, aggregated museum locality data should be vetted against current taxonomy and localities should be verified. We conclude that available online databases are not an adequate substitute for taxonomic experts in assessing the threatened status of species and that Red List assessments may be compromised unless this extra step of verification is carried out.

The authors used two data sets, one from GBIF, the other provided by an expert to compute the conservation status for each chameleon species endemic to Kenya and/or Tanzania. After screening the GBIF data for taxonomic and geographic issues, a mere 7% of the data remained - 93% of the 2304 records downloaded from GBIF were discarded.

This study raises a number of questions, some of which I will touch on here. Before doing so, it's worth noting that it's unfortunate that neither of the two data sets used in this study (the data downloaded from GBIF, and the expert data set assembled by Colin Tilbury) are provided by the authors, so our ability to further explore the results is limited. This is a pity, especially now that citable data repositories such as Dryad and Figshare are available. The value of this paper would have been enhanced if both datasets were archived.

Below is Table 1 from the paper, "Museums from which locality records for East African chameleons were obtained for the expert and GBIF datasets":

MuseumExpert datasetGBIF
Afrika Museum, The Netherlandsx
American Museum of Natural History, USAx
Bishop Museum, USAx
British Museum of Natural History, UKx
Brussels Museum of Natural Sciences, Belgiumx
California Academy of Sciences, USAx
Ditsong Museum, South Africaxx
Los Angeles County Museum of Natural History, USAx
Museum für Naturkunde, Germanyx
Museum of Comparative Zoology (Harvard University), USAx
Naturhistorisches Museum Wien, Austriax
Smithsonian Institution, USAx
South African Museum, South Africax
Trento Museum of Natural Sciences, Italyx
University of Dar es Salaam, Tanzaniax
Zoological Research Museum Alexander Koenig, Germanyx

It is striking that there is virtually no overlap in data sources available to GBIF and the sources used by the expert. Some of the museums have no presence in GBIF, including some major collections (I'm looking at you, The Natural History Museum), but some museums do contribute to GBIF, but not their herpetology specimens. So, GBIF has some work to do in mobilising more data (Why is this data not in GBIF? What are the impediments to that happening?). Then there are museums that have data in GBIF, but not in a form useful for this study. For example, the American Museum of Natural History has 327,622 herpetology specimens in GBIF, but not one of these is georeferenced! Given that there are records in GenBank for AMNH specimens that are georeferenced, I suspect that the AMNH collection has deliberately not made geographic coordinates available, which raises the obvious question - why?

GBIF coverage

I had a quick look at GBIF to get some idea of the geographic coverage of the relevant herpetology collections (or animal collections if herps weren't separated out). Below are maps for some of these collections. The AMNH is empty, as is the smaller Zoological Research Museum Alexander Koenig collection (which supplied some of the expert data).

American Museum of Natural History, USA

Bishop Museum, USA

California Academy of Sciences, USA

Ditsong Museum, South Africa

Los Angeles County Museum of Natural History, USA

Museum für Naturkunde, Germany

Museum of Comparative Zoology (Harvard University), USA

Smithsonian Institution, USA

Zoological Research Museum Alexander Koenig, Germany

Some collections are relevant, such as the California Academy of Sciences, but a number of the collections in GBIF simply don't have georeferenced data on chameleons. Then there are several museums that are listed as sources for the expert database and which contribute to GBIF, but haven't digitised their herp collections, or haven't made these available to GBIF.


The other issue encountered by Hjarding et al. 2014 is that the GBIF taxonomy for chameleons is out of date (2302 of 2304 GBIF-sourced records needed to be updated). Chameleons are a fairly small group, and it's not like there are hundreds of new species being discovered each year (see timeline in BioNames), 2006 was a bumper year with 12 new taxonomic names added. But there has been a lot of recent phylogenetic work which has clarified relationships, and as a result species get shuffled around different genera, resulting in a plethora of synonyms. GBIF's taxonomy has lagged behind current research, and also manages to horribly mangle the chameleon taxonomy is does have. For example, the genus Trioceros is not even placed within the chameleon family Chamaeleonidae but is simply listed as a reptile, which means anyone searching for data on the family Chamaeleonidae will all the Trioceros species.


The use case for this study seems one of the most basic that GBIF should be able to meet - given some distributions of organisms, compute an assessment of their conservation status. That GBIF-mobilised data is so patently not up to the task in this case is cause for concern.

However, I don't see this is simply a case of expert data set versus GBIF data, I think it's more complicated than that. A big issue here is data availability, and also the extent of data release (assuming that the AMNH is actively withholding geographic coordinates for some, if not most of its specimens). GBIF should be asking those museums that provide data why they've not made georeferenced data available, and if its because the museums simply haven't been able to do this, then how can it help this process? It should also be asking why museums which are part of GBIF haven't mobilised their herpetology data, and again, what can it do to help? Lastly, in an age of rapid taxonomic change driven by phylogenetic analysis, GBIF needs to overhaul the glacial pace at which it incorporates new taxonomic information.