Friday, September 20, 2013

The quality of GBIF's taxonomic classification

In some recent posts I've been exploring the quality of GBIF's taxonomic data. I've done some further analyses and decided to write this up in something more than a blog post. I'm writing a draft which you can see on GitHub. It tackles just one issue, namely what happens when you combine taxonomic names from multiple sources and don't know that some of those names are synonyms. For example, below is a cluster map for mammal species names from the Catalogue of Life, Mammal Species of the World, and the IUCN Red List.

Mammal species
Each database has a set of names that it and it alone recognises, as well as names that two of the three agree on. Merging these three sets of names successfully requires knowing which are synonyms. As I've noted before, some synonyms have ended up in GBIF as separate names, which can mean users get a rather distorted view of what GBIF actually knows about a species.
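To make the merging problem concrete, here is a minimal Python sketch. It is not how GBIF or any of these databases actually merges names. The tiny name lists, the assignment of names to sources, and the `synonyms` table are all made up for illustration, and `merge_names` is a hypothetical helper. The three gibbon names are real synonyms for the hoolock gibbon, but which source uses which name here is an assumption.

```python
# Illustrative only: tiny, hypothetical name lists standing in for the
# three mammal checklists. Real lists hold thousands of names.
sources = {
    "CoL": {"Hylobates hoolock", "Panthera leo"},
    "MSW": {"Bunopithecus hoolock", "Panthera leo"},
    "IUCN": {"Hoolock hoolock", "Panthera leo"},
}

# Hypothetical synonym table: maps each synonym to the name treated as accepted.
synonyms = {
    "Hylobates hoolock": "Hoolock hoolock",
    "Bunopithecus hoolock": "Hoolock hoolock",
}


def merge_names(sources, synonyms=None):
    """Return a dict mapping each name to the set of sources that use it.

    If a synonym table is given, every name is first replaced by its
    accepted name, so synonyms collapse into a single entry.
    """
    synonyms = synonyms or {}
    merged = {}
    for source, names in sources.items():
        for name in names:
            accepted = synonyms.get(name, name)
            merged.setdefault(accepted, set()).add(source)
    return merged


naive = merge_names(sources)               # synonyms unknown
resolved = merge_names(sources, synonyms)  # synonyms known

print(len(naive), len(resolved))  # 4 2
print(sorted(resolved["Hoolock hoolock"]))  # ['CoL', 'IUCN', 'MSW']
```

Without the synonym table the merged list contains four "species", three of which are the same gibbon, each apparently known to only one source. With the table, the three names collapse into one entry that all three sources agree on.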

This issue doesn't just affect GBIF; projects like the Map of Life suffer from the same problem. The gibbon example I used earlier crops up again. I had to do three separate searches of Map of Life using the three different synonyms for the hoolock gibbon to get a complete picture of our knowledge of its distribution:

The multiplicity of names for the same taxon is one of the main challenges facing anyone wanting to integrate biodiversity data, and hence this taxonomy meme seems rather appropriate: