Tuesday, September 23, 2014

Exploring the chameleon dataset: broken GBIF links and lack of georeferencing

Following on from the discussion of the African chameleon data, I've started to explore Angelique Hjarding's data in more detail. The data is available from figshare (doi:10.6084/m9.figshare.1141858), so I've grabbed a copy and put it in github. Several things are immediately apparent.

  1. There is a lot of ungeoreferenced data. With a little work this could be geotagged and hence placed on a map.
  2. There are some errors with the georeferenced data (chameleons in Soutb America or off the coast, a locality in Tanzania that is now in Ethiopia, etc.).
  3. Rather alarmingly, most of the URLs to GBIF records that Angelique gives in the dataset no longer resolve.

The last point is worrying, and reflects the fact that at present you can't trust GBIF occurrence URLs to be stable over time. Most of the specimens in Angelique's data are probably still in GBIF, but the GBIF occurrenceID (and hence URL) will have changed. This pretty much kills any notion of reproducibility, and it will require some fussing to be able to find the new URLs for these records.

That the GBIF occurrenceIDs are no longer valid also makes it very difficult to make use of any data cleaning I or anyone else attempts with this data. If I georeference some of the specimens, I can't simply tell GBIF that I've got improved data. Nor is it obvious how I would give this information to the original providers using, say VertNet's github repositories. All in all a mess, and a sad reflection on our inability to have persistent identifiers for occurrences.

To help explore the data I've created some GeoJSON files to get a sense of the distribution of the data. Here are the point localities, a few have clearly got issues.



I also drew some polygons around points for the same taxon, to get a sense of their distributions.

Taxa represent by less than three distinct localities are presented by place marker, the rest by polygons.

I'll keep playing with this data as time allows, and try to get a sense of how hard it would be to go from what GBIF provides to what is actually going to be useful.