Wednesday, November 06, 2013

GBIF and GitHub: fixing broken Darwin Core Archives

Following on from "Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite", here's a quick and dirty example of using GitHub to help clean up a Darwin Core Archive.

The dataset 3i - Cicadellinae Database has 2,152 species and 4,749 taxa, but GBIF says it has no georeferenced data. As a result, the map for this dataset looks like this:

[Image: GBIF map for the 3i dataset]

I downloaded the Darwin Core Archive and was puzzled, because the occurrence.txt file in the archive has latitude and longitude pairs for some of the records. How come there is no map? After a bit of fussing I discovered that the meta.xml file that describes the data is broken. It lists a column that doesn't appear in the data file, so every column after it is shifted along by one, and the headings for latitude and longitude no longer line up with the data.
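This kind of mismatch can be caught mechanically by comparing the number of columns that meta.xml declares with the number actually present in a data row. Here is a minimal sketch in Python, assuming a tab-delimited core file and the standard Darwin Core text namespace. The function names and the sample metadata are made up for illustration and are not taken from the 3i archive.

```python
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive meta.xml descriptors.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def declared_columns(meta_xml: str) -> dict:
    """Map each column index declared for the core file to a short name."""
    root = ET.fromstring(meta_xml)
    core = root.find(f"{DWC_TEXT_NS}core")
    if core is None:
        raise ValueError("meta.xml has no <core> element")
    columns = {}
    id_element = core.find(f"{DWC_TEXT_NS}id")
    if id_element is not None and id_element.get("index") is not None:
        columns[int(id_element.get("index"))] = "id"
    for field in core.findall(f"{DWC_TEXT_NS}field"):
        index = field.get("index")
        if index is None:
            continue  # a field with only a default value has no column
        term = field.get("term", "")
        columns[int(index)] = term.rsplit("/", 1)[-1]
    return columns


def surplus_columns(meta_xml: str, data_row: str, delimiter: str = "\t") -> int:
    """Return declared column count minus the columns found in data_row.

    A positive result means meta.xml describes columns the data lacks.
    """
    declared = max(declared_columns(meta_xml)) + 1
    actual = len(data_row.rstrip("\r\n").split(delimiter))
    return declared - actual


# Hypothetical example: "locality" is declared but missing from the data.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
  </core>
</archive>"""
SAMPLE_ROW = "1\tAus bus\t-12.5\t130.8"
```

A non-zero result from surplus_columns flags the archive for a closer look. It doesn't say which column is the culprit, so the descriptor still has to be checked by eye against the data.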

So, I loaded the Darwin Core Archive into GitHub (you can see it here), then fixed the error, and then for fun extracted the latitude and longitude pairs as a GeoJSON file, which GitHub can display on a map.
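The extraction step can be sketched in a few lines of Python. This is an illustration rather than the script actually used: it assumes the occurrence records have already been read into dictionaries keyed by the Darwin Core terms decimalLatitude and decimalLongitude, and the function name is hypothetical. Note that GeoJSON puts longitude before latitude in each coordinate pair.

```python
import json


def records_to_geojson(records):
    """Build a GeoJSON FeatureCollection from georeferenced records.

    Records with missing, blank, or out-of-range coordinates are skipped.
    """
    features = []
    for record in records:
        try:
            lat = float(record["decimalLatitude"])
            lon = float(record["decimalLongitude"])
        except (KeyError, TypeError, ValueError):
            continue  # not georeferenced
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # implausible values, possibly misaligned columns
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {
                "scientificName": record.get("scientificName", ""),
            },
        })
    return {"type": "FeatureCollection", "features": features}


def geojson_text(records):
    """Serialise the FeatureCollection, ready to save as a .geojson file."""
    return json.dumps(records_to_geojson(records), indent=2)
```

Saving the output of geojson_text with a .geojson extension and committing it to the repository should be enough for GitHub to draw the points on a map.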

Note that we now have a fairly extensive set of georeferenced data points for these insects, and this data hasn't made it onto a GBIF map because of a simple error in the metadata. I keep finding cases like this, which suggests that GBIF has more georeferenced data than it realises.