Friday, November 01, 2013

Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite

This is a quick sketch of a way to combine existing tools to help clean and annotate data in GBIF, particularly (but not exclusively) occurrence data.


The data provider puts a Darwin Core Archive (expanded, not zipped) into a GitHub repository. GBIF forks the repository, cleans the data, and uploads that to GBIF to populate the database behind the portal.


When GBIF first loads the repository it assigns it a DOI (using, say, DataCite). Actually we assign two DOIs: one for this version of the data (e.g., 10.1234/data.v1) and one for all versions of the data, say 10.1234/data. The data is considered to be published; authorship is determined by the provider, which may be an individual, a project, an institution, etc.
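To make the two-DOI idea concrete, here is a minimal sketch in Python. It assumes the naming pattern used in the examples above (a ".vN" suffix on the overall DOI). That pattern is only an illustration, not something DataCite requires, and the function names are made up for this post.

```python
def version_doi(overall_doi: str, version: int) -> str:
    """Build the DOI for one version, e.g. 10.1234/data -> 10.1234/data.v1."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    return f"{overall_doi}.v{version}"


def overall_doi(versioned_doi: str) -> str:
    """Recover the all-versions DOI from a versioned one (assumes the .vN suffix)."""
    base, sep, suffix = versioned_doi.rpartition(".v")
    if not sep or not suffix.isdigit():
        raise ValueError(f"not a versioned DOI: {versioned_doi}")
    return base


first = version_doi("10.1234/data", 1)
print(first, "is a version of", overall_doi(first))
```

Keeping the mapping mechanical means someone holding a versioned DOI can always find the DOI that covers every version, and vice versa.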

Big scale annotation and cleaning

Anyone familiar with GitHub can fork the repository of data and do their own cleaning (e.g., fixing dates, latitudes and longitudes, links to taxon names, etc.).

Small scale, casual annotation

Anyone visiting the GBIF portal and noticing an error (or something that they want to comment on) does so on the portal. Behind the scenes these comments are stored as issues on the GBIF repository in GitHub. To do this GBIF can either (a) enable users with an existing GitHub account to link that to their GBIF user account, or (b) create a GitHub account for the user. The user need not actually interact directly with GitHub (a similar approach is described by Mark Holder for the social curation of phylogenetic studies).

This means all annotation, big or small, is in the open and on GitHub. There is very little programming to do: GBIF simply talks to GitHub using GitHub's API. GBIF could display known "issues" for a dataset, so portal users immediately know if any data has been flagged as problematic.
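As a rough illustration of how little glue is needed, the sketch below turns a portal comment into a request for GitHub's "create an issue" endpoint (POST /repos/{owner}/{repo}/issues). The repository name, token, and label are hypothetical, and the request is only built here, not sent.

```python
import json
from urllib.request import Request

GITHUB_API = "https://api.github.com"


def build_issue_request(owner, repo, token, title, comment, labels=()):
    """Build (but do not send) a request that files a comment as a GitHub issue."""
    payload = {"title": title, "body": comment, "labels": list(labels)}
    return Request(
        url=f"{GITHUB_API}/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # The token would belong to the GitHub account linked to the GBIF user.
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical repository and comment, for illustration only.
req = build_issue_request(
    "gbif", "some-dataset", "SECRET-TOKEN",
    "Coordinates look swapped",
    "Latitude and longitude appear to be reversed for this record.",
    labels=["annotation"],
)
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen; listing the known issues for a dataset is a GET on the same URL.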

All the annotations belong to the "community", in the sense that each annotation is linked to a GitHub user (even if the user might not ever actually go to GitHub). This also means that the provider can, at any point, pull in those annotations so they can update their own data (and hence gain direct benefit from exposing it in the first place).


When GBIF decides that enough annotations have been made and resolved, the latest version of the repository is loaded into GBIF and gets a new DOI (e.g., 10.1234/data.v2). This means an analysis based on that version is citable. We add a link to the overall DOI so someone who doesn't care about versions can still cite the data.

Authorship and credit

Now we come to the fun part. The revision will include the input from a bunch of people. This will be recorded on GitHub, but that will only mean something to the handful of geeks who think GitHub is awesome. But, let's imagine that we do the following:

  1. Anyone with a GBIF account can link that to their ORCID (if you are a researcher you really should have one of these).
  2. Anyone contributing to this version of the repository gets authorship (appended to the end of the list, so the original provider is first author).
  3. GBIF uses the ORCID API to automatically load the DOI of the new version of the dataset onto the list of works for each contributor. They instantly get credit as a co-author of a citable dataset, and this appears on their ORCID profile.
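The authorship rule in the steps above can be sketched in a few lines of Python. The Person class and function names are invented for this post, and the ORCID iD is a placeholder. The actual call to the ORCID API is left out, since it depends on the API version and on the permissions each user has granted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    name: str
    orcid: Optional[str] = None  # None if the GBIF account has no linked ORCID


def author_list(provider: Person, contributors: List[Person]) -> List[Person]:
    """Original provider first, then each contributor once, in order of appearance."""
    authors = [provider]
    for person in contributors:
        if person not in authors:
            authors.append(person)
    return authors


def orcids_to_credit(authors: List[Person]) -> List[str]:
    """ORCID iDs whose list of works should receive the new version's DOI."""
    return [p.orcid for p in authors if p.orcid is not None]


provider = Person("Some Museum")
ann = Person("Ann Annotator", "0000-0002-1825-0097")  # placeholder iD
bob = Person("Bob Browser")

authors = author_list(provider, [ann, bob, ann])
print([p.name for p in authors])
print(orcids_to_credit(authors))
```

Contributors without a linked ORCID still appear in the author list; they just don't get the automatic update to a profile.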


This approach has a number of benefits:
  1. It creates citable data
  2. It gives credit in a way many people will recognise (authorship of a citable work that has a DOI)
  3. The annotations are freely available, there is a complete version history, and anyone can contribute at whatever scale suits them.
  4. Anyone can grab the repo at any time and load it into their own system, including the original provider, who can see what people have added to their original data.
  5. There is virtually no programming to do and no new domain-specific protocols; everything is pretty much in place. GitHub does versioning, DataCite does citable identifiers, ORCID handles identity and credit.


There are a couple of potential issues. Darwin Core Archive data files can be large, and GitHub can be less effective with large files (although it is ideally suited to the delimited-text files that Darwin Core Archive uses, see Git (and Github) for Data). One approach is to impose a limit on the size of an individual "occurrence.txt" file in the archive, so we may have multiple files, none of which is too big. Another task will be linking issues to specific occurrences (if they concern just one occurrence), since the GitHub issues will be at the level of the complete file. This could be handled in a form-based interface on GBIF that sent the occurrenceID as part of the issue report.
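Here is a minimal sketch of the size-limit idea in Python. It assumes the data file has a single header row, which is repeated at the top of every chunk so each piece can be read on its own. The function name and the byte limit are made up for illustration.

```python
def split_delimited(lines, max_bytes):
    """Yield lists of lines; each starts with the header and stays near max_bytes.

    A chunk always holds at least one data row, so a single very long row
    can push a chunk over the limit.
    """
    rows = iter(lines)
    header = next(rows, None)
    if header is None:
        return  # empty input, nothing to split
    header_size = len(header.encode("utf-8"))
    chunk, size = [header], header_size
    for row in rows:
        row_size = len(row.encode("utf-8"))
        # Start a new chunk if this row would overflow a chunk that has data.
        if len(chunk) > 1 and size + row_size > max_bytes:
            yield chunk
            chunk, size = [header], header_size
        chunk.append(row)
        size += row_size
    if len(chunk) > 1:
        yield chunk


sample = [
    "occurrenceID\tscientificName\n",
    "1\tA\n",
    "2\tB\n",
    "3\tC\n",
]
for i, part in enumerate(split_delimited(sample, max_bytes=36), start=1):
    print(f"occurrence_{i}.txt has {len(part) - 1} data rows")
```

Because each piece keeps its header, a diff on any one file stays readable, and an issue that mentions an occurrenceID can be traced to the file that contains it with a simple text search.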


The key point of this proposal is that everything is in place already to do this. The ducks are lining up, and serious, credible projects are handling the things we need (versioning, identifiers, credit). Sometimes the smart thing is to do nothing and wait until someone else solves the problems you face. I think the waiting may be over.