Wednesday, July 15, 2020

Darwin Core Million now twice a year

The following is a guest post by Bob Mesibov.

The first Darwin Core Million closed on 31 March with no winner. Since I'm seeing better datasets this year in my auditing work for Pensoft, I've decided to run the competition every six months.

Missed the first Darwin Core Million and don't know what it's about? Don't get too excited by the word "million". It refers to the number of data items in a Darwin Core occurrences table, not to the prize!

The rules

  • Anyone can enter, but the competition applies only to publicly available Darwin Core occurrence datasets. These might have been uploaded to an aggregator, such as GBIF, ALA or iDigBio, or to an open-data repository.
  • Select about one million data items from the dataset. That could be 50000 records in 20 populated Darwin Core fields, or 20000 records in 50 populated Darwin Core fields, or something in between. Email the dataset to me after 1 September and before 30 September as a zipped, plain-text file, together with a DOI or URL for the online version of the dataset.
  • I'll audit datasets in the order I receive them. If I can't find serious data quality problems (see below) in your dataset, I'll pay your institution AUD$150 and declare your institution the winner of the Darwin Core Million here on iPhylo. There's only one winner in each competition round; datasets received after the first problem-free dataset won't be checked.
  • If I find serious data quality problems, I'll let you know by email. If you want to learn what the problems are, I'll send you a report detailing what should be fixed and charge your institution AUD$150. At 0.3 to 0.75 cents per record (AUD$150 spread over 50000 or 20000 records), that's a bargain compared to commercial data-checking rates. And it would be really good to hear, later on, that those problems had indeed been fixed and that corrected data items had replaced the originals online.
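
The arithmetic behind "about one million data items" and the per-record cost can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the example, and treating a field as "populated" when at least one record has a non-blank entry in it is an assumption, not a rule of the competition.

```python
import csv
import io


def count_data_items(tsv_text):
    """Return (records, populated_fields, data_items) for tab-separated text.

    Assumption: a field is "populated" if at least one record has a
    non-blank entry in it, and data items = records x populated fields.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)
    populated = [False] * len(header)
    records = 0
    for row in reader:
        records += 1
        for i, value in enumerate(row[:len(header)]):
            if value.strip():
                populated[i] = True
    n_fields = sum(populated)
    return records, n_fields, records * n_fields


def cents_per_record(fee_aud, records):
    """Fee expressed in Australian cents per record."""
    return fee_aud * 100 / records


# Tiny made-up table: the "kingdom" column is entirely blank.
sample = "occurrenceID\tscientificName\tkingdom\nA1\tAus bus\t\nA2\tCus dus\t\n"
print(count_data_items(sample))       # (2, 2, 4)
print(cents_per_record(150, 50000))   # 0.3
print(cents_per_record(150, 20000))   # 0.75
```

Run on a real extract, the third number returned by `count_data_items` shows how close the selection is to one million.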

How the data are judged

For a list of data quality problems, see this page in my Data Cleaner's Cookbook. The key problems I look for are:

  • duplicate records
  • invalid data items
  • missing-but-expected items
  • data items in the wrong fields
  • data items inappropriate for their field
  • truncated data items
  • records with items in one field disagreeing with items in another
  • character encoding errors
  • wildly erroneous dates or coordinates
  • incorrect or inconsistent formatting of dates, names and other items
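
Two of the checks in this list are simple enough to sketch in Python. The code below is a rough illustration and not the audit procedure itself: it only finds exact whole-record duplicates and coordinates that are non-numeric or outside the possible range, and the function names and sample records are invented for the example.

```python
from collections import Counter


def exact_duplicates(records):
    """Return {record_tuple: count} for records that occur more than once."""
    counts = Counter(tuple(rec) for rec in records)
    return {rec: n for rec, n in counts.items() if n > 1}


def bad_coordinates(records, lat_idx, lon_idx):
    """Return indexes of records with non-numeric or out-of-range coordinates.

    Records with both coordinates blank are skipped, not flagged.
    """
    bad = []
    for i, rec in enumerate(records):
        lat, lon = rec[lat_idx].strip(), rec[lon_idx].strip()
        if not lat and not lon:
            continue
        try:
            ok = -90 <= float(lat) <= 90 and -180 <= float(lon) <= 180
        except ValueError:
            ok = False
        if not ok:
            bad.append(i)
    return bad


# Made-up records: scientificName, decimalLatitude, decimalLongitude
rows = [
    ["Aus bus", "-41.5", "146.2"],
    ["Aus bus", "-41.5", "146.2"],   # exact duplicate of the first record
    ["Cus dus", "-415", "146.2"],    # latitude missing its decimal point
    ["Eus fus", "", ""],             # no coordinates given
]
print(exact_duplicates(rows))        # {('Aus bus', '-41.5', '146.2'): 2}
print(bad_coordinates(rows, 1, 2))   # [2]
```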

This is not just nit-picking. Your digital data items aren't mainly for humans to read and interpret; they're intended in the first place for parsing and managing by computers. "Western Hill" might not be the same as "Western Hill" in processing, for example, because the second placename might have a no-break space between the words instead of a plain space. Another example: humans see these 22 variations on collector names as "the same", but computers don't.
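
The no-break space case is easy to reproduce. The Python sketch below compares the two spellings of the placename and shows one possible way to normalise whitespace; the `normalise_spaces` helper is a hypothetical name used only for this illustration.

```python
import unicodedata

plain = "Western Hill"        # ordinary space (U+0020)
nbsp = "Western\u00a0Hill"    # no-break space (U+00A0)

# The two strings look alike on screen but are different to a computer.
print(plain == nbsp)          # False

# Name any whitespace character that is not a plain space.
odd = [unicodedata.name(c) for c in nbsp if c.isspace() and c != " "]
print(odd)                    # ['NO-BREAK SPACE']


def normalise_spaces(text):
    """Replace every run of Unicode whitespace with one plain space."""
    return " ".join(text.split())


print(normalise_spaces(nbsp) == plain)   # True
```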

Please also note that data quality isn't the same as data accuracy. Is Western Hill really at those coordinates? Is the specimen ID correct? Is the barely legible collector name on the specimen label correctly interpreted? These are questions about data accuracy. But it's possible to have entirely correct digital data that can't be processed by an application, or moved between applications, because the data suffer from one or more of the problems listed above.

Fine points

I think I'm pretty reasonable about the "serious" in "serious data quality problems". One character encoding error, such as "L'H?rit" repeated in the "scientificNameAuthorship" field, isn't serious, but multiple errors scattered through several fields are grounds for rejection.
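
One way to see whether encoding damage is confined to a single field or scattered through several is to count suspect items per field. The Python sketch below is a heuristic only: it assumes damage shows up as a question mark between two letters (as in "L'H?rit") or as the Unicode replacement character, and the function name and sample records are invented for the example.

```python
import re
from collections import Counter

# Heuristic: U+FFFD, or "?" with a letter on both sides.
SUSPECT = re.compile(r"\ufffd|(?<=[^\W\d_])\?(?=[^\W\d_])")


def suspect_encoding_by_field(records):
    """Count, per field, the records whose item matches the heuristic.

    records: list of dicts mapping field name to a string value.
    """
    counts = Counter()
    for rec in records:
        for field, value in rec.items():
            if SUSPECT.search(value):
                counts[field] += 1
    return dict(counts)


# Made-up records; the second locality ends in a legitimate question mark.
recs = [
    {"scientificNameAuthorship": "L'H?rit", "locality": "Western Hill"},
    {"scientificNameAuthorship": "L'H?rit", "locality": "Western Hill?"},
]
print(suspect_encoding_by_field(recs))   # {'scientificNameAuthorship': 2}
```

A result with one field, as here, matches the "isn't serious" case; many fields in the result would be the scattered pattern.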

For an understanding of "invalid", please refer to the Darwin Core field definitions and recommendations.

"Missing-but-expected" is important. I've seen GBIF mis-match a scientific name because the Darwin Core "kingdom" field was left blank by the data provider, even though all the other higher-taxon fields were filled in.
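
The blank "kingdom" case can be checked mechanically. The Python sketch below flags records where "kingdom" is empty but at least one other higher-taxon field is filled in. The choice of fields to compare against and the function name are assumptions made for this illustration; the field names themselves are Darwin Core terms.

```python
# Darwin Core higher-taxon fields other than "kingdom" (an illustrative subset).
HIGHER_TAXON_FIELDS = ["phylum", "class", "order", "family"]


def missing_kingdom(records):
    """Return indexes of records with a blank kingdom but other higher taxa.

    records: list of dicts mapping Darwin Core field name to a string value.
    """
    flagged = []
    for i, rec in enumerate(records):
        kingdom_blank = not rec.get("kingdom", "").strip()
        has_other = any(rec.get(f, "").strip() for f in HIGHER_TAXON_FIELDS)
        if kingdom_blank and has_other:
            flagged.append(i)
    return flagged


# Made-up records for the example.
taxa = [
    {"kingdom": "Animalia", "phylum": "Arthropoda", "class": "Diplopoda"},
    {"kingdom": "", "phylum": "Arthropoda", "class": "Diplopoda"},
    {"kingdom": "", "phylum": "", "class": ""},
]
print(missing_kingdom(taxa))   # [1]
```

The third record is not flagged because nothing about it says a kingdom was expected; whether a fully blank classification is itself a problem is a separate question.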

Please remember, entries received before 1 September won't be audited.