Friday, September 11, 2020

Darwin Core Million reminder, and thoughts on bad data

The following is a guest post by Bob Mesibov.

No winner yet in the second Darwin Core Million for 2020, but there are another two and a half weeks to go (to 30 September). For details of the contest see this iPhylo blog post. And please don’t submit a million RECORDS, just (roughly) a million DATA ITEMS. That’s about 20,000 records with 50 fields in the table, or about 50,000 records with 20 fields, or something arithmetically similar.


The purpose of the Darwin Core Million is to celebrate high-quality occurrence datasets. These are extraordinarily rare in biodiversity informatics.

I’ll unpick that. I’m not talking about the accuracy of the records. For most records, the “what”, “where”, “when” and “by whom” are probably correct. An occurrence record is a simple fact: Wilma Flintstone collected a flowering specimen of an Arizona Mountain Dandelion 5 miles SSE of Walker, California on 27 June 2019. More technically, she collected Agoseris parviflora at 38.4411 –119.4393, as recorded by her handheld GPS.
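Mapped to Darwin Core terms, that record might look like the following. This is just a sketch: the term names are standard Darwin Core, but the basisOfRecord value and the exact field choices are my assumptions.

```python
# The Wilma Flintstone record expressed with Darwin Core term names.
# basisOfRecord is an assumed value; the rest comes from the example above.
record = {
    "scientificName": "Agoseris parviflora",
    "recordedBy": "Wilma Flintstone",
    "eventDate": "2019-06-27",
    "decimalLatitude": 38.4411,
    "decimalLongitude": -119.4393,
    "locality": "5 miles SSE of Walker, California",
    "basisOfRecord": "PreservedSpecimen",  # assumed
}
```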

What could possibly go wrong in compiling a dataset of simple records like that in a spreadsheet or database? Let me count a few of the ways:

  • data items get misspelled or misnumbered
  • data items get put in the wrong field
  • data items are put in a field for which they are invalid or inappropriate
  • data items that should be entered get left out
  • data items get truncated
  • data items contain information better split into separate fields
  • data items contain line breaks
  • data items get corrupted by copying down in a spreadsheet
  • data items disagree with other data items in the same record
  • data items refer to unexplained entities (“habitat type A”)
  • paired data items don’t get paired (e.g. latitude but no longitude)
  • the same data item appears in different formats in different records
  • missing data items are represented by blanks, spaces, “?”, “na”, “-”, “unknown”, “not recorded” etc, all in the same data table
  • character encoding failures create gibberish, question marks and replacement characters (�)
  • weird control characters appear in data items, and parsing fails
  • dates get messed up (looking at you, Excel)
  • records get duplicated after minor edits
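Many of the problems above are mechanically detectable. As a rough illustration (the field names, missing-value tokens and table layout are my assumptions, not anyone's production audit code), here is a Python sketch that flags a few of them in a table held as a list of dicts:

```python
import re

# Tokens commonly used to mean "no data" -- an illustrative, not exhaustive, set.
MISSING_TOKENS = {"", "?", "na", "-", "unknown", "not recorded"}
# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f]")

def audit(rows):
    """Return (row_index, field, problem) triples for a few common defects."""
    problems = []
    for i, row in enumerate(rows):
        for field, value in row.items():
            if value.strip().lower() in MISSING_TOKENS:
                problems.append((i, field, "missing-value token"))
            if "\n" in value or "\r" in value:
                problems.append((i, field, "embedded line break"))
            if CONTROL_CHARS.search(value):
                problems.append((i, field, "control character"))
            if "\ufffd" in value:
                problems.append((i, field, "replacement character"))
        # Paired items: latitude without longitude, or vice versa.
        has_lat = bool(row.get("decimalLatitude", "").strip())
        has_lon = bool(row.get("decimalLongitude", "").strip())
        if has_lat != has_lon:
            problems.append((i, "decimalLatitude/decimalLongitude",
                             "unpaired coordinates"))
    return problems
```

A real audit would cover far more than this, but even a small script like this catches defects that a visual scan of a spreadsheet reliably misses.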

In previous blog posts (here and here) I’ve looked at explanations for poor-quality data at the project, institution and agency level — data sources I referred to collectively as the “PIA”. I don’t think any of those explanations are controversial. Here I’m going to be rude and insulting and say there are three further obstacles to creating good, usable and shareable occurrence data:

Datasets are compiled as though they were family heirlooms.

The PIA says “This database is OUR property. It’s for OUR use and WE understand the data, even if it’s messy and outsiders can’t figure out what we’ve done. Ambiguities? No problem, we’ll just email Old Fred. He retired a few years back but he knows the system back to front.”

Prising data items from these heirlooms, mapping them to new fields and cleaning them are complicated exercises best left to data specialists. That’s not what happens.

Datasets are too often compiled by people with inadequate computer skills. Their last experience of data management was building a spreadsheet in a “digital learning” class. They’re following instructions but they don’t understand them. Both the data enterers and their instructors are hoping for a good result, which is truly courageous optimism.

The (often huge) skills gap between the compilers of digital PIA data and the computer-savvy people who analyse and reformat/repackage the data (users and facilitators-for-users) could be narrowed programmatically, but isn't. Hands up, all those who use a spreadsheet for data entry by volunteers and have comprehensive validation rules for each of the fields. Thought so.
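The kind of per-field validation I mean isn't exotic. A sketch in Python, with rules of my own invention (the field names follow Darwin Core; the rules themselves are illustrative assumptions, and a real form would enforce them at entry time):

```python
import re
from datetime import date

# Illustrative per-field rules: each maps a field name to a predicate.
RULES = {
    "decimalLatitude": lambda v: -90 <= float(v) <= 90,
    "decimalLongitude": lambda v: -180 <= float(v) <= 180,
    "eventDate": lambda v: date.fromisoformat(v) <= date.today(),
    # Genus, optionally followed by a lowercase epithet -- a crude check.
    "scientificName": lambda v: re.fullmatch(r"[A-Z][a-z]+( [a-z-]+)?", v) is not None,
}

def validate(record):
    """Return (field, reason) pairs for fields that are missing or fail a rule."""
    errors = []
    for field, rule in RULES.items():
        value = record.get(field)
        if value is None:
            errors.append((field, "missing"))
            continue
        try:
            ok = rule(value)
        except (ValueError, TypeError):
            ok = False
        if not ok:
            errors.append((field, "invalid"))
    return errors
```

A spreadsheet's built-in data validation can express some of this, but in my experience it is rarely set up at all.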

People confuse software with data. This isn’t a problem restricted to biodiversity informatics, and I’ve ranted about this issue elsewhere. The effect is that data compilers blame software for data problems and don’t accept responsibility for stuff-ups.

Sometimes that blaming is justified. As a data auditor I dread getting an Excel file, because I know without looking that the file will have usability and shareability issues on top of the usual spreadsheet errors. Excel isn’t an endpoint in a data-use pipeline, it’s a starting point and a particularly awful one.

Another horror is the export option. Want to convert your database of occurrence records to format X? Just go to the “Save as” or “Export data” menu item and click “OK”. Magic happens and you don’t need to check the exported file in format X to see that all is well. If all is not well, it’s the software’s fault, right? Not your problem.
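Checking an export doesn't have to be elaborate. Assuming the export is a CSV file, even a minimal round-trip sanity check like this sketch would catch truncated exports, dropped columns and ragged rows:

```python
import csv

def check_export(path, expected_rows, expected_fields, encoding="utf-8"):
    """Re-read an exported CSV and confirm its basic shape survived."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    assert len(header) == expected_fields, "column count changed"
    assert len(rows) == expected_rows, "row count changed"
    assert all(len(r) == expected_fields for r in rows), "ragged rows"
```

It verifies shape, not content, but "did all my records and fields survive?" is the question most exporters never ask.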

In view of these and the previously blogged-about explanations for bad data, it’s a wonder that there are any high-quality datasets, but there are. I’ve audited them and it’s a shame that for ethical reasons I can’t enter them myself in the current Darwin Core Million.