Tuesday, September 11, 2018

Guest post - Quality paralysis: a biodiversity data disease

Bob mesibovThe following is a guest post by Bob Mesibov.

In 2005, GBIF released Arthur Chapman's Principles of Data Quality and Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data as freely available electronic publications. Their impact on museums and herbaria has been minimal. The quality of digitised collection data worldwide, to judge from the samples I've audited (see disclaimer below), varies in 2018 from mostly OK to pretty awful. Data issues include:

  • duplicate records
  • records with data items in the wrong fields
  • records with data items inappropriate for a given field (includes Chapman's "domain schizophrenia")
  • records with truncated data items
  • records with items in one field disagreeing with items in another
  • character encoding errors and mojibake
  • wildly erroneous dates and spatial coordinates
  • internally inconsistent formatting of dates, names and other data items (e.g. 48 variations on "sea level" in a single set of records)

In a previous guest post I listed 10 explanations for the persistence of messy data. I'd gathered the explanations from curators, collection managers and programmers involved with biodiversity data projects. I missed out some key reasons for poor data quality, which I'll outline in this post. For inspiration I'm grateful to Rod Page and to participants in lively discussions about data quality at the SPNHC/TDWG conference in Dunedin this August.

  1. Our institution, like all natural history collections these days, isn't getting the curatorial funding it used to get, but our staff's workload keeps going up. Institution staff are flat out just keeping their museums and herbaria running on the rails. Staff might like to upgrade data quality, but as one curator wrote to me recently, "I simply don't have the resources necessary."
  2. We've been funded to get our collections digitised and/or online, but there's nothing in the budget for upgrading data quality. The first priority is to get the data out there. It would be nice to get follow-up funding for data cleaning, but staff aren't hopeful. The digitisation funder doesn't seem to think it's important, or thinks that staff can deal with data quality issues later, when the digitisation is done.
  3. There's no such thing as a Curator of Data at our institution. Collection curators and managers are busy adding records to the collection database, and IT personnel are busy with database mechanics. The missing link is someone on staff who manages database content. The bigger the database, the greater the need for a data curator, but the usual institutional response is "Get the collections people and the IT people together. They'll work something out."
  4. Aggregators act too much like neutrals. We're mobilising our data through an aggregator, but there are no penalties if we upload poor-quality data, and no rewards if we upload high-quality data. Our aggregator has a limited set of quality tests on selected data fields and adds flags to individual records that have certain kinds of problems. The flags seem to be mainly designed for users of our data. We don't have the (time/personnel/skills) to act on this "feedback" (or to read those 2005 GBIF reports).

There's a 15th explanation that overlaps the other 14 and Rod Page has expressed it very clearly: there's simply no incentive for anyone to clean data.

  • Museums and herbaria don't get rewards, kudos, more visitors, more funding or more publicity if staff improve the quality of their collection data, and they don't get punishments, opprobrium, fewer visitors, reduced funding or less publicity if the data remain messy.
  • Aggregators likewise. Aggregators also don't suffer when they downgrade the quality of the data they're provided with.
  • Users might in future get some reputational benefit from alerting museums and herbaria to data problems, through an "annotation system" being considered by TDWG. However, if users clean datasets for their own use, they get no reward for passing blocks of cleaned data to overworked museum and herbarium staff, or to aggregators, or to the public through "alternative" published data versions.

With the 15 explanations in mind, we can confidently expect collection data quality to remain "mostly OK to pretty awful" for the foreseeable future. Data may be upgraded incrementally as loans go out and come back in, and as curators, collection managers and researchers compare physical holdings one-by-one with their digital representations. Unfortunately, the improvements are likely to be overwhelmed by the addition of new, low-quality records. Very few collection databases have adequate validation-on-entry filters, and staff don't have time for, or assistance with checking. Or a good enough reason to check.

"Quality paralysis" is endemic in museums and herbaria and seems likely to be with us for a long time to come.

DISCLAIMER: Believe it or not, this post isn't an advertisement for my data auditing services.

I began auditing collection data in 2012 for my own purposes and over the next few years I offered free data auditing to a number of institutions in Australia and elsewhere. There were no takers.

In 2017 I entered into a commercial arrangement with Pensoft Publishers to audit the datasets associated with data papers in Pensoft journals, as a free Pensoft service to authors. Some of these datasets are based on collections data, but when auditing I don't deal with the originating institutions directly.

I continue to audit publicly available museum and herbarium data in search of raw material for my website A Data Cleaner's Cookbook and its companion blog BASHing data. I also offer free training in data auditing and cleaning.

Monday, August 20, 2018

GBIF Challenge Entry: Ozymandias

I've submitted an entry for the 2018 GBIF Ebbe Nielsen Challenge. It's a couple of weeks before the deadline but I will be away then so have decided to submit early.

My entry is Ozymandias - a biodiversity knowledge graph. The name is a play on "Oz" being nickname for Australia (much of the data for the entry comes from Australia), and Ozymandias, which is a poem about hubris, and attempting to link biodiversity data requires a certain degree of hubris.

The submission process for the challenge is unfortunately rather opaque compared to previous years when entries were visible to all, so participants could see what other people were submitting, and also knew the identity of the judges, etc. In the spirit of openness here is my video summarising my entry:

Ozymandias - GBIF Challenge Entry from Roderic Page on Vimeo.

There is also a background document here: https://docs.google.com/presentation/d/1UglxaL-yjXsvgwn06AdBbnq-HaT7mO4H5WXejzsb9MY/edit?usp=sharing.

I suspect this entry is not at all what the challenge is looking for, but I've used the challenge as a deadline so that I get something out the door rather than endlessly tweaking a project that only I can see. There will, of course, be endless tweaking as I explore further ways to link data, but at least this way there is something people can look at. Now, I need to spend some time writing up the project, which will require yet more self discipline to avoid the endless tweaking.

Friday, August 17, 2018

Ozymandias demo

I've made a video walkthrough of Ozymandias, which I described in this post. It's a bit, um, long, so I'll need to come up with a shorter version.

Ozymandias - a biodiversity knowledge graph from Roderic Page on Vimeo.

Friday, August 10, 2018

Ozymandias: a biodiversity knowledge graph of Australian taxa and taxonomic publications

In the spirit of release early and release often, here is the first workable version of a biodiversity knowledge graph that I've been working on for Australian animals (for some background on knowledge graphs see Towards a biodiversity knowledge graph now in RIO). The core of this knowledge graph is a classification of animals from the Atlas of Living Australia (ALA) combined with data on taxonomic names and publications from the Australian Faunal Directory (AFD). This has been enhanced by adding lots of digital identifiers (such as DOIs) to the publications and, where possible, full text either as PDFs or as page scans from the Biodiversity Heritage Library (BHL) (provided via BioStor). Identifiers enable us to further grow the knowledge graph, for example by adding "cites" and "cited by" links between publications (data from CrossRef), and displaying figures from the Biodiversity Literature Repository (BLR).

The demo is here: https://ozymandias-demo.herokuapp.com/ If you’re looking for starting points, you could try:

Assassin spiders (images from Plazi and citation data from CrossRef) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/64908f75-456b-4da8-a82b-c569b4806c22

Screenshot 2018 08 10 17 44

Memoirs of Museum Victoria (dynamic query finds record in Wikidata and adds map) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/5c22a8d1-7456-4f8c-9384-1246ecbf15a6

Screenshot 2018 08 10 17 47

G. R. Allen (we can from the taxonomic tree of his top 20 taxa that he studies fish - who knew?) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/%23creator/g-r-allen

Screenshot 2018 08 10 17 47

Paper on mosquito taxonomy with lots of citations, including material in BHL/BioStor https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/578d1dec-5816-49ec-8916-3f957fd230f5

Screenshot 2018 08 10 17 47

Paper on Australian flies with full text in BioStor https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/0ffe4f28-b8ac-4132-be34-19eb03fbf685

Screenshot 2018 08 10 17 59

The focus for now is on taxa, publications, journals, and people. Occurrences and sequences are on the “to do” list. As always there’s lots of data cleaning and cross linking to do, but an obvious next step is to link people’s names to identifiers such as ORCID and Wikidata ids, so that we can trace the activities of taxonomists as they discover and describe Australian biodiversity (the choice of Australia is simply to keep things manageable, and because the amount of data and digitisation they’ve done is pretty extraordinary). I’m also working to a deadline as I'm trying to get this demo wrapped up in the next couple of weeks.

Technical details

TL;DR the knowledge graph is implemented as a triple store where the data has been represented using a small number of vocabularies (mostly schema.org with some terms borrowed from TAXREF-LD and the TDWG LSID vocabularies). All results displayed in the first two panels are the result of SPARQL queries, the content in the rightmost panel comes from calls to external APIs. Search is implemented using Elasticsearch. If you are feeling brave you can query the knowledge graph directly in SPARQL. I’m constantly tweaking things and adding data and identifiers, so things are likely to break. More details and documentation will be going up on the GitHub repository.

Friday, July 20, 2018

Signals from Singapore: NGS barcoding, generous interfaces, the return of faunas, and taxonomic burden

Supertree Grove Gardens by the Bay Singapore 20120630 04 Earlier this year I stopped over in Singapore, home of the spectacular "supertrees" in the Garden by the Bay. The trip was a holiday, but I spent a good part of one day visiting Rudolf Meier's group at the National University of Singapore. Chatting with Rudolf was great fun, he's opinionated and not afraid to share those opinions with anyone who will listen. Belatedly I've finally written up some of the topics we discussed.

Massively scalable and cheap DNA barcoding

Singapore has a rich fauna in a small area, full of undescribed species, so DNA barcoding seems an obvious way to get a handle on its biodiversity. Rudolf has been working towards scalable and cheap barcoding, e.g. $1 DNA barcodes for reconstructing complex phenomes and finding rare species in specimen‐rich samples https://doi.org/10.1111/cla.12115 . His lab can sequence short (~300 bp) barcode sequences for around $US 0.50 per specimen. Their pipeline generates lots of data, accompanied by high quality photographs of exemplar specimens, which contribute to The Biodiversity of Singapore, a "Digital Reference Collection for Singapore's Biodiversity". This site provides a simple but visually striking way to explore Singapore's biota, and is a nice example of what Mitchell Whitelaw calls "generous interfaces". We could do with more of these for biodiversity data.

Screenshot 2018 07 20 05 01

One nice feature of regular COI DNA barcodes is that they are comparable across labs because everyone is sequencing the same stretch of DNA. With short barcodes, different groups may target different regions of the COI gene, resulting in sequences that can't be compared. For example, the 127bp mini barcodes developed in A universal DNA mini-barcode for biodiversity analysis https://doi.org/10.1186/1471-2164-9-214 are completely disjoint from the ~300bp sequenced by Meier's group (I'm trying to keep track of some of these short barcodes here: https://gist.github.com/rdmpage/4f2545eeea4756565925fb4307d9af6b.

The return of regional faunas

In the "old days" of colonial expansion it was common for taxonomists to write volume entitled "The Fauna of [insert colonised country here]". These were regional works focussing on a particular area, often motivated by trying to catalogue animals of potential economic or medical importance, as well as of scientific interest. By limiting their geographic scope, faunal treatments of taxa can sometimes be inadequate. Descriptions of new species from a particular area may be hard to compare with descriptions of species in the same group that occur elsewhere and are described by other taxonomists. It may be that to do the taxonomy of a particular group well you need to treat that group throughout its geographic range, rather then just those species in your geographic area. Hence faunas loose their scientific appeal, despite the attractiveness of having a detailed summary of the fauna of a particular area. DNA sequencing circumvents this problem by having a universally comparable character. You can sequence everything within a geographic region, but those sequences will be directly comparable to sequences found elsewhere. Barcoding makes faunas attractive again, which may help funding taxonomic research because it makes funding projects with a restricted national scope scientifically still worthwhile.

Taxonomic burden and legacy names

As we discover and catalogue more and more of the planet's biodiversity we want to stick names on that biodiversity, and this can be a significant challenge when there is a taxonomic legacy of names that are so poorly described it is hard to establish how they relate to the material we are working with. Even if you have access to the primary literature through digitisation projects like BHL, if the descriptions are poor, if the types are lost or their identity is confused (see for example A New Species of Megaselia Rondani (Diptera: Phoridae) from the Bioscan Project in Los Angeles, California, with Clarification of Confused Type Series for Two Other Species https://doi.org/10.4289/0013-8797.118.1.93 by Emily A. Hartop - who I met on this trip - and colleagues), or can't be sequenced, then these names will remain ambiguous, and potentially clogging up efforts to name the unnamed species. One approach favoured by Rudolf is to effectively wipe the slate clean, declare all ambiguous names before a certain date to be null and void, and start again. This renders (or rather, resets) the notion of priority - given two names for the same species the older name is the one to use - and so is likely to be a hard sell, but it is part of the ongoing discussion about the impact of molecular data on naming taxa. Similar discussions are raging at the moment in mycology, e.g. Ten reasons why a sequence-based nomenclature is not useful for fungi anytime soon https://doi.org/10.5598/imafungus.2018.09.01.11, yet a another reflection of how much taxonomy is driven by technology.

Thursday, July 05, 2018

GBIF at 1 billion - what's next?

GBIF has reached 1 billion occurrences which is, of course, something to celebrate:

An achievement on this scale represents a lot of work by many people over many years, years spent developing simple standards for sharing data, agreeing that sharing is a good thing in the first place, tools to enable sharing, and a place to aggregate all that shared data (GBIF).

So, I asked a question:

My point is not to do this:

Rather it is to encourage a discussion about what happens when we have large amounts of biodiversity data. Is it the case that as we add data we simply enable more of the same kind of science, only better (e.g., more data for species distribution modelling), or do we reach a point where new things become possible?


To give a concrete example, consider iNaturalist. This started out as a Masters project to collect photos of organisms on Flickr. As you add more images you get better coverage of biodiversity, but you still have essentially a bunch of pictures. But once you have LOTS of pictures, and those are labelled with species names, you reach the point where it is possible to do something much more exciting - automatic species identification. To illustrate, I recently took the photos below:

Large2 Large

Note the reddish tubular growths on the leaves. I asked iNaturalist to identify these photos and within a few seconds it came back with Eriophyes tiliae, the Red Nail Gall Mite. This feels like magic. It doesn't rely on complicated analysis of the image (as many earlier efforts at automated identification have done) it simply "knows" that images that look like this are typically of the galls of this mite because it has seen many such images before. (Another example of the impact of big data is Google Translate, initially based on parsing lots of examples of the same text in multiple languages.)

The "1 billion" number is not, by itself, meaningful. It's rather that I hope that while we're popping the champagne and celebrating a welcome, if somewhat arbitrary milestone, I'm hoping that someone, somewhere is thinking about whether biodiversity data on this scale enables something new.

Do I have answers? Not really, but here's one fairly small-scale example. One of the big challenges facing GBIF is getting georeferenced data. We spend a lot of time using a variety of tools and databases to convert text descriptions one collection localities into latitude and longitude. Many of these descriptions include phrases such as "5 mi NW of" and so we've developed parsers to attempt to make sense of these. All of these phrases and the corresponding latitude and longitude coordinates have ended up in GBIF. Now, this raises the possibility that after a point, pretty much any locality phrase will be in GBIF, so a way to georeference a locality is simply to search GBIF for that locality and use the associated latitude and longitude. GBIF itself becomes the single best tool to georeference specimen data. To explore this idea I've built a simple tool on glitch https://lyrical-money.glitch.me that takes a locality description and geocodes it using GBIF.

Screenshot 2018 07 05 07 32

You paste in a locality string and it attempt to find that on a map based on data in GBIF. This could be automated, so you could imagine being able to georeference whole collections as part of the process of uploading the data to GBIF. Yes, the devil is in the details, and we'd need ways to flag errors or doubtful records, but the scale of GBIF starts of open up possibilities like this.

So, my question is, "what's next?".

Wednesday, June 13, 2018

Liberating links between datasets using lightweight data publishing: an example using IPNI and the taxonomic literature

Ipni logo I've written a short paper entitled "Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature" (phew) and put a preprint on bioRxiv (https://doi.org/10.1101/343996) while I figure out where to publish it. Here's the abstract:

Constructing a biodiversity knowledge graph will require making millions of cross links between diversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a "datasette", a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names, and identifiers for the taxonomic articles that published those names.

In some ways the paper is simply a record of me trying to figure out how to publish a project that I've been working on for several years, namely linking names from BioNames. The preprint discusses various options, before settling on "datasettes", which is a nice method developed by Simon Willison (@simonw) to wrap up simple databases with their own web server and query API and make them accessible on the web. These can run on a local machine, or be packaged up as a Docker container, which is what I've done. You play with the database here: https://ipni.sloppy.zone. If this link is offline, then you can grab the container here https://hub.docker.com/r/rdmpage/ipni/ and run it yourself. If, like me, you're new to Docker, then I recommend grabbing a copy of Kitematic.

The datasette interface is simple but gives you lots of freedom to explore the data.


For example, you have ability to query the data using SQL, e.g.:


One advantage of this approach is that the data is more accessible. I could just dump the database somewhere but then you'd have to download a large file and figure out how query it. This way, you can play with it straight away. It also means people can make use of it before I make up my mind how best to package it (for example, as part of a larger database of eukaryote names). This is one of the main motivations behind the paper, how to avoid the trap of spending years cleaning and augmenting data and not making it available to others because of the overhead of building a web site around the data. I may look at liberating some other datasets using this approach.