Wednesday, June 13, 2018

Liberating links between datasets using lightweight data publishing: an example using IPNI and the taxonomic literature

I've written a short paper entitled "Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature" (phew) and put a preprint on bioRxiv ( while I figure out where to publish it. Here's the abstract:

Constructing a biodiversity knowledge graph will require making millions of cross links between diversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a "datasette", a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names, and identifiers for the taxonomic articles that published those names.

In some ways the paper is simply a record of me trying to figure out how to publish a project that I've been working on for several years, namely linking names from BioNames. The preprint discusses various options, before settling on "datasettes", which is a nice method developed by Simon Willison (@simonw) to wrap up simple databases with their own web server and query API and make them accessible on the web. These can run on a local machine, or be packaged up as a Docker container, which is what I've done. You play with the database here: If this link is offline, then you can grab the container here and run it yourself. If, like me, you're new to Docker, then I recommend grabbing a copy of Kitematic.

The datasette interface is simple but gives you lots of freedom to explore the data.


For example, you have ability to query the data using SQL, e.g.:


One advantage of this approach is that the data is more accessible. I could just dump the database somewhere but then you'd have to download a large file and figure out how query it. This way, you can play with it straight away. It also means people can make use of it before I make up my mind how best to package it (for example, as part of a larger database of eukaryote names). This is one of the main motivations behind the paper, how to avoid the trap of spending years cleaning and augmenting data and not making it available to others because of the overhead of building a web site around the data. I may look at liberating some other datasets using this approach.