Tuesday, May 18, 2021

Preprint on Wikidata and the bibliography of life

Last week I submitted a manuscript entitled "Wikidata and the bibliography of life". I've been thinking about the "bibliography of life" (AKA a database of every taxonomic publication ever published) for a while, and this paper explores the idea that Wikidata is the place to create this database. The preprint version is on bioRxiv (doi:10.1101/2021.05.04.442638). Here's the abstract:

Biological taxonomy rests on a long tail of publications spanning nearly three centuries. Not only is this literature vital to resolving disputes about taxonomy and nomenclature, for many species it represents a key source - indeed sometimes the only source - of information about that species. Unlike other disciplines such as biomedicine, the taxonomic community lacks a centralised, curated literature database (the “bibliography of life”). This paper argues that Wikidata can be that database as it has flexible and sophisticated models of bibliographic information, and an active community of people and programs (“bots”) adding, editing, and curating that information. The paper also describes a tool to visualise and explore bibliography information in Wikidata and how it links to both taxa and taxonomists.

The manuscript summarises some work I've been doing to populate Wikidata with taxonomic publications (building on a huge amount of work already done), and also describes ALEC, which I use to visualise this content. I've made various (unreleased) knowledge graphs of taxonomic information (and one that I have actually released, Ozymandias). I'm still torn between investing more effort in Wikidata and constructing lighter, faster, domain-specific knowledge graphs for taxonomy. I think the answer is likely to be "yes".

Meantime, one chart I quite like from the submitted version of this paper is shown below.

It's a chart that is a bit tricky to interpret. My goal was to get a sense of whether bibliographic items added to Wikidata (e.g., taxonomic papers) were actually being edited by the Wikidata community, or whether they just sat there unchanged after they were added. If people are editing these publications, for example, by adding missing author names, linking papers to items for their authors, or adding additional identifiers (such as DOIs, ZooBank identifiers, etc.), then there is clear value in using Wikidata as a repository of bibliographic data. So I grabbed a sample of 1000 publications, retrieved their edit history from Wikidata, and plotted the creation timestamp of each item against the timestamp of each edit made to that item. If items were never edited then every point would fall along the diagonal line. If edits are made, they appear to the right of the diagonal. I could have just counted edits made, but I wanted to visualise those edits. As the chart shows, there is quite a lot of editing activity, so there is a community of people (and bots) curating this content. In many ways this is the strongest argument for using Wikidata for a "bibliography of life". Any database needs curation, which means people, and this is what Wikidata offers: a community of people who care about often esoteric details, and get pleasure from improving structured data.
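For anyone wanting to try something similar, here is a rough Python sketch of the idea (not the code used for the paper). It assumes the standard MediaWiki revisions API on wikidata.org; the function names, the User-Agent string, and the use of matplotlib for plotting are my own illustrative choices. Each item's first revision is treated as its creation, so that point sits on the diagonal and every later edit lands to its right.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime

API = "https://www.wikidata.org/w/api.php"
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format returned by the MediaWiki API


def fetch_revision_timestamps(qid, limit=500):
    """Fetch up to `limit` revision timestamps for one item, oldest first.

    Makes a network request; `qid` is an item id such as "Q42".
    """
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": qid,
        "rvprop": "timestamp",
        "rvlimit": limit,
        "rvdir": "newer",  # oldest revision first
        "format": "json",
    })
    # Hypothetical User-Agent; replace with something that identifies you.
    request = urllib.request.Request(
        f"{API}?{params}",
        headers={"User-Agent": "edit-history-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    page = next(iter(data["query"]["pages"].values()))
    return [rev["timestamp"] for rev in page.get("revisions", [])]


def edit_points(timestamps):
    """Turn one item's revision timestamps into (created, edited) pairs.

    The earliest revision is taken as the creation time, so the first
    pair has created == edited (a point on the diagonal).
    """
    times = sorted(datetime.strptime(t, TS_FORMAT) for t in timestamps)
    if not times:
        return []
    created = times[0]
    return [(created, edited) for edited in times]


def plot_points(points):
    """Scatter edit time (x) against creation time (y); needs matplotlib."""
    import matplotlib.pyplot as plt

    xs = [edited for _, edited in points]
    ys = [created for created, _ in points]
    plt.scatter(xs, ys, s=2)
    plt.xlabel("edit timestamp")
    plt.ylabel("item creation timestamp")
    plt.show()
```

To reproduce the chart you would call `fetch_revision_timestamps` for each item in the sample, concatenate the results of `edit_points`, and pass the combined list to `plot_points`. Items with more revisions than the limit would need the API's continuation mechanism, which this sketch leaves out.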

There are still huge gaps in Wikidata's coverage of the taxonomic literature. Once you move beyond the "low-hanging fruit" of publications with CrossRef DOIs, the task of adding literature to Wikidata gets a bit more complicated. Then there is the reconciliation problem: given an existing taxonomic database with a list of references, how do we match those references to the corresponding items in Wikidata? There is still a lot to do.
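The easy end of the reconciliation problem, matching a reference that already has a DOI, can be sketched as follows. This is only an illustration of a first pass, not a full solution: it assumes DOIs are stored in Wikidata under property P356 and, by convention, in upper case, and the function names and the example DOI `10.1234/abc.def` are hypothetical. References without DOIs would need something fuzzier (titles, journals, volumes, pages), which is where it gets hard.

```python
import re


def normalise_doi(raw):
    """Strip common prefixes and upper-case a DOI for matching."""
    doi = raw.strip()
    # Remove "doi:" or a doi.org resolver URL from the front, if present.
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi, flags=re.IGNORECASE)
    return doi.strip().upper()


def doi_lookup_query(raw_doi):
    """Build a SPARQL query for the Wikidata item(s) carrying this DOI."""
    doi = normalise_doi(raw_doi)
    # Escape backslashes and quotes so the DOI is a valid SPARQL string literal.
    doi = doi.replace("\\", "\\\\").replace('"', '\\"')
    # LIMIT 2 so that duplicate items for one DOI can be detected.
    return f'SELECT ?item WHERE {{ ?item wdt:P356 "{doi}" }} LIMIT 2'


print(doi_lookup_query("https://doi.org/10.1234/abc.def"))
```

The resulting query string can be sent to the Wikidata Query Service. Zero results means the publication is probably missing from Wikidata; more than one suggests duplicate items that need merging.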