Monday, April 13, 2020

Wikidata and the bibliography of life in the time of coronavirus

I haven't posted on iPhylo for a while, and since my last post back in January things have obviously changed quite a bit. In late January and early February I was teaching a course on biodiversity informatics, and students discovered the Johns Hopkins coronavirus dashboard, which seemed like a cool way to display information on a situation that was happening on the other side of the world. All fairly abstract.



Today the dashboard looks rather different, and things are no longer quite so abstract (and, of course, never were for the population of Wuhan).



At the same time as the pandemic is affecting so many lives (and taking those of people who had a big impact on my childhood), there is the extraordinary case of open access champion Jon Tennant (@protohedgehog). On April 8th I received an item from his email newsletter entitled "Converting adversity into productivity", detailing how he'd managed to get through a traumatic period prior to coronavirus, and how productive he had managed to be (his email lists a whole slew of articles he'd written). The next day, this:

The day before, this happened:

Times like this tend to focus the mind, and for anyone with research skills the question arises "what should I be doing?". Some people are addressing issues directly or indirectly related to the pandemic. It feels like every second post on Medium features someone playing data scientist with coronavirus data. Others are taking existing tools and projects and looking for ways to make them relevant to the problem, such as Plazi and Pensoft seeking to improve access to the biology of coronavirus hosts, as part of their broader mission to make biodiversity information more accessible.

Another approach, in some ways what Jon Tennant did, is to use the time to focus on what you think matters and work on that. Of course, this assumes that you are fortunate enough to have the time and resources to do that. I have tenure and my children are grown up; life would be very different without a salary or with small children or other dependents.

One of the things I am increasingly focussing on is the idea of Wikidata as the "bibliography of life". Specifically, I want to get as much taxonomic and related literature as possible into Wikidata, and to link as much of it as I can to freely-available versions of that literature (e.g., on Internet Archive). I want that literature embedded in the citation graph, linked to authors, and linked to the taxa treated in those papers. A lot of literature is already going into Wikidata via bots that consume the stream of papers with CrossRef DOIs and upload their details to Wikidata, but there is a huge corpus of literature that this approach overlooks. Not only do we have digital libraries like the Biodiversity Heritage Library and JSTOR, but there is also a long tail of small publishers making taxonomic literature available online, and I want all of this to be equally discoverable.

One aspect of this project is to populate Wikidata with this missing literature. Over the years as part of projects such as BioNames and BioStor I have accumulated hundreds of thousands of bibliographic references. These aren't much use sitting on my hard drive. Adding them to Wikidata makes them more accessible, and also enables others to make them much richer. For example, the irrepressible @SiobhanLeachman regularly converts author strings to author things:

Adding things to Wikidata is fun, but it can be a struggle to get a sense of what is in Wikidata and how it is interconnected. So I've started to build a simple app that helps show me people, publications, journals, and taxa in a fairly conventional way, all powered by Wikidata. The app is live at https://alec-demo.herokuapp.com. It is not going to win any prizes for performance or design, but I find it useful.
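To give a flavour of what "powered by Wikidata" can mean in practice, here is a minimal sketch of the kind of query such a view might rest on. It is not the code behind the app. It assumes the public Wikidata Query Service endpoint and the standard Wikidata properties P50 (author), P356 (DOI) and P1433 (published in); the function names are made up for illustration, and Q42 is just a placeholder item.

```python
import re
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def works_by_author_query(author_qid: str, limit: int = 50) -> str:
    """Build a SPARQL query listing works whose author (P50) is the given item.

    Hypothetical helper for illustration only. Note that works where the
    author is still recorded as a plain string (P2093, "author name string")
    will not be matched by this query.
    """
    # Guard against malformed identifiers before interpolating into the query.
    if not re.fullmatch(r"Q[1-9][0-9]*", author_qid):
        raise ValueError(f"not a Wikidata item identifier: {author_qid!r}")
    return f"""
SELECT ?work ?workLabel ?doi ?journalLabel WHERE {{
  ?work wdt:P50 wd:{author_qid} .
  OPTIONAL {{ ?work wdt:P356 ?doi . }}
  OPTIONAL {{ ?work wdt:P1433 ?journal . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


def query_url(sparql: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode(
        {"query": sparql, "format": "json"}
    )


if __name__ == "__main__":
    # Q42 is only a placeholder; substitute the item for a taxonomist.
    print(query_url(works_by_author_query("Q42", limit=5)))
```

The sketch only builds the query and URL, so it runs without network access; fetching the URL with any HTTP client should return the matching works as JSON. The same pattern could be pointed at journals or taxa by swapping the property in the first triple.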

Partly I'm trying to make the original articles more accessible, e.g.:

I'm keen to link taxonomists to their publications and ultimately the taxa they work on:

And we can link taxa and publications visually:

The community-based, somewhat chaotic consensus-driven approach of Wikidata can be frustrating ("well, if you'd asked ME, I wouldn't have done it that way"), but I think it's time to accept that this is simply the nature of the beast, and marvel at the notion that we have a globally accessible and editable knowledge graph. We can stay in our domain-specific silos, where we can control things but remain bereft of both users and contributors. However, if we are willing to let go of that control, and accept that things won't always be done the way we think would be optimal, there is a lot of freedom to be gained by deferring to Wikidata's community decisions and simply getting on with building the bibliography of life. Maybe that is something worthwhile to do in this time of coronavirus.