Wednesday, May 31, 2017

Wikidata, WikiCite, and the "bibliography of life"

3hhZSGOn 400x400Last week I was at WikiCite 2017, a fascinating three day event in Vienna. Wikicite is "a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects", and is attracting increasing attention from academics, librarians, publishers, data geeks, and others. You can get a sense of the project by following @WikiCite on Twitter.

I went to the meeting in part to learn more about WikiCite, and also to spend some time hacking on Wikispecies. I'd been to only one Wiki event before (a Wiki Science Conference) so I'm still finding my way around this community. I spent the first two days listening to talks while coding away (more on this below), but on Wednesday put my own coding aside to join a bunch of people hacking the CrossRef event API in a great session led by Joe Wass. I've put some notes and code in GitHub. The event API tracks what people do with DOIs, including adding them to Wikipedia pages when citing a source in support of an assertion. A significant fraction of DOI resolutions are from Wikipedia pages, which is one reason why CrossRef was present at WikiCite.

Wikidata

In practice WikiCite's goal of building a bibliographic database to serve all Wikimedia projects means that articles, books, and other bibliographic items that are cited by Wikimedia projects will each be added to Wikidata. For example, the ZooKeys paper "Diversity of manota williston (Diptera, mycetophilidae) in ulu temburong national park, brunei" is item Q21188431 in Wikidata. Wikidata stores the key bibliographic metadata, including identifiers such as the DOI (which many at the WikiCite meeting pronounced "doy" much to my initial confusion). Screenshot 2017 05 31 12 46 43

This article was published in ZooKeys, which itself has a Wikidata item (Q219980), so in Wikidata the article is linked to the journal (i.e., "ZooKeys" isn't just a dumb string but a link to another Wikidata item). The article is also linked to two articles that it cites, and each of these is also a Wikidata item.

These citation links are one reason people are interested in WikiCite - it could be the basis of a free and open citation graph (for the benefits of such a graph see this piece by David Shotton doi:10.1038/502295a, a participant at the meeting in Vienna). Already some cool tools are being built on top of citation data in Wikidata, such as Scholia by Finn Årup Nielsen, Daniel Mietchen and Egon Willighagen. Here, for example, is my academic profile based on information in Wikidata. It's woefully incomplete, but intriguing. For a more complete example view Egon Willighagen's profile.

To some extent the utility of tools like Scholia will depend on how complete Wikidata's coverage is of the academic literature, which in turn raises the inevitable question of scope. Does Wikicite want to include just the literature cited in the various Wikimedia projects, or does it want to expand to include the total sum of academic literature?

Wikispecies, Wikidata and the bibliography of life

Wikispecies is one of the Wikimedia projects, and the only one that is topic-specific (the others are typically global in scope but have content in different languages, or host different data types such as images, scanned books, or structured data). As I've sketched out in an earlier post (Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library) I think Wikicite and Wikidata are potentially very important to projects such as BHL and the "bibliography of life". Much of our knowledge about the world's biodiversity is contained in the academic literature, and much of this is poorly known with no central database where we can find it, and much of it is still not digitised. It is tempting to think that Wikidata might be a platform around which the biodiversity community could focus its efforts on assembling a global database of biodiversity literature. Already major taxonomic journals such as ZooKeys are being fed into Wikidata, so it has a significant corpus of biodiversity literature already.

One way to grow this corpus is to focus on Wikispecies. In a post before the Wikicite meeting (Notes for WikiCite 2017: Wikispecies reference parsing) I elaborated on this idea. There are two stumbing blocks, one specific to Wikispecies, one a more general Wikidata issue.

The first issue is that Wikispecies bibliographic data is relatively unstructured, which makes converting it into structured data something of a challenge. I spent much of Wikicite hacking some code to do this on Glitch (more on Glitch later), you can see the results here: https://acoustic-bandana.glitch.me. This web site takes a Wikispecies reference and tries to convert it into CSL-JSON. Still very much a work in progress, but I've started building tools that use this web site as a service and process larger numbers of Wikispecies citations.

The second issue is how you get data into Wikidata, and this is something that's never been entirely clear to me. There are tools for adding an article using its DOI (sourcemd) but this isn't scalable, and doesn't handle the case of articles that don't have DOIs. This is still a "How do you Snapchat? You just Snapchat" moment. Wikidata desparately needs tools and a clear procedure whereby people like me with lots of bibliographic data can contribute.

Wikispecies

Another reason for my interest in Wikispecies (and other sources of bibliographic data such as the listed of cited literature being made available by CrossRef, see The Initiative for Open Citations) is that this data can be fed into BHL to locate more articles in that archive. Once these articles have been located they are stored in BioStor and BHL itself, but it makes sense to have them more accessible, and Wikidata looks to be an obvious candidate. Given that Wikispecies is essentially a crowd-source taxonomic database there is considerable overlap in content between Wikispecies and BHL. The Wikidata data model also allows for some of things that taxonomists care about, such as linking dates of publication to evidence relative to those dates (in older publications determining the publication date often requires quite extensive research).

Summary

Leaving aside the specific issues about how to get bibliographic data into Wikidata, I guess the question to ask is whether it makes sense to be developing large databases of bibliographic data without either using Wikidata as the platform to hold that data, or at least linking to Wikidata. Projects such as Gene Wiki are migrating from Wikipedia to Wikidata (see "Wikidata as a semantic framework for the Gene Wiki initiative" doi:10.1093/database/baw015), perhaps those of us interested in biodiversity literature could use projects like Gene Wiki as role models for how we could both contribute and benefit from Wikidata and Wikicite.

I've barely scratched the surface of what was discussed at Wikicite, for more details see the program. It is a very different sort of meeting in that the participants come from pretty diverse backgrounds, which helps shake up your own assumptions about what matters and how things should be done. It's also great that it's a meeting at which people write code or otherwise hack stuff together, so things actually get done. I've come away with lots to think about, and renewed enthusiasm about the role Wikimedia is playing in structuring our knowledge about the world.