Friday, March 24, 2017

Notes for WikiCite 2017: Wikispecies reference parsing

In preparation for WikiCite 2017 I'm looking more closely at extracting bibliographic information from Wikispecies. The WikiCite project "is a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects". One reason for doing this is so that each factual statement in Wikidata can be linked to evidence for that statement. Practical efforts towards this goal include tools to add details of articles from CrossRef and PubMed straight into Wikidata, and tools to extract citations from Wikipedia (as these are likely to be sources of evidence for statements made in Wikipedia articles).

Wikispecies occupies a rather isolated spot in the Wikipedia landscape. Unlike other sites, which are essentially comprehensive encyclopedias in different languages, Wikispecies focusses on one domain: taxonomy. In a sense, it's a prototype of Wikidata in that it provides basic facts (who described what species when, and what is the classification of those species) that in principle can be reused by any of the other wikis. However, in practice this doesn't seem to have happened much.

What Wikispecies has become, however, is a crowd-sourced database of the taxonomic literature. For someone like me who is desperately gathering up bibliographic data so that I can extract articles from the Biodiversity Heritage Library (BHL), this is a potential goldmine. But there's a catch. Unlike, say, the English language Wikipedia, which has a single widely-used template for describing a publication, Wikispecies has its own method of representing articles. It uses a somewhat confusing mix of templates for author names, and then uses barely standardised formatting rules to mark out parts of a publication (such as journal, volume, issue, etc.). Instead of a single template to describe a publication, in Wikispecies a publication may itself be described by a unique template. This has some advantages, in that the same reference can be transcluded into multiple articles (in other words, you enter the bibliographic details once). But this leaves us with many individual templates with multiple, idiosyncratic styles of representing bibliographic data. Some have tried to get the Wikispecies community to adopt the same template as Wikipedia (see e.g., this discussion) but this proposal has met with a lot of resistance. From my perspective as a potential consumer of data, the current situation in Wikispecies is frustrating, but the reality is that the people who create the content get to decide how they structure that content. And understandably, they are less than impressed by requests that might help others (such as data miners) at the expense of making their own work more difficult.

In summary, if I want to make use of Wikispecies I am going to need to develop a set of parsers that can make a reasonable fist of parsing all the myriad citation formats used in Wikispecies (my first attempts are on GitHub). I'm looking at parsing the references and converting them to a more standard format in JSON (I've made some notes on various bibliographic formats in JSON such as BibJSON and CSL-JSON). One outcome of this work will be, I hope, more articles discovered in BHL (and hence added to BioStor), and more links to identifiers, which could be fed back into Wikispecies. I also want to explore linking the authors of these papers to identifiers, as already sketched out in The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor.
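
To give a flavour of what one such parser might involve, here is a minimal sketch in Python that turns a single reference line into a CSL-JSON-style record. It is not the code on GitHub. The sample reference is entirely invented (authors, title, journal, ISSN and DOI are placeholders), and the layout it assumes (author templates like {{a|Full name|Family, Initials}}, then year, title, an italicised journal, volume(issue): pages, and a {{doi|...}} template) is only one of the many styles found in Wikispecies. The function name parse_reference and the regular expressions are illustrative choices, not anything Wikispecies defines.

```python
import json
import re

# Invented reference, written to resemble ONE common Wikispecies layout.
# None of these bibliographic details are real.
SAMPLE = (
    "* {{a|Jane Smith|Smith, J.}} & {{a|John Doe|Doe, J.}} 2001. "
    "A made-up title for illustration. "
    "''[[ISSN 0000-0000|Journal of Examples]]'' 12(3): 45-67. "
    "{{doi|10.1234/example.2001}}"
)

# Assumed patterns; real references will need many more variants.
AUTHOR_RE = re.compile(r"\{\{a(?:ut)?\|([^{}]+)\}\}")
YEAR_RE = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")
JOURNAL_RE = re.compile(
    r"''(.+?)''\s*(\d+)(?:\((\d+)\))?\s*:\s*(\d+)(?:\s*[-\u2013]\s*(\d+))?"
)
DOI_RE = re.compile(r"\{\{doi\|([^{}|]+)\}\}", re.IGNORECASE)
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]")


def strip_links(text):
    """Replace [[target|label]] or [[label]] with just the label."""
    return LINK_RE.sub(r"\1", text)


def parse_reference(wikitext):
    """Best-effort conversion of one reference line to a CSL-JSON-like dict."""
    csl = {"type": "article-journal"}

    # Authors: take the last template argument as the display name.
    authors = []
    for match in AUTHOR_RE.finditer(wikitext):
        display = match.group(1).split("|")[-1].strip()
        if "," in display:
            family, given = (p.strip() for p in display.split(",", 1))
            authors.append({"family": family, "given": given})
        else:
            authors.append({"literal": display})
    if authors:
        csl["author"] = authors

    # Work on the text with author templates removed, so positions agree.
    rest = AUTHOR_RE.sub("", wikitext)

    year = YEAR_RE.search(rest)
    if year:
        csl["issued"] = {"date-parts": [[int(year.group(1))]]}

    journal = JOURNAL_RE.search(rest)
    if journal:
        csl["container-title"] = strip_links(journal.group(1)).strip()
        csl["volume"] = journal.group(2)
        if journal.group(3):
            csl["issue"] = journal.group(3)
        pages = journal.group(4)
        if journal.group(5):
            pages += "-" + journal.group(5)
        csl["page"] = pages
        # Assume the title is whatever sits between the year and the journal.
        if year and year.end() < journal.start():
            title = rest[year.end():journal.start()].strip(" .")
            if title:
                csl["title"] = title

    doi = DOI_RE.search(wikitext)
    if doi:
        csl["DOI"] = doi.group(1).strip()

    return csl


if __name__ == "__main__":
    print(json.dumps(parse_reference(SAMPLE), indent=2))
```

A regular-expression approach like this is brittle by design: each new formatting style needs its own pattern, which is why a set of parsers rather than a single one seems unavoidable. Emitting CSL-JSON field names (author, issued, container-title, volume, issue, page, DOI) at least means the output of every parser can be handled the same way downstream.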