Wednesday, January 11, 2017

The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor

I've added an experimental feature to BioStor that uses data from Wikidata and Wikispecies to augment what information BioStor displays on authors. This is a crude first step towards the goal of representing all the data in BioStor as a "knowledge graph" where articles, journals, and authors are all treated as entities, all have identifiers, and we can explore relationships between those entities (e.g., citation, co-authorship, etc.). At the moment this is true of articles, which have Biostor URLs (and in many cases DOIs), and for most journals which are identified by their ISSN. Using identifiers helps reduce ambiguity, especially if there are multiple ways to represent the same thing (e.g., all the alternative ways to write a journal name can be circumvented by using the journal's ISSN).

However, BioStor doesn't have a way to identify authors beyond simply searching for a name. As a first step to tackling this problem I've added a little widget that displays information about an author based on the name you are searching for. For example, searching for George Albert Boulenger will give you a list of publications where the author name is "George Albert Boulenger", as well as a picture of the author and some identifiers (from sources such as VIAF, ISNI, IPNI, and Wikidata):

Screenshot 2017 01 11 16 30 57

For now this widget is independent of the data in BioStor. I don't link an article to its author(s) using identifiers for those authors, nor have I tackled the problem of clustering all the variations in people's names together into one set of names that share the same identifier (see Equivalent author names) nor do I attempt to match names to identifiers (see Reconciling author names using Open Refine and VIAF) other than by an exact text search (for details see below). At this stage I just want to get a sense of what identifiers exist for an author, and what I can learn from those identifiers. I also want to explore the potential of Wikispecies as a source of data on people and publications, and how this relates to Wikidata (for earlier thoughts on using Wikipedia for the same goal see Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library).

Wikispecies

I confess I've never really "got" Wikispecies (e.g., Wikispecies is not a database), it seems to exist in isolation from Wikipedia, which is arguably more informative about many species. But there are a couple of things Wikispecies does very well. Firstly, it is building a rich, crowd-sourced bibliography of papers on the taxonomy of many different species. Readers of iPhylo will recall how many times I've expressed frustration at the nearly evidence-free nature of many online taxonomic databases that simply have lists of names unconnected to the primary literature. Many Wikispecies pages have long lists of papers, making it a potential goldmine. Recently there is a lot of interest in extracting bibliographic data from Wikipedia (see WikiCite). Wikispecies could also be harvested, although a major obstacle any such project faces is the lack of a consistent format for references in Wikispecies.

The other nice thing about Wikipecies is that it has articles on taxonomic authorities, and these often list publications by those authors, and also list external identifiers for those authors, such as the VIAF and ISNI identifiers used in the library world, IPNI and ZooBank identifiers used in taxonomic databases, and ORCID which is becoming the de-facto identifier for academic researchers. This information also ends up in Wikidata.

Using Wikidata to glue things together

Wikidata is an interesting project that, like Wikispecies, I've been in two minds about (see Wikidata, Wikipedia, and #wikisci). However, I've started to make more use of it recently. Inspired by the Wikidata:SPARQL query service/2016 SPARQL Workshop I decided to explore the SPARQL query interface to Wikidata. I was struck by one of the example queries involving Wikispecies, and so after a little bit of messing about came up with a query that takes the name of an author and returns some identifiers from Wikidata, as well as an image of that person if one is available. I restrict the results to people that have an article about them in Wikispecies, because I want start exploring using those articles to make assertions about authorship. Here is a query to search for "George Albert Boulenger":

SELECT *
WHERE
{
  ?item rdfs:label "George Albert Boulenger"@en .
  ?article schema:about ?item .
  ?article schema:isPartOf  .
  OPTIONAL {
   ?item wdt:P213 ?isni .
	}
  OPTIONAL {
   ?item wdt:P214 ?viaf .
	}
  OPTIONAL {
   ?item wdt:P18 ?image .
	}
  OPTIONAL {
   ?item wdt:P496 ?orcid .
	}
  OPTIONAL {
   ?item wdt:P586 ?ipni .
	}
  OPTIONAL {
   ?item wdt:P2006 ?zoobank .
	}
}

This query simply asks whether Wikidata has an item on this person, whether that item is linked to Wikispecies, what identifiers Wikidata has, and whether there is an image of the person. You can see the query "live" here:

I've added some code to BioStor to do this query on the fly, and display the results. So, for Boulenger we get: Screenshot 2017 01 11 17 04 16 Here is the result for noted carcinologist Jocelyn Crane who currently lacks identifiers: Screenshot 2017 01 11 17 05 32 A nice surprise was Bernard Landry: Screenshot 2017 01 11 17 07 14 Note the ORCID 0000-0002-6005-1067. Interestingly, Bernard Landry's ORCID profile doesn't list any publications, whereas we can see lists of these in BioStor and Wikispecies.

Where next?

There are several obstacles to mapping the names of authors to identifiers. One is simply the lack of identifiers. This seems to be rapidly becoming less of a problem with the efforts of the library community around VIAF, the rise of ORCID for living researchers, and the creation of Wikidata items for every taxonomist in Wikispecies. The next challenge is clustering the different ways of writing the same person's name into sets that represent the same person. As discussed above, there are tools for this. Furthermore, with Wikipedia and Wikispecies we have sources of lists of publications linked to a person and their identifiers, which should simplify the task considerably. What is nice about this is that it relies on a crowd-sourcing effort which is already well-established, namely those people who in adding articles to Wikispecies and Wikipedia are created a curated database of publications linked to authors. In many cases those publications are linked to BHL (the source that BioStor extracts its articles from), so many of the links between publications and people are essentially lying there, just waiting for some skilful harvesting.