Wednesday, July 01, 2020

Diddling with semantic data: linking natural history collections to the scientific literature

A couple of weeks ago I was grumpy on the Internet (no, really) and complained about museum websites and how their pages often lacked vital metadata tags (such as rel=canonical or Facebook Open Graph tags). This got a response:
Vince's lovely line "diddle with semantic data" is the inspiration for the title of this post, in which I describe a tool to display links across datasets, such as museum specimens and scientific publications. This tool builds on ideas I discussed in 2014(!) (Rethinking annotating biodiversity data, see also "Towards a biodiversity knowledge graph" doi:10.3897/rio.2.e8767).


TL;DR;

If you want the quick summary, here it is. If we have persistent identifiers (PIDs) for specimens and publications (or any other entities of interest), and we have a database of links between pairs of PIDs (e.g., paper x mentions specimen y), and both entities have web pages, then we can display that relationship on both web pages using a Javascript bookmarklet. We can do this without permission, in the sense that the specimen web page can be hosted by a museum (e.g., The Natural History Museum in London) and the publication hosted by a publisher (e.g., The Public Library of Science), and neither organisation need know about the connection between specimen and publication. But because we do, we can add that connection. (Under the hood this approach relies on a triplestore that stores the relationships between pairs of entities using the Web Annotation Data Model.)


Background

Consider the web page https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef which is for a specimen of the cestode Gangesia agraensis in the NHM (catalogue number 2012.8.23.3). If you look at this page the information content is pretty minimal, which is typical of many natural history specimens. In particular, we have no idea if anyone has done anything with this specimen. Has it been sequenced, imaged, or mentioned in a publication? Who knows? We have no indication of the scientific importance or otherwise of this specimen.



Now, consider the page https://doi.org/10.1371/journal.pone.0046421 for the PLoS ONE paper Revision of Gangesia (Cestoda: Proteocephalidea) in the Indomalayan Region: Morphology, Molecules and Surface Ultrastructure. This paper has most of the bells and whistles of a modern paper, including metrics of attention. However, despite this paper using specimens from the NHM there is no connection between the paper and the museum's collections.


Making these connections is going to be important for tracking the provenance of knowledge based on those specimens, as well as developing metrics of collection use. Some natural history collection web sites have started to show these sorts of links, but we need them to be available on a much larger scale, and the links need to be accessible not just on museum sites but everywhere specimens are used. Nor is this issue restricted to natural history collections. My use of "PIDs" in this blog post (as opposed, say, to GUIDs) reflects the fact that part of the motivation for this work is my participation in the Towards a National Collection - HeritagePIDs project (@HeritagePIDs), whose scope includes collections and archives from many different fields.


Magic happens

The screenshots below show the same web pages as before, but now we have an overlay window that displays additional information. For specimen 2012.8.23.3 we see a paper (together with a list of the authors, each sporting an ORCID). This is the PLoS ONE paper, which cites this specimen.




Likewise if we go to the PLoS ONE paper, we now see a list of specimens from the NHM that are mentioned in that paper.




What happened?

The overlay is generated by a bookmarklet, a piece of Javascript that displays an overlay on the right hand side of the web page, then does two things:
  1. It reads the web page to find out what the page is "about" (the main entity). It does this by looking for tags such as rel="canonical", og:url, or a meta tag with a DOI. It turns out that lots of relevant sites don't include a machine readable way of saying what they are about (which led to my tweet that annoyed Vince Smith, see above). While it may be "obvious" to a human what a site is about, we need to spell that out for computers. The easy way to do this is to explicitly include a URL or other persistent identifier for the subject of the web page.
  2. Once the script has figured out what the page is about, it then talks to a triplestore that I have set up and asks "do you have any annotations for this thing?". If so, they are returned as a DataFeed (basically a JSON-LD variant of an RSS feed) and the results are displayed in the overlay.
Step one hinges on the entity of interest having a persistent identifier, and that identifier being easy to locate in the web page. Academic publishers are pretty good at doing this, mainly because it increases their visibility to search engines such as Google Scholar, and also it helps reference managers such as Zotero automatically extract bibliographic data for a paper. These drivers don't exist for many types of data (such as specimens, or DNA sequences, or people), and so often those sites will need custom code to extract the corresponding identifier.
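
As a rough illustration of step one, here is a minimal sketch of how a script might work out what a page is about. It is not the bookmarklet's actual code. The function name findMainEntity, the specific DOI meta tag names (citation_doi and dc.identifier), and the order in which tags are tried are all assumptions made for this example.

```javascript
// Hypothetical helper (not the bookmarklet's real code): guess the main
// entity of a page from tags in its <head>. `doc` is anything with a
// querySelector method, such as the browser's `document`.
function findMainEntity(doc) {
  // Tags to try, in order. Preferring a DOI over rel="canonical" is an
  // assumption; the DOI meta tag names are assumptions too.
  const candidates = [
    { selector: 'meta[name="citation_doi"]', attribute: "content", isDoi: true },
    { selector: 'meta[name="dc.identifier"]', attribute: "content", isDoi: true },
    { selector: 'link[rel="canonical"]', attribute: "href", isDoi: false },
    { selector: 'meta[property="og:url"]', attribute: "content", isDoi: false },
  ];

  for (const { selector, attribute, isDoi } of candidates) {
    const element = doc.querySelector(selector);
    const value = element ? element.getAttribute(attribute) : null;
    if (!value) continue;

    const trimmed = value.trim();
    if (!isDoi) return trimmed;

    // Turn a bare DOI (optionally prefixed with "doi:") into a URL.
    const bare = trimmed.replace(/^doi:/i, "");
    if (bare.startsWith("10.")) return "https://doi.org/" + bare;
    // Not a recognisable DOI: fall through to the next candidate tag.
  }
  return null; // The page doesn't say what it is about.
}

// In a bookmarklet this would be called as: findMainEntity(document)
```

Trying the DOI first means a publisher page resolves to the same identifier regardless of which publisher URL the reader happens to be on. A page with none of these tags returns null, which is the case where custom code per site would be needed.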

Step two requires that we have a database somewhere that knows whether two things are linked. For various reasons I've settled on using a triplestore for this data, and I'm modelling the connection between two things as an annotation. Below is the (simplified) JSON-LD for an annotation linking the NHM specimen 2012.8.23.3 to the paper Revision of Gangesia (Cestoda: Proteocephalidea) in ... .

{
  "type": "Annotation",
  "body": {
    "id": "https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef",
    "name": "2012.8.23.3"
  },
  "target": {
    "name": "Revision of Gangesia (Cestoda: Proteocephalidea) in ...",
    "canonical": "https://doi.org/10.1371/journal.pone.0046421"
  }
}

Strictly speaking we could have something even more minimal:

{
  "type": "Annotation",
  "body": "https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef",
  "target": "https://doi.org/10.1371/journal.pone.0046421"
}

But this means we couldn't display the names of the specimen and the paper in the overlay. (The use of canonical in the target is preparation for when annotations will be made on specific representations, such as a PDF of a paper, the same paper in HTML, etc. and I want to be able to group those together.)

Leaving aside these technical details, the key thing is that we have a simple way to link two things together.
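
To show why one annotation is enough to decorate both web pages, here is a small sketch that looks up everything linked to a given identifier, whichever side of the annotation it sits on. It works on an in-memory array rather than a triplestore, and the function names (entityId, entityName, linkedEntities) are made up for this example.

```javascript
// Read the identifier of a body or target. Either may be a plain URL
// string (the minimal form) or an object with "canonical" or "id".
function entityId(entity) {
  if (typeof entity === "string") return entity;
  return entity.canonical || entity.id;
}

// Pick something to display: the name if there is one, else the identifier.
function entityName(entity) {
  if (typeof entity === "string") return entity;
  return entity.name || entityId(entity);
}

// Return the entities linked to `pid`, looking at both ends of each
// annotation. This in-memory search stands in for a triplestore query.
function linkedEntities(annotations, pid) {
  const links = [];
  for (const { body, target } of annotations) {
    if (entityId(body) === pid) {
      links.push({ id: entityId(target), name: entityName(target) });
    } else if (entityId(target) === pid) {
      links.push({ id: entityId(body), name: entityName(body) });
    }
  }
  return links;
}

// Example: the specimen and paper from this post.
const exampleAnnotations = [
  {
    type: "Annotation",
    body: {
      id: "https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef",
      name: "2012.8.23.3",
    },
    target: {
      name: "Revision of Gangesia (Cestoda: Proteocephalidea) in ...",
      canonical: "https://doi.org/10.1371/journal.pone.0046421",
    },
  },
];

// Asking about the paper finds the specimen, and vice versa.
console.log(linkedEntities(exampleAnnotations, "https://doi.org/10.1371/journal.pone.0046421"));
```

With the minimal form of the annotation the same lookup still works, but the overlay would only have URLs to show rather than names.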


Where do the links come from?

Now we hit the $64,000 Question: how do we know that specimen https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef and paper https://doi.org/10.1371/journal.pone.0046421 are linked? To do that we need to text mine papers looking for specimen codes (like 2012.8.23.3), discover the persistent identifier that corresponds to that code, then combine that with the persistent identifier for the entity that refers to that specimen (such as a paper, a supplementary data file, or a DNA sequence).

For this example I'm spared that work because Ross Mounce (@rmounce) and Aime Rankin (@AimeRankin) did exactly that for some NHM specimens (see doi:10.5281/zenodo.34966 and https://github.com/rossmounce/NHM-specimens). So I just wrote a script to parse a CSV file and output specimen and publication identifiers as annotations. So that I can display more, I also grabbed RDF for the specimens, publications, and people. The RDF for the NHM specimens is available by simply appending an extension (such as .jsonld) to the specimen URL, and you can get RDF for people and their papers from ORCID (and other sources).
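
The CSV-to-annotation step could look something like the sketch below. This is not the script I actually used, and the column names (specimen_url, catalogue_number, doi) are hypothetical rather than taken from Ross and Aime's file. The parsing is deliberately naive and assumes no quoted fields or embedded commas.

```javascript
// Convert CSV text (with a header row) into annotation objects like the
// one shown earlier. Column names are hypothetical: "specimen_url",
// "catalogue_number" and "doi" (a bare DOI such as 10.1371/...).
function csvToAnnotations(csvText) {
  // Split into non-empty lines; naive parsing, no support for quoted fields.
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length < 2) return []; // Nothing but a header, or empty input.

  const header = lines[0].split(",").map((cell) => cell.trim());

  return lines.slice(1).map((line) => {
    const cells = line.split(",").map((cell) => cell.trim());
    // Pair each header name with the cell in the same column.
    const row = Object.fromEntries(header.map((name, i) => [name, cells[i]]));

    return {
      type: "Annotation",
      body: { id: row.specimen_url, name: row.catalogue_number },
      target: { canonical: "https://doi.org/" + row.doi },
    };
  });
}
```

Each row becomes one annotation, so a paper that mentions ten specimens simply appears in ten rows.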

As an aside, I couldn't use Ross and Aime's work "as is" because the persistent identifiers had changed (sigh). The NHM has changed specimen URLs (replacing /specimen/ with /object/) and switched from http to https. Even the DOIs have changed in that the HTTP resolver http://dx.doi.org has now been replaced by https://doi.org. So I had to fix that. If you want this stuff to work DO NOT EVER CHANGE IDENTIFIERS!
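
The fix amounts to rewriting old-style identifiers into their current form before comparing them. Here is a minimal sketch covering just the two changes described above. The function name normalizeIdentifier is invented for this example, and it assumes the old specimen URLs used the same data.nhm.ac.uk host.

```javascript
// Rewrite old-style identifiers into their current form. Only the two
// changes mentioned in this post are handled; anything else is returned
// untouched. Assumes old NHM URLs were on the data.nhm.ac.uk host.
function normalizeIdentifier(url) {
  // DOIs: http://dx.doi.org/... (or any http/https mix) -> https://doi.org/...
  const doi = url.match(/^https?:\/\/(?:dx\.)?doi\.org\/(.+)$/);
  if (doi) return "https://doi.org/" + doi[1];

  // NHM specimens: /specimen/ became /object/, and http became https.
  const nhm = url.match(/^https?:\/\/data\.nhm\.ac\.uk\/(?:specimen|object)\/(.+)$/);
  if (nhm) return "https://data.nhm.ac.uk/object/" + nhm[1];

  return url;
}
```

Running every identifier through a function like this, both when loading annotations and when reading a web page, means old and new forms end up matching.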


How can I get this bookmarklet thingy?

To install the bookmarklet go to https://pid-demonstrator.herokuapp.com and click and hold the "Annotate It!" link, then drag it to your web browser toolbar (on Safari it's the "Favourites Bar", on Chrome and Firefox it's the "Bookmarks Bar"). When you are looking at a web page click "Annotate It!". At the moment the NHM PLoS example above is the only one that does anything interesting, but this will change as I add more data.

My horribly rough code is here: https://github.com/rdmpage/pid-demonstrator.


What's next?

The annotation model doesn't just apply to specimens. For example, I'd love to be able to flag pages in BHL as being of special interest (such as "this page is the original description of this species"). This presents some additional challenges because the user can scroll through BHL and change the page, so I would need the bookmarklet to be aware of that and query the triplestore for each new page. I've got the first part of the code working, in that if you try the bookmarklet on a BHL page it "knows" when you've scrolled to a different page.



I obviously need to populate the triplestore with a lot more annotations. I think the simplest way forward is just to have spreadsheets (e.g., CSV files) with columns of specimen identifiers and DOIs, and convert those into annotations.

Lastly, another source of annotations is readers using tools such as hypothes.is, which I've explored earlier (see Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday). So we can imagine a mix of annotations made by machine, and annotations made by people, both helping construct a part of the biodiversity knowledge graph. This same graph can then be used to explore the connections between specimens and publications, and perhaps lead to metrics of scientific engagement with natural history collections.