Wednesday, September 23, 2020

Using the hypothes.is API to annotate PDFs

With somewhat depressing regularity I keep cycling back to things I was working on earlier but never quite got to work the way I wanted. The last couple of days it's been the turn of hypothes.is.

One of the things I'd like to have is a database of all taxonomic names such that if you clicked on a name you would not only get the bibliographic record for the publication where that name first appeared (which is what I've been building for animals in BioNames) but could also see the actual publication with the name highlighted in the text. This assumes that the publication has been digitised (say, as a PDF) and is accessible, but let's assume that this is the case. Now, we could do this manually, but we have tools to find taxonomic names in text. And in my use case I often know which page the name is on, and what the name is, so all I really want is to be able to highlight it programmatically (because I have millions of names to deal with).

So, time to revisit the hypothes.is API. One of the neat "tricks" hypothes.is have managed is the ability to annotate, say, a web page for an article and have that annotation automagically appear on the PDF version of the same article. As described in How Hypothesis interacts with document metadata this is in part because hypothes.is extracts metadata from the article's web page, such as DOI and link to the PDF, and stores that with the annotation (I say "in part" because the other part of the trick is to be able to locate annotations in different versions of the same text). If you annotate a PDF, hypothes.is stores the URL of the PDF and also a "fingerprint" of the PDF (see PDF Fingerprinting for details). This means that you can also add an annotation to a PDF offline (for example, on a file you have downloaded onto your computer) and - if hypothes.is has already encountered this PDF - that annotation will appear in the PDF online.

What I want to do is take a PDF, highlight the scientific name, and upload that annotation to hypothes.is so that the annotation is visible online when anyone opens the PDF (and ideally when they look at the web version of the same article). I want to do this programmatically. Long story short, this seems doable. Here is an example annotation that I created and sent to hypothes.is via their API:

{
    "uri": "http://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf",
    "document": {
        "highwire": {
            "doi": [
                "10.1590/1678-476620151053372375"
            ]
        },
        "dc": {
            "identifier": [
                "doi:10.1590/1678-476620151053372375"
            ]
        },
        "link": [
            {
                "href": "urn:x-pdf:6124e7bdb33241429158b11a1b2c4ba5"
            }
        ]
    },
    "tags": [
        "api"
    ],
    "target": [
        {
            "source": "http://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf",
            "selector": [
                {
                    "type": "TextQuoteSelector",
                    "exact": "Alpaida venger sp. nov.",
                    "prefix": "imens preserved in 75% ethanol. ",
                    "suffix": " (Figs 1-9) Type-material. Holot"
                },
                {
                    "type": "TextPositionSelector",
                    "start": 4834,
                    "end": 4857
                }
            ]
        }
    ],
    "user": "acct:xxx@hypothes.is",
    "permissions": {
        "read": [
            "group:__world__"
        ],
        "update": [
            "acct:xxx@hypothes.is"
        ],
        "delete": [
            "acct:xxx@hypothes.is"
        ],
        "admin": [
            "acct:xxx@hypothes.is"
        ]
    }
}
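For readers who want to try this, here is a minimal sketch of sending such an annotation to the API from Python. It is not the code used for the example above. It assumes the annotation is already in a Python dict shaped like the JSON above, that you have a developer API token for your hypothes.is account, and that the annotation creation endpoint is https://api.hypothes.is/api/annotations. The function names build_request and post_annotation are made up for illustration.

```python
import json
import urllib.request

# Assumption: the public hypothes.is endpoint for creating annotations.
API_URL = "https://api.hypothes.is/api/annotations"


def build_request(annotation, token):
    """Build (but do not send) a POST request carrying the annotation as JSON.

    `annotation` is a dict shaped like the example above; `token` is a
    personal API token (an assumption: you have generated one for your account).
    """
    body = json.dumps(annotation).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def post_annotation(annotation, token):
    """Send the annotation and return the decoded JSON reply from the server."""
    with urllib.request.urlopen(build_request(annotation, token)) as response:
        return json.load(response)
```

Splitting the request construction from the network call makes it easy to inspect exactly what would be sent before posting millions of annotations.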

In this example the article and the PDF are linked by including the DOI and PDF fingerprint in the same annotation (thinking about this I should probably also have included the PDF URL in document.highwire.pdf_url[]). I extracted the PDF fingerprint using mutool and added that as the urn:x-pdf identifier.
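If mutool is not to hand, the fingerprint can probably be approximated in a few lines of Python. This is a rough sketch and not the method used above. It assumes the fingerprint is the first element of the /ID array in the PDF trailer, written as a hex string, and (as far as I understand the PDF.js behaviour) falls back to an MD5 hash of the first 1024 bytes when there is no /ID. It will miss IDs written as literal strings rather than hex. The name pdf_fingerprint is hypothetical.

```python
import hashlib
import re

# Matches the first hex string of an /ID array, e.g. /ID [<6124E7...> <...>]
ID_PATTERN = re.compile(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>")


def pdf_fingerprint(data: bytes) -> str:
    """Return a lower-case hex fingerprint for the raw bytes of a PDF.

    Assumption: the fingerprint is the first /ID element of the trailer.
    The last match is used because incrementally updated PDFs append
    newer trailers at the end of the file.
    """
    matches = ID_PATTERN.findall(data)
    if matches:
        return matches[-1].decode("ascii").lower()
    # Assumed fallback: MD5 of the first 1024 bytes of the file.
    return hashlib.md5(data[:1024]).hexdigest()
```

The result would then go into the annotation as "urn:x-pdf:" followed by the fingerprint. It is worth checking the output against mutool for a few files before trusting it.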

The actual annotation itself is described twice, once using character position (start and end of the text string relative to the cleaned text extracted from the PDF) and once by including short fragments of text before and after the bit I want to highlight (Alpaida venger sp. nov.). In my limited experience so far this combination seems to provide enough information for hypothes.is to also locate the annotation in the HTML version of the article (if one exists).
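To make the two descriptions concrete, here is a small sketch of how both selectors could be generated from the extracted text and the name to highlight. It assumes you already have the cleaned text of the PDF as a single string and that its character offsets match the text hypothes.is sees, which is the fragile part. The function name make_selectors and the default of 32 characters of context are illustrative choices, not part of the API.

```python
def make_selectors(text, exact, context=32):
    """Build TextQuoteSelector and TextPositionSelector dicts for `exact`.

    `text` is the full cleaned text of the document (assumed to match what
    the hypothes.is client extracts); `context` is how many characters of
    prefix and suffix to keep around the match.
    """
    start = text.find(exact)  # first occurrence only
    if start == -1:
        raise ValueError(f"{exact!r} not found in text")
    end = start + len(exact)
    return [
        {
            "type": "TextQuoteSelector",
            "exact": exact,
            "prefix": text[max(0, start - context):start],
            "suffix": text[end:end + context],
        },
        {
            "type": "TextPositionSelector",
            "start": start,
            "end": end,
        },
    ]
```

The returned list can be used directly as the "selector" value of a target. Note that this picks the first occurrence of the string, so if a name appears several times you would want to restrict the search to the page where you know the name is published.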

You can see the result for yourself using the hypothes.is proxy (https://via.hypothes.is). Here is the annotation on the PDF (https://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf)


and here is the annotation on HTML (https://doi.org/10.1590/1678-476620151053372375)



If you download the PDF onto your computer and open the file in Chrome you can also see the annotation in the PDF (to do this you will need to install the hypothes.is extension for Chrome and click the hypothes.is symbol on Chrome's toolbar).

In summary, we have a pretty straightforward way to automatically annotate papers offline using just the PDF.