Wednesday, September 23, 2020

Using the Hypothesis API to annotate PDFs

With somewhat depressing regularity I keep cycling back to things I was working on earlier but never quite got to work the way I wanted. The last couple of days it's the turn of Hypothesis.

One of the things I'd like to have is a database of all taxonomic names such that if you clicked on a name you would get not only the bibliographic record for the publication where that name first appeared (which is what I've been building for animals in BioNames) but could also see the actual publication with the name highlighted in the text. This assumes that the publication has been digitised (say, as a PDF) and is accessible, but let's assume that this is the case. Now, we could do this manually, but we have tools to find taxonomic names in text. And in my use case I often know which page the name is on, and what the name is, so all I really want is to be able to highlight it programmatically (because I have millions of names to deal with).

So, time to revisit the Hypothesis API. One of the neat "tricks" Hypothesis has managed is the ability to annotate, say, a web page for an article and have that annotation automagically appear on the PDF version of the same article. As described in How Hypothesis interacts with document metadata, this is in part because Hypothesis extracts metadata from the article's web page, such as the DOI and the link to the PDF, and stores that with the annotation (I say "in part" because the other part of the trick is to be able to locate annotations in different versions of the same text). If you annotate a PDF, Hypothesis stores the URL of the PDF and also a "fingerprint" of the PDF (see PDF Fingerprinting for details). This means that you can also add an annotation to a PDF offline (for example, on a file you have downloaded onto your computer) and, if Hypothesis has already encountered this PDF, that annotation will appear in the PDF online.

What I want to do is take a PDF, highlight the scientific name, and upload that annotation to Hypothesis so that it is visible online when anyone opens the PDF (and ideally when they look at the web version of the same article). I want to do this programmatically. Long story short, this seems doable. Here is an example annotation that I created and sent to Hypothesis via their API:

    {
        "uri": "",
        "document": {
            "highwire": {
                "doi": []
            },
            "dc": {
                "identifier": []
            },
            "link": [
                {
                    "href": "urn:x-pdf:6124e7bdb33241429158b11a1b2c4ba5"
                }
            ]
        },
        "tags": [],
        "target": [
            {
                "source": "",
                "selector": [
                    {
                        "type": "TextQuoteSelector",
                        "exact": "Alpaida venger sp. nov.",
                        "prefix": "imens preserved in 75% ethanol. ",
                        "suffix": " (Figs 1-9) Type-material. Holot"
                    },
                    {
                        "type": "TextPositionSelector",
                        "start": 4834,
                        "end": 4857
                    }
                ]
            }
        ],
        "user": "",
        "permissions": {
            "read": [],
            "update": [],
            "delete": [],
            "admin": []
        }
    }

In this example the article and the PDF are linked by including the DOI and PDF fingerprint in the same annotation (thinking about this I should probably also have included the PDF URL in document.highwire.pdf_url[]). I extracted the PDF fingerprint using mutool and added that as the urn:x-pdf identifier.
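As a rough illustration of where the urn:x-pdf value comes from, here is a minimal Python sketch. It assumes the fingerprint is the first hex string in the PDF trailer's /ID array, falling back to an MD5 of the first 1024 bytes if there is no /ID; that is my understanding of how PDF.js-based viewers derive it, not something confirmed here. The function names are hypothetical, and this is not a replacement for mutool: it only scans the raw bytes and will miss IDs stored in other forms.

```python
import hashlib
import re

# Matches the first hex string inside a trailer "/ID [<...> <...>]" array.
ID_PATTERN = re.compile(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>")


def pdf_fingerprint(data: bytes) -> str:
    """Return a fingerprint for raw PDF bytes (assumed scheme, see text)."""
    match = ID_PATTERN.search(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Assumed fallback: MD5 of the first 1024 bytes when no /ID is found.
    return hashlib.md5(data[:1024]).hexdigest()


def fingerprint_urn(data: bytes) -> str:
    """Format the fingerprint as the urn:x-pdf identifier used in the annotation."""
    return "urn:x-pdf:" + pdf_fingerprint(data)
```

For a downloaded file you would read it in binary mode and pass the bytes to fingerprint_urn().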

The annotation itself is described twice: once using character positions (the start and end of the text string relative to the cleaned text extracted from the PDF), and once using short fragments of text before and after the bit I want to highlight (Alpaida venger sp. nov.). In my limited experience so far this combination seems to provide enough information for Hypothesis to also locate the annotation in the HTML version of the article (if one exists).
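The two selectors can be generated from the extracted text and the name to highlight. The sketch below is a minimal version, assuming the text has already been extracted and cleaned the same way the annotation client sees it (otherwise the offsets will not line up). The function name make_selectors is hypothetical, and the default of 32 characters of context is simply the length of the prefix and suffix in the example above.

```python
def make_selectors(text: str, name: str, context: int = 32) -> list:
    """Build quote and position selectors for the first occurrence of name."""
    start = text.find(name)
    if start == -1:
        raise ValueError(f"{name!r} not found in text")
    end = start + len(name)
    return [
        {
            "type": "TextQuoteSelector",
            "exact": name,
            # Context on either side helps re-anchor the quote elsewhere.
            "prefix": text[max(0, start - context):start],
            "suffix": text[end:end + context],
        },
        {
            "type": "TextPositionSelector",
            "start": start,
            "end": end,
        },
    ]
```

The returned list can be used as the value of "selector" in the annotation's target.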

You can see the result for yourself using the Hypothesis proxy, which shows the annotation both on the PDF and on the HTML version of the article.

If you download the PDF onto your computer and open the file in Chrome you can also see the annotation in the PDF (to do this you will need to install the Hypothesis extension for Chrome and click its icon on Chrome's toolbar).

In summary, we have a pretty straightforward way to automatically annotate papers offline using just the PDF.
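For completeness, here is a minimal sketch of sending an annotation like the one above to the API using only the Python standard library. It assumes the public Hypothesis endpoint at https://api.hypothes.is/api/annotations and a personal API token sent as a Bearer token; check both against the current API documentation before relying on them. The function name build_annotation_request is hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for creating annotations.
API_URL = "https://api.hypothes.is/api/annotations"


def build_annotation_request(annotation: dict, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that creates an annotation."""
    body = json.dumps(annotation).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# To send it for real:
#     with urllib.request.urlopen(build_annotation_request(annotation, token)) as response:
#         created = json.load(response)
```

Building the request separately from sending it makes it easy to inspect the payload before uploading millions of annotations.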

Friday, September 11, 2020

Darwin Core Million reminder, and thoughts on bad data

The following is a guest post by Bob Mesibov.

No winner yet in the second Darwin Core Million for 2020, but there are another two and a half weeks to go (to 30 September). For details of the contest see this iPhylo blog post. And please don’t submit a million RECORDS, just (roughly) a million DATA ITEMS. That’s about 20,000 records with 50 fields in the table, or about 50,000 records with 20 fields, or something arithmetically similar.
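To check whether a table is in the right ballpark, the count is just records times fields. A minimal sketch, assuming a tab-delimited table whose first line is a header (the function name count_data_items is hypothetical):

```python
import csv
import io


def count_data_items(table_text: str, delimiter: str = "\t") -> int:
    """Return records x fields for a delimited table with a header line."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter=delimiter))
    if not rows:
        return 0
    header, records = rows[0], rows[1:]
    return len(records) * len(header)
```

By the same arithmetic, 20,000 records with 50 fields and 50,000 records with 20 fields both come to 1,000,000 data items.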

The purpose of the Darwin Core Million is to celebrate high-quality occurrence datasets. These are extraordinarily rare in biodiversity informatics.

I’ll unpick that. I’m not talking about the accuracy of the records. For most records, the “what”, “where”, “when” and “by whom” are probably correct. An occurrence record is a simple fact: Wilma Flintstone collected a flowering specimen of an Arizona Mountain Dandelion 5 miles SSE of Walker, California on 27 June 2019. More technically, she collected Agoseris parviflora at 38.4411 –119.4393, as recorded by her handheld GPS.

What could possibly go wrong in compiling a dataset of simple records like that in a spreadsheet or database? Let me count a few of the ways:

  • data items get misspelled or misnumbered
  • data items get put in the wrong field
  • data items are put in a field for which they are invalid or inappropriate
  • data items that should be entered get left out
  • data items get truncated
  • data items contain information better split into separate fields
  • data items contain line breaks
  • data items get corrupted by copying down in a spreadsheet
  • data items disagree with other data items in the same record
  • data items refer to unexplained entities (“habitat type A”)
  • paired data items don’t get paired (e.g. latitude but no longitude)
  • the same data item appears in different formats in different records
  • missing data items are represented by blanks, spaces, “?”, “na”, “-”, “unknown”, “not recorded” etc, all in the same data table
  • character encoding failures create gibberish, question marks and replacement characters (�)
  • weird control characters appear in data items, and parsing fails
  • dates get messed up (looking at you, Excel)
  • records get duplicated after minor edits
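Several of the problems in this list can be detected mechanically. The following is a minimal Python sketch, not a full audit: it assumes each record is a dict of strings keyed by field name, uses the Darwin Core terms decimalLatitude and decimalLongitude for the coordinate pair, and treats a small, illustrative set of tokens as "missing". The function names and the token list are assumptions made for this example.

```python
import unicodedata

# Illustrative placeholders for missing data (compared in lower case).
MISSING_TOKENS = {"", "?", "na", "n/a", "-", "unknown", "not recorded"}


def is_missing(value: str) -> bool:
    return value.strip().lower() in MISSING_TOKENS


def audit_records(records):
    """Return a list of (row_number, field, problem) tuples."""
    problems = []
    for row_number, record in enumerate(records, start=1):
        for field, value in record.items():
            if value != value.strip():
                problems.append((row_number, field, "leading/trailing whitespace"))
            if "\n" in value or "\r" in value:
                problems.append((row_number, field, "line break"))
            if "\ufffd" in value:
                problems.append((row_number, field, "replacement character"))
            # Control characters other than tab and line breaks.
            if any(unicodedata.category(ch) == "Cc" and ch not in "\n\r\t"
                   for ch in value):
                problems.append((row_number, field, "control character"))
        has_lat = not is_missing(record.get("decimalLatitude", ""))
        has_lon = not is_missing(record.get("decimalLongitude", ""))
        if has_lat != has_lon:
            problems.append((row_number, "decimalLatitude/decimalLongitude",
                             "unpaired coordinate"))
    return problems


def missing_value_styles(records):
    """Return the distinct placeholders used for missing data in the table."""
    return {value for record in records for value in record.values()
            if is_missing(value)}
```

If missing_value_styles() returns more than one placeholder, the table is mixing representations of "no data".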

In previous blog posts (here and here) I’ve looked at explanations for poor-quality data at the project, institution and agency level — data sources I referred to collectively as the “PIA”. I don’t think any of those explanations are controversial. Here I’m going to be rude and insulting and say there are three further obstacles to creating good, usable and shareable occurrence data:

Datasets are compiled as though they were family heirlooms.

The PIA says “This database is OUR property. It’s for OUR use and WE understand the data, even if it’s messy and outsiders can’t figure out what we’ve done. Ambiguities? No problem, we’ll just email Old Fred. He retired a few years back but he knows the system back to front.”

Prising data items from these heirlooms, mapping them to new fields and cleaning them are complicated exercises best left to data specialists. That’s not what happens.

Datasets are too often compiled by people with inadequate computer skills.

Their last experience of data management was building a spreadsheet in a “digital learning” class. They’re following instructions but they don’t understand them. Both the data enterers and their instructors are hoping for a good result, which is truly courageous optimism.

The (often huge) skills gap between the compilers of digital PIA data and the computer-savvy people who analyse and reformat/repackage the data (users and facilitators-for-users) could be narrowed programmatically, but isn’t. Hands up all those who use a spreadsheet for data entry by volunteers and have comprehensive validation rules for each of the fields? Thought so.
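Per-field validation rules do not need to be elaborate. Here is a minimal Python sketch of the idea, assuming records arrive as dicts of strings and using three Darwin Core field names as examples. The rule set, the helper names and the choice of ISO 8601 dates are illustrative assumptions, not a recommended standard.

```python
from datetime import date


def valid_iso_date(value: str) -> bool:
    """Accept only dates that parse as ISO 8601, e.g. 2019-06-27."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def in_range(low: float, high: float):
    """Build a rule that accepts numbers between low and high inclusive."""
    def check(value: str) -> bool:
        try:
            return low <= float(value) <= high
        except ValueError:
            return False
    return check


# One rule per field; fields without a rule are not checked.
RULES = {
    "eventDate": valid_iso_date,
    "decimalLatitude": in_range(-90, 90),
    "decimalLongitude": in_range(-180, 180),
}


def validate(record: dict) -> list:
    """Return the names of fields whose values break their rule."""
    return [field for field, rule in RULES.items()
            if field in record and not rule(record[field])]
```

Running validate() on each record at entry time would catch a slash-formatted date or an out-of-range latitude before it reaches the shared dataset.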

People confuse software with data.

This isn’t a problem restricted to biodiversity informatics, and I’ve ranted about this issue elsewhere. The effect is that data compilers blame software for data problems and don’t accept responsibility for stuff-ups.

Sometimes that blaming is justified. As a data auditor I dread getting an Excel file, because I know without looking that the file will have usability and shareability issues on top of the usual spreadsheet errors. Excel isn’t an endpoint in a data-use pipeline, it’s a starting point and a particularly awful one.

Another horror is the export option. Want to convert your database of occurrence records to format X? Just go to the “Save as” or “Export data” menu item and click “OK”. Magic happens and you don’t need to check the exported file in format X to see that all is well. If all is not well, it’s the software’s fault, right? Not your problem.

In view of these and the previously blogged-about explanations for bad data, it’s a wonder that there are any high-quality datasets, but there are. I’ve audited them and it’s a shame that for ethical reasons I can’t enter them myself in the current Darwin Core Million.