Wednesday, September 02, 2015

Hypothes.is revisited: annotating articles in BioStor

Over the weekend, out of the blue, Dan Whaley commented on an earlier blog post of mine (Altmetrics, Disqus, GBIF, JSTOR, and annotating biodiversity data). Dan is the project lead for hypothes.is, a tool to annotate web pages. I was a bit dismissive, as hypothes.is falls into the "sticky note" camp of annotation tools, which I've previously been critical of.

However, I decided to take another look at hypothes.is, and it looks like a great fit for another annotation problem I have, namely augmenting and correcting OCR text in BioStor (and, by extension, BHL). For a subset of BioStor I've been able to add text to the page images, so you can select that text as you would on a web page or in a PDF with searchable text. If you can select text, you can annotate it using hypothes.is. Then I discovered that hypothes.is is not only a Chrome extension (which immediately limits who will use it); you can also add it to any web site that you publish. So, as an experiment, I've added it to BioStor, so that people can comment on BioStor using any modern browser.

So far, so good, but the problem is I'm relying on the "crowd" to come along and manually annotate the text. But I have code that can take text and extract geographic localities (e.g., latitude and longitude pairs), specimen codes, and taxonomic names. What I'd really like to do is be able to pre-process the text, locate these features, then programmatically add those as annotations. Who wants to do this manually when a computer can do most of it?
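To make the pre-processing step concrete, here is a minimal sketch of pulling latitude and longitude pairs out of a block of text. It is not the BioStor code, just an illustration: the function names are made up, and it assumes coordinates are written as degrees and minutes (with optional seconds) followed by a hemisphere letter, which OCR text will often violate.

```python
import re

# One coordinate: degrees, minutes, optional seconds, hemisphere letter.
# Assumed format only, e.g. 17°30'S or 145°15'30"E.
COORD = r"(\d{1,3})\s*°\s*(\d{1,2})\s*['′]\s*(?:(\d{1,2}(?:\.\d+)?)\s*[\"″]\s*)?([NSEW])"
PAIR = re.compile(COORD + r"\s*[,;]?\s*" + COORD)


def to_decimal(deg, minutes, seconds, hemi):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = int(deg) + int(minutes) / 60 + (float(seconds) if seconds else 0.0) / 3600
    return -value if hemi in "SW" else value


def extract_localities(text):
    """Return each latitude/longitude pair with its position in the text."""
    results = []
    for m in PAIR.finditer(text):
        d1, m1, s1, h1, d2, m2, s2, h2 = m.groups()
        # Only accept latitude followed by longitude.
        if h1 in "NS" and h2 in "EW":
            results.append({
                "quote": m.group(0),  # exact text, useful for anchoring an annotation
                "start": m.start(),
                "end": m.end(),
                "latitude": to_decimal(d1, m1, s1, h1),
                "longitude": to_decimal(d2, m2, s2, h2),
            })
    return results
```

Keeping the matched string and its offsets alongside the decimal values means the same result can later be used to anchor an annotation to the exact text on the page.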

Hypothes.is, it turns out, has an API that, while a bit *cough* skimpy on documentation, enables you to add annotations to text. So now I could pre-process the text, and just ask people to add things that have been missed, or flag errors in the automated annotations.
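As a rough illustration of what adding an annotation programmatically might look like, the sketch below builds an annotation anchored to a quoted run of text and wraps it in an HTTP request. The details are assumptions rather than anything taken from this post: the API root, the /annotations endpoint, the bearer-token header, and the TextQuoteSelector payload shape follow my understanding of the public hypothes.is API and may differ from the version used here. The function names are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://hypothes.is/api"  # assumed API root


def build_annotation(page_url, quote, body, tags=()):
    """Describe an annotation anchored to an exact run of text on a page."""
    return {
        "uri": page_url,
        "text": body,  # annotation body; Markdown is supported
        "tags": list(tags),
        "target": [{
            "source": page_url,
            # Assumed selector shape: anchor by quoting the text itself.
            "selector": [{"type": "TextQuoteSelector", "exact": quote}],
        }],
    }


def build_request(annotation, token):
    """Wrap the annotation in an authenticated POST request (not yet sent)."""
    return urllib.request.Request(
        API_ROOT + "/annotations",
        data=json.dumps(annotation).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,  # token is a placeholder
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would then be a matter of passing the result of build_request to urllib.request.urlopen with a real API token. Building the request separately from sending it makes it easy to inspect what would be posted before touching the live service.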

This is all still very preliminary, but as an example here's a screen shot of a page in BioStor together with geographic annotations displayed using hypothes.is (you can see this live here: http://biostor.org/reference/147608/page/1; make sure you click on the widgets at the top right of the page to see the annotations):

[Screenshot: Hypothesis]

The page shows two point localities that have been extracted from the text, together with a static Google Map showing the localities (hypothes.is supports Markdown in annotations, which enables links and images to be embedded).
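An annotation body like this can be generated from the extracted coordinates. The sketch below is one way to do it, assuming the Google Static Maps URL format with its size, markers, and key parameters (the service may require an API key); the function name and the layout of the Markdown are my own choices, not necessarily what BioStor produces.

```python
from urllib.parse import urlencode

STATIC_MAP = "https://maps.googleapis.com/maps/api/staticmap"


def locality_markdown(points, size="300x200", api_key=None):
    """Build a Markdown body listing the points, followed by a static map image."""
    params = [("size", size)]
    # One markers parameter per point, as "latitude,longitude".
    params += [("markers", f"{lat},{lon}") for lat, lon in points]
    if api_key:
        params.append(("key", api_key))
    lines = [f"- {lat}, {lon}" for lat, lon in points]
    lines.append("")
    lines.append(f"![Map of localities]({STATIC_MAP}?{urlencode(params)})")
    return "\n".join(lines)
```

Calling locality_markdown([(-17.5, 145.25)]) (a made-up point) gives a one-item list and an image link that hypothes.is would render as an embedded map.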

Not only can we write annotations, we can also read them, so if someone adds an annotation (e.g., highlights a specimen code that was missed, or some text that OCR has messed up) we could retrieve that and, for example, index the corrected text to improve findability.
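Reading annotations back could look something like the sketch below. Again this rests on assumptions: a /search endpoint that takes uri and limit parameters and returns JSON with a "rows" list, where each row carries its text and a target with a TextQuoteSelector. The function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_ROOT = "https://hypothes.is/api"  # assumed API root


def search_url(page_url, limit=50):
    """URL that asks for the annotations made on one page."""
    return API_ROOT + "/search?" + urlencode({"uri": page_url, "limit": limit})


def quoted_notes(search_result):
    """Pair each annotated run of text with what the annotator wrote about it."""
    notes = []
    for row in search_result.get("rows", []):
        for target in row.get("target", []):
            for selector in target.get("selector", []):
                if selector.get("type") == "TextQuoteSelector":
                    notes.append((selector.get("exact", ""), row.get("text", "")))
    return notes


def fetch_notes(page_url):
    """Fetch and parse annotations for a page (makes a network request)."""
    with urllib.request.urlopen(search_url(page_url)) as response:
        return quoted_notes(json.load(response))
```

Each (quoted text, comment) pair could then be fed into an index, for example treating the comment on a misread word as the corrected text for that position.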

Lots still to do, but these early experiments are very encouraging.