Wednesday, September 02, 2015

Hypothes.is revisited: annotating articles in BioStor

Over the weekend, out of the blue, Dan Whaley commented on an earlier blog post of mine (Altmetrics, Disqus, GBIF, JSTOR, and annotating biodiversity data). Dan is the project lead for hypothes.is, a tool to annotate web pages. I was a bit dismissive, as hypothes.is falls into the "sticky note" camp of annotation tools, which I've previously been critical of.

However, I decided to take another look at hypothes.is, and it looks like a great fit for another annotation problem I have, namely augmenting and correcting OCR text in BioStor (and, by extension, BHL). For a subset of BioStor I've been able to add text to the page images, so you can select that text as you would on a web page or in a PDF with searchable text. If you can select text, you can annotate it using hypothes.is. Then I discovered that hypothes.is is not just a Chrome extension (which would immediately limit who will use it): you can also add it to any web site that you publish. So, as an experiment, I've added it to BioStor, so that people can comment on BioStor using any modern browser.

So far, so good, but the problem is I'm relying on the "crowd" to come along and manually annotate the text. But I have code that can take text and extract geographic localities (e.g., latitude and longitude pairs), specimen codes, and taxonomic names. What I'd really like to do is to be able to pre-process the text, locate these features, then programmatically add those as annotations. Who wants to do this manually when a computer can do most of it?

Hypothes.is, it turns out, has an API that, while a bit *cough* skimpy on documentation, enables you to add annotations to text. So now I could pre-process the text, and just ask people to add things that have been missed, or flag errors in the automated annotations.
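As a sketch of what programmatic annotation could look like: the endpoint, token, and example quote below are placeholders, and the payload shape should be checked against the current hypothes.is API documentation, but the idea is to anchor each annotation to an exact text quote on the page.

```python
import json
import urllib.request

API = "https://hypothes.is/api"   # assumed API root; check the docs
TOKEN = "YOUR-API-TOKEN"          # hypothetical personal API token

def make_annotation(uri, exact, note):
    """Build a minimal annotation payload anchored to an exact text quote."""
    return {
        "uri": uri,
        "text": note,
        "target": [{
            "source": uri,
            "selector": [{"type": "TextQuoteSelector", "exact": exact}],
        }],
    }

def post_annotation(payload):
    """POST the annotation (requires a valid token)."""
    req = urllib.request.Request(
        API + "/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": "Bearer " + TOKEN,
                 "Content-Type": "application/json"})
    return urllib.request.urlopen(req)

payload = make_annotation(
    "http://biostor.org/reference/147608/page/1",
    "Mt. Albert Edward",
    "Point locality extracted automatically; see map.")
```

The same API can be read as well as written, so annotations added by people could be harvested back for indexing.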

This is all still very preliminary, but as an example here's a screen shot of a page in BioStor together with geographic annotations displayed using hypothes.is (you can see this live at http://biostor.org/reference/147608/page/1; make sure you click on the widgets at the top right of the page to see the annotations):

Hypothesis

The page shows two point localities that have been extracted from the text, together with a static Google Map showing the localities (hypothes.is supports Markdown in annotations, which enables links and images to be embedded).

Not only can we write annotations, we can also read them, so if someone adds an annotation (e.g., highlights a specimen code that was missed, or some text that OCR has messed up) we could retrieve that and, for example, index the corrected text to improve findability.

Lots still to do, but these early experiments are very encouraging.

Tuesday, September 01, 2015

Dark taxa, drones, and Dan Janzen: 6th International Barcode of Life Conference

A little over a week ago I was at the 6th International Barcode of Life Conference, held at Guelph, Canada. It was my first barcoding conference, and was quite an experience. Here are a few random thoughts.

Attendees

It was striking how diverse the conference crowd was. Apart from a few ageing systematists (including veterans of the cladistics wars), most people were young(ish), and from all over the world. There is clearly something about the simplicity and low barrier to entry of barcoding that has enabled its widespread adoption. This also gives barcoding a cohesion: no matter what the taxonomic group or the problem you are tackling, you are doing much the same thing as everybody else (but see below). While ageing systematists (like myself) may hold their noses regarding the use of a single, short DNA sequence and a tree-building method some would dismiss as "phenetic", in many ways the conference was a celebration of global-scale phylogeography.

Standards aren't enough

And yet, standards aren't enough. I think what contributes to DNA barcoding's success is that sequences are computable. If you have a barcode, there's already a bunch of barcode sequences you can compare yours to. As others add barcodes, your sequences will be included in subsequent analyses, analyses which may help resolve the identity of what you sequenced.

To put this another way, we have standard image file formats, such as JPEG. This means you can send me a bunch of files, safe in the knowledge that because JPEG is a standard I will be able to open those files. But this doesn't mean that I can do anything useful with them. In fact, it's pretty hard to do anything with images apart from look at them. But if you send me a bunch of DNA sequences for the same region, I can build a tree, BLAST GenBank for similar sequences, etc. Standards aren't enough by themselves: to get the explosive growth that we see in barcodes, the thing you standardise on needs to be easy to work with, and have a computational infrastructure in place.

Next generation sequencing and the hacker culture

Classical DNA barcoding for animals uses a single, short mtDNA marker that people were sequencing a couple of decades ago. Technology has moved on, such that we're seeing papers such as An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. As I've argued earlier (Is DNA barcoding dead?) this misses the point about the power of standardisation on a simple, scalable method.

At the same time, it was striking to see the diversity of sequencing methods being used in conference presentations. Barcoding is a broad church, and it seemed like it was a natural home for people interested in environmental DNA. There was excitement about technologies such as the Oxford Nanopore MinION™, with people eager to share tips and techniques. There's something of a hacker culture around sequencing (see also Biohackers gear up for genome editing), just as there is for computer hardware and software.

Community

The final session of the conference started with some community bonding, complete with Paul Hebert versus Quentin Wheeler wielding light sabres. If, like me, you aren't a barcoder, things started getting a little cult-like. But there's no doubt that Paul's achievement in promoting a simple approach to identifying organisms, and then translating that into a multi-million dollar, international endeavour, is quite extraordinary.

After the community bonding, came a wonderful talk by Dan Janzen. The room was transfixed as Dan made the case for conservation, based on his own life experiences, including Area de Conservación Guanacaste where he and Winnie Hallwachs have been involved since the 1970s. I sat next to Dan at a dinner after the conference, and showed him iNaturalist, a great tool for documenting biodiversity with your phone. He was intrigued, and once we found pictures taken near his house in Costa Rica, he was able to identify the individual animals in the photos, such as a bird that has since been eaten by a snake.

Dark taxa

My own contribution to the conference was a riff on the notion of dark taxa, and mostly consisted of me trying to think through how to respond to DNA barcoding. The three responses to barcoding that I came up with are:
  1. By comparison to barcoding, classical taxonomy is digitally almost invisible, with great chunks of the literature still not scanned or accessible. So, one response is to try and get the core data of taxonomy digitised and linked as fast as possible. This is why I built BioStor and BioNames, and why I continually rant about the state of efforts to digitise taxonomy.
  2. This is essentially President Obama's "bucket" approach: maybe the sane thing to do is to see barcoding as the future and envisage what we could do in a sequence-only world. This is not to argue that we should ignore the taxonomic literature as such, but rather that we start with sequences first and see what we can do. This is the motivation for my Displaying a million DNA barcodes on Google Maps using CouchDB, and my experiments with Visualising Geophylogenies in Web Maps Using GeoJSON. These barely scratch the surface of what can be done.
  3. The third approach is to explore how we integrate taxonomy and barcoding at global scale, in which case linking at the specimen level (rather than, say, using taxonomic names) seems promising, albeit requiring a massive effort to reconcile multiple specimen identifiers.

Summary

Yes, the barcoding conference was that rare thing, a well organised (including well-fed), interesting, indeed eye-opening, conference.

Friday, August 14, 2015

Possible project: NameStream - a stream of new taxonomic names

Yet another barely thought out project, although this one has some crude code. If some 16,000 new taxonomic names are published each year, then that is roughly 40 per day. We don't have a single place that aggregates these, so any major biodiversity project is by definition out of date. GBIF itself hasn't had an updated list of fungi or plant names for several years, and at present doesn't have an up-to-date list of animal names. You just have to follow the Twitter feeds of ZooKeys and Zootaxa to feel swamped in new names.

And yet, most nomenclators are pumping out RSS feeds of new names, or have APIs that support time-based queries (i.e., send me the names added in the last month). Wouldn't it be great to have a single aggregator that took these "name streams", augmented them by adding links to the literature (it could, for example, harvest the RSS feeds and Twitter streams of the relevant journals), and provided the biodiversity community with a feed of new names and associated supporting information? We could watch discoveries of new biodiversity unfold in near real time, as well as provide a stream of data for projects such as GBIF to ingest and keep their databases up to date.
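The core of such an aggregator is just polling each nomenclator's feed and normalising the items. A sketch of the parsing step, assuming a plain RSS 2.0 feed (the sample item is invented):

```python
import xml.etree.ElementTree as ET

def parse_name_stream(rss_xml):
    """Extract new-name items (title, link, date) from an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [{
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "date": item.findtext("pubDate"),
    } for item in root.iter("item")]

# A fake feed item standing in for what a nomenclator might publish
sample = """<rss version="2.0"><channel>
<item>
  <title>Aus bus Smith, 2015</title>
  <link>http://example.org/names/1</link>
  <pubDate>Fri, 14 Aug 2015 00:00:00 GMT</pubDate>
</item>
</channel></rss>"""

names = parse_name_stream(sample)
```

Each normalised item could then be deduplicated against names already seen, and pushed out as a single combined feed.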

Possible project: A PubMed Central for taxonomy

I need more time to sketch this out fully, but I think a case can be made for a taxonomy-centric (or, perhaps more usefully, a biodiversity-centric) clone of PubMed Central.

Why? We already have PubMed Central, and a European version Europe PubMed Central, and the content of Open Access journals such as ZooKeys appears in both, so, again, why?

Here are some reasons:

  1. PubMed Central has pioneered the use of XML to archive and publish scientific articles, specifically JATS XML. But the biodiversity literature comes in all sorts of formats, including several flavours of XML (such as SciElo XML, XML from OCR literature such as DjVu, ABBYY, and TEI, etc.)
  2. While Europe PMC is doing nice things with ORCIDs and entity extraction, it doesn't deal with the kinds of entities prevalent in the taxonomic literature (such as geographic localities, specimen codes, micro citations, etc.). Nor does it deal with extracting the core scientific statements that a taxonomic paper makes.
  3. Given that much of the taxonomic literature will be derived from OCR we need mechanisms to be able to edit and fix OCR errors, as well as markup errors in more recent XML-native documents. For example, we could envisage having XML stored on GitHub and open to editing.
  4. We need to embed taxonomic literature in our major biodiversity databases, rather than have it consigned to the ghettos of individual small-scale digitisation projects, or swamped by biomedical literature whose funding and goals may not align well with the biodiversity community (Europe PMC is funded primarily by medical bodies).
I hope to flesh this out a bit more later on, but I think it's time we started treating the taxonomic literature as a core resource that we, as a community, are responsible for. The NIH did this with biomedical research, shouldn't we be doing the same for biodiversity?

Monday, August 10, 2015

Demo of full-text indexing of BHL using CouchDB hosted by Cloudant

One of the limitations of the Biodiversity Heritage Library (BHL) is that, unlike say Google Books, its search functions are limited to searching metadata (e.g., book and article titles) and taxonomic names. It doesn't support full-text search, by which I mean you can't just type in the name of a locality, specimen code, or a phrase and expect to get back much in the way of results. In fact, in many cases when I Google a phrase that occurs in BHL content I'm more likely to find that phrase in content from the Internet Archive, and then it's a matter of following the links to the equivalent item in BHL.

So, as an experiment I've created a live demo of what full-text search in BHL could look like. I've done this using the same infrastructure the new BioStor is built on, namely CouchDB hosted by Cloudant. Using BHL's API I've grabbed some volumes of the British Ornithological Club's Bulletin and put them into CouchDB (BHL's API serves up JSON, so this is pretty straightforward to do). I've added the OCR text for each page, and asked Cloudant to index that. This means that we can now search on a phrase in BHL (in the British Ornithological Club's Bulletin) and get a result.

I've made a quick and dirty demo of this approach; you can see it in the "Labs" section on BioStor, or try it here. You should see something like this:

The page image only appears if you click on the blue labels for the page. None of this is robust or optimised, but it is a workable proof-of-concept of how full-text search could work.

What could we do with this? Well, all sorts of searches are now possible. We can search for museum specimen codes, such as 1900.2.27.13. This specimen is in GBIF (see http://bionames.org/~rpage/material-examined/www/?code=BMNH%201900.2.27.13), so we could imagine starting to link specimens to the scientific literature about those specimens. We can also search for locations (such as Mt. Albert Edward), or common names (such as crocodile).

Note that I've not completed uploading all the page text and XML. Once I do I'll have a better idea of how scalable this approach is. But the idea of having full-text search across all of BHL (or, at least the core taxonomic journals) is tantalising.

Technical details

Initially I simply displayed a list of the pages that matched the search term, together with a fragment of text with the search term highlighted. Cloudant's version of CouchDB provides these highlights, and a "group_field" that enabled me to group together pages from the same BHL "item" (roughly corresponding to a volume of a journal).
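For reference, a search request along these lines can be built as a URL (the design document and index names below are hypothetical; "q", "group_field", and "highlight_fields" are Cloudant search parameters, worth checking against the Cloudant docs):

```python
from urllib.parse import urlencode

def search_url(account, db, ddoc, index, phrase):
    """Build a Cloudant search request that groups page hits by BHL item
    and asks for highlighted fragments of the OCR text."""
    params = urlencode({
        "q": 'text:"%s"' % phrase,
        "group_field": "item",
        "highlight_fields": '["text"]',
    })
    return "https://%s.cloudant.com/%s/_design/%s/_search/%s?%s" % (
        account, db, ddoc, index, params)

url = search_url("ACCOUNT", "bhl", "search", "pages", "Mt. Albert Edward")
```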

This was a nice start, but I really wanted to display the hits on the actual BHL page. To do this I grabbed the DjVu XML for each BHL page for the British Ornithological Club's Bulletin, and used an XSLT style-sheet that renders the OCR text on top of the page image. You can't see the text because I set the colour of the text to "rgba(0, 0, 0, 0)" (see http://stackoverflow.com/a/10835846) and set the "overflow" style to "hidden". But the text is there, which means you can select it with the mouse and copy and paste it. This still leaves the problem of highlighting the text that matches the search term. I originally wrote the code for this to handle species names, which comprise two words. So, each DIV in the HTML has a "data-one-word" and a "data-two-words" attribute set, which contain the first (and first plus second) word(s) in the search term, respectively. I then use a jQuery selector to set the CSS of each DIV that has a "data-one-word" or "data-two-words" attribute that matches the search term(s). Obviously, this is terribly crude, and doesn't do well if you have more than two words in your search query.

As an added feature, I use CSS to convert the BHL page scan to a black-and-white image (works in Webkit-based browsers).

Possible project: mapping authors to Wikipedia entries using lists of published works

One of the less glamorous but necessary tasks of data cleaning is mapping "strings to things", that is, taking strings such as "George A. Boulenger" and mapping them to identifiers, such as ISNI: 0000 0001 0888 841X. In the case of authors such as George Boulenger, one way to do this would be through Wikipedia, which has entries for many scientists, often linked to identifiers for those people (see the bottom of the Wikipedia page for George A. Boulenger and look at the "Authority control" section).

How could we make these mappings? Simple string matching is one approach, but it seems to me that a more robust approach could use bibliographic data. For example, if I search for George A. Boulenger in BioStor, I get lots of publications. If at least some of these were listed on the Wikipedia page for this person, together with links back to BioStor (or some other external identifier, such as DOIs), then we could do the following:

  1. Search Wikipedia for names that matched the author name of interest
  2. If one or more matches are found, grab the text of the Wikipedia pages, extract any literature cited (e.g., in the {{cite}} templates), get the bibliographic identifiers, and see if they match any in our search results.
  3. If we get one or more hits, then it's likely that the Wikipedia page is about the author of the papers we are looking at, and so we link to it.
  4. Once we have a link to Wikipedia, extract any external identifier for that person, such as ISNI or ORCID.
For this to work, it requires that the Wikipedia page cites works by the author in a way that we can harvest, and uses identifiers that we can match to those in the relevant database (e.g., BioStor, CrossRef, etc.). We might also have to look at Wikipedia pages in multiple languages, given that English-language Wikipedia may be lacking information on scholars from non-English speaking countries (this will be a significant issue for many early taxonomists).
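Step 2 can be sketched as follows, assuming we already have the page wikitext and a set of DOIs for the author's papers from BioStor or CrossRef. The regex is a crude stand-in for proper {{cite}} template parsing, and the DOI in the example is made up:

```python
import re

def extract_dois(wikitext):
    """Pull DOI parameters out of {{cite ...}} templates in page wikitext.
    A crude regex; real template parsing would be more robust."""
    return set(re.findall(r"doi\s*=\s*(10\.\S+?)(?=\s*[|}])", wikitext))

def probable_match(our_dois, wikitext, threshold=1):
    """Does the page cite enough of 'our' works to trust the link?"""
    return len(set(our_dois) & extract_dois(wikitext)) >= threshold

page = "{{cite journal |last=Boulenger |doi=10.9999/example.1 }}"
hit = probable_match({"10.9999/example.1"}, page)
```

The threshold could be raised for common names, where a single shared citation might be coincidental.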

Based on my limited browsing of Wikipedia, there seems to be little standardisation of entries for people, certainly little in how their published works are listed (the section heading, format, how many, etc.). The project I'm proposing would benefit from a consistent set of guidelines for how to include a scholar's output.

What makes this project potentially useful is that it could help flesh out Wikipedia pages by encouraging people to add lists of published works, it could aid bibliographic repositories like my own BioStor by increasing the number of links they get from Wikipedia, and if the Wikipedia page includes external identifiers then it helps us go from strings to things by giving us a way to locate globally unique identifiers for people.

Sunday, August 09, 2015

More Neo4J tests of GBIF taxonomy: Using IPNI to find objective synonyms

Following on from Testing the GBIF taxonomy with the graph database Neo4J I've added a more complex test that relies on linking taxa to names. In this case I've picked some legume genera (Coursetia and Poissonia) where there have been frequent changes of name. By mapping the GBIF taxa to IPNI names (and associated LSIDs) we can build a graph linking taxa to names, and then to objective synonyms (by resolving the IPNI LSIDs and following the links to the basionym), see http://gist.neo4j.org/?4df5af75d42e0f963e5d.

In this example we find species that occur twice in the GBIF taxonomy, which logically should not happen as the names are objective synonyms. We can detect these problems if we have access to nomenclatural data. In this case, because IPNI has tracked the name changes, we can infer that, say, Coursetia heterantha and Poissonia heterantha are synonyms, and hence only one of these should appear in the GBIF classification. This is an example that illustrates the desirability of separating names and taxa, see Modelling taxonomic names in databases.
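Given a mapping from GBIF taxa to IPNI names and on to basionyms, the duplicate check itself is tiny. A sketch in Python (the LSIDs here are invented for illustration):

```python
def objective_synonyms(names):
    """Group name records by basionym identifier; any basionym shared by
    more than one accepted taxon flags a duplicate in the checklist.
    `names` is a list of dicts with 'name' and 'basionym' (e.g. IPNI LSIDs)."""
    groups = {}
    for n in names:
        groups.setdefault(n["basionym"], []).append(n["name"])
    return {b: ns for b, ns in groups.items() if len(ns) > 1}

names = [
    {"name": "Coursetia heterantha", "basionym": "urn:lsid:ipni.org:names:x1"},
    {"name": "Poissonia heterantha", "basionym": "urn:lsid:ipni.org:names:x1"},
    {"name": "Coursetia dubia",      "basionym": "urn:lsid:ipni.org:names:x2"},
]
dups = objective_synonyms(names)
```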

Possible project: #itaxonomist, combining taxonomic names, DOIs, and ORCID to measure taxonomic impact

Imagine a web site where researchers can go, log in (easily) and get a list of all the species they have described (with pretty pictures and, say, a GBIF map), and a list of all the DNA sequences/barcodes (if any) that they've published. Imagine that this is displayed in a colourful way (e.g., badges), and the results tweeted with the hashtag #itaxonomist.

Imagine that you are not a taxonomist, but have worked with one (e.g., published a paper together): you can go to the site, log in, and discover that you “know” a taxonomist. Imagine if you are a researcher who has cited taxonomic work: you can log in and discover that your work depends on a taxonomist (think six degrees of Kevin Bacon).

Imagine that this is run as a campaign (hashtag #itaxonomist), with regular announcements leading up to the release date. Imagine if #itaxonomist trends. Imagine the publicity for the work taxonomists do, and the new found ability for them to quantitatively demonstrate this.

How does it work?

#itaxonomist relies on three things:

  1. People having an ORCID
  2. People having publications with DOIs (or otherwise easily identifiable) in their ORCID profile
  3. A map between DOIs (etc.) and the names in the nomenclators (ION, IPNI, Index Fungorum, ZooBank)

Implementation

Under the hood this builds part of the “biodiversity knowledge graph”, and uses ideas I and others have been playing around with (e.g., see David Shorthouse’s neat proof of concept http://collector.shorthouse.net/agent/0000-0002-7260-0350 and my now defunct Mendeley project http://iphylo.blogspot.co.uk/2011/12/these-are-my-species-finding-taxonomic.html).

For a subset of people and names we could build this very quickly. Some taxonomists already have ORCIDs, and some nomenclators have limited numbers of DOIs. I am currently building lists of DOIs for primary taxonomic literature, which could be used to seed the database.

The “i am a taxonomist” query is simply a map from ORCID to DOI to name in a nomenclator. The “i know a taxonomist” query is a map between your ORCID and a DOI that you share with a taxonomist, but where there are no names associated with that DOI (e.g., a paper you have co-authored with a taxonomist that wasn't on taxonomy, or at least didn't describe a new species). The “six degrees of taxonomy” query relies on the existence of open citation data, which is trickier, but some is available in PubMed Central and/or could be harvested from Pensoft publications.
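The first two queries are just joins over those maps; a toy sketch (the ORCIDs and DOIs below are made up):

```python
def taxonomists(orcid_to_dois, doi_to_names):
    """'i am a taxonomist': ORCIDs with at least one DOI that maps to a
    name in a nomenclator."""
    return {o for o, dois in orcid_to_dois.items()
            if any(doi_to_names.get(d) for d in dois)}

def knows_a_taxonomist(orcid, orcid_to_dois, doi_to_names):
    """'i know a taxonomist': shares a DOI with a taxonomist, where the
    shared paper itself has no names attached."""
    mine = set(orcid_to_dois.get(orcid, []))
    for other in taxonomists(orcid_to_dois, doi_to_names) - {orcid}:
        shared = mine & set(orcid_to_dois[other])
        if any(not doi_to_names.get(d) for d in shared):
            return True
    return False

# Toy data: person A described a species in d1, and co-wrote d2 with B
orcid_to_dois = {"0000-0000-0000-000A": ["d1", "d2"],
                 "0000-0000-0000-000B": ["d2"]}
doi_to_names = {"d1": ["Aus bus Smith, 2015"]}
```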

Friday, August 07, 2015

Testing the GBIF taxonomy with the graph database Neo4J

Neo4j

I've been playing with the graph database Neo4J to investigate aspects of the classification of taxa in GBIF's backbone classification. A number of people in biodiversity informatics have been playing with Neo4J. Nicky Nicolson at Kew has a nice presentation on using graph databases to handle names, Building a names backbone, and the Open Tree of Life project uses it in their tree machine.

One of the striking things about Neo4J is how much effort has gone into making it easy to play with. In particular, you can create GraphGists, which are simple text documents that are transformed into interactive graphs that you can query. This is fun, and I think it's also a great lesson in how to publicise a technology (compare this with RDF and SPARQL, which are in no way fun to work with).

I created some GraphGists that explore various problems with the current GBIF taxonomy. The goal is to find ways to quickly test the classifications for logical errors, and wherever possible I want to use just the information in the GBIF classification itself.

The first example is a version of the "papaya plots" that I played with in an earlier post (see also an unfinished manuscript, Taxonomy as impediment: synonymy and its impact on the Global Biodiversity Information Facility's database). For various reasons, GBIF has ended up with the same species occurring more than once in its backbone classification, usually because none of its source databases has enough information on synonymy to prevent this happening.

As an example, I've grabbed the classification for the bat family Molossidae, converted it to a Neo4J graph, and then tested for the existence of species in different genera that have the same specific epithet. This is a useful (but not foolproof) test of whether there are undetected synonyms, especially if the generic placement of a set of species has been in flux (this is certainly true for these bats). If you visit the gist you will see a list of species that are potential synonyms.
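The GraphGist does this in Cypher, but the same test is easy to express outside Neo4J. A sketch over a plain list of binomials (the checklist here is a toy example):

```python
from collections import defaultdict

def shared_epithets(checklist):
    """Find specific epithets that occur in more than one genus:
    candidates for undetected synonyms."""
    by_epithet = defaultdict(set)
    for binomial in checklist:
        genus, epithet = binomial.split(" ", 1)
        by_epithet[epithet].add(genus)
    return {e: g for e, g in by_epithet.items() if len(g) > 1}

# Toy checklist; generic placements of these bats have been in flux
species = ["Molossus ater", "Promops ater", "Molossus molossus"]
suspects = shared_epithets(species)
```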

A related test catches cases where one classification treats a taxon as a subspecies whereas another treats it as a full species, and GBIF has ended up with both interpretations in the same classification (e.g., the butterfly species Heliopyrgus margarita and the subspecies Heliopyrgus domicella margarita).

Another GraphGist tests that the genus name for a species matches the genus it is assigned to. This seems obvious (the species Homo sapiens belongs in the genus Homo), but there are cases where GBIF's classification fails this test, such as the genus Forsterinaria. Typically this test fails due to problematic generic names (e.g., homonyms), incorrect spellings, etc.

The last test is slightly more pedantic, but revealing nevertheless. It relies on the convention in zoology that when you write the authorship of a species name, if the name is not in the original genus then you enclose the authorship in parentheses. For example, it's Homo sapiens Linnaeus, but Homo erectus (Dubois, 1894) because Dubois originally called this species Pithecanthropus erectus.

Because you can only move a species to a genus that has been named, it follows that if a species was described before the genus name was published, then if the species is now in that newer genus the authorship must be in parentheses. For example, the lepidopteran genus Heliopyrgus was published in 1957, and includes the species willi Plötz, 1884. Since this species was described before 1957, it must have been originally placed in a different genus, and so the species name should be Heliopyrgus willi (Plötz, 1884). However, GBIF has this as Heliopyrgus willi Plötz, 1884 (no parentheses). The GraphGist tests for this, and finds several species of Heliopyrgus that are incorrectly formed. This may seem pedantic, but it has practical consequences. Anyone searching for the original description of Heliopyrgus willi Plötz, 1884 might think that they should be looking for the text string "Heliopyrgus willi" in literature from 1884, but the name didn't exist then and so the search will be fruitless.
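This test needs only the publication year of the genus, the year in each species' authorship, and whether that authorship is parenthesised. A sketch (the second example name and its authority are included purely for illustration):

```python
def missing_parentheses(genus_year, species):
    """Return names that violate the convention: described before the
    genus was published, yet authorship not in parentheses.
    `species` maps full name -> (year of authorship, parenthesised?)."""
    return [name for name, (year, paren) in species.items()
            if year < genus_year and not paren]

# Heliopyrgus was published in 1957
heliopyrgus = {
    "Heliopyrgus willi Plötz, 1884": (1884, False),   # should be (Plötz, 1884)
    "Heliopyrgus domicella (Erichson, 1849)": (1849, True),
}
bad = missing_parentheses(1957, heliopyrgus)
```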

I think there's a lot of scope for developing tests like these, including some that make use of external data as well. In an earlier post (A use case for RDF in taxonomy) I mused about using RDF to perform tests like this. However, Neo4J is so much easier to work with that I suspect it makes better sense to develop standard queries in its query language (Cypher) and use those.

Tuesday, August 04, 2015

Possible project: extract taxonomic classification from tags (folksonomy)

Note to self about a possible project. This PLoS ONE paper:

Tibély, G., Pollner, P., Vicsek, T., & Palla, G. (2013). Extracting Tag Hierarchies. PLoS ONE. http://doi.org/10.1371/journal.pone.0084133
describes a method for inferring a hierarchy from a set of tags (and cites related work that is of interest). I've grabbed the code and data from http://hiertags-beta.elte.hu/home/ and put it on GitHub.

Possible project

Use the Tibély et al. method (or others) on taxonomic names extracted from BHL text (or other sources) and see if we can reconstruct taxonomic classifications. How do these classifications compare to those in databases? Can we enhance existing databases using this technique (e.g., extract classifications from literature for groups poorly represented in existing databases)? This could be part of a larger study of what we can learn from the co-occurrence of taxonomic names, e.g. Automatically extracting possible taxonomic synonyms from the literature.
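As a crude stand-in for the Tibély et al. method, tag subsumption alone already recovers some hierarchy: tag A is inferred to be an ancestor of B if every document tagged B is also tagged A. A sketch (the "pages" and names are invented):

```python
from collections import defaultdict

def infer_ancestors(doc_tags):
    """Infer a tag hierarchy by subsumption: A is an ancestor of B if the
    documents tagged B are a strict subset of those tagged A. This is a
    crude stand-in for the Tibély et al. method."""
    docs_for = defaultdict(set)
    for doc, tags in doc_tags.items():
        for tag in tags:
            docs_for[tag].add(doc)
    ancestors = defaultdict(set)
    for child, cdocs in docs_for.items():
        for parent, pdocs in docs_for.items():
            if parent != child and cdocs < pdocs:
                ancestors[child].add(parent)
    return dict(ancestors)

# Toy "pages" tagged with taxonomic names, as might come from BHL text
pages = {
    "p1": {"Animalia", "Chordata", "Homo"},
    "p2": {"Animalia", "Chordata", "Mus"},
    "p3": {"Animalia", "Arthropoda"},
}
tree = infer_ancestors(pages)
```

Real BHL data would be far noisier, so a statistical method like Tibély et al.'s, which tolerates exceptions, would be needed in practice.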

Note to anyone reading this: if this project sounds interesting, by all means feel free to do it. These are just notes about things that I think would be fun/interesting/useful to do.