Friday, July 31, 2015

Towards a new BioStor

One of my pet projects is BioStor, which has been running since 2009 (gulp). BioStor extracts articles from the Biodiversity Heritage Library (details here: http://dx.doi.org/10.1186/1471-2105-12-187), and currently has over 110,000 articles, all open access. The site itself is showing its age, both in terms of performance and design, so I've wanted to update it for a while now. I made a demo in 2012 of BioStor in the Cloud, but other stuff got in the way of finishing it, and the service that it ran on (Pagodabox) released a new version of their toolkit, so BioStor in the Cloud died.

At last I've found the time to tackle this again, motivated in part because I've had to move BioStor to a new server, and its performance has been pretty poor. The next version of BioStor is currently sitting at http://biostor.gopagoda.io (the images and map views are good ways to enter the site). It's still being populated, and there is code to tweak, but it's starting to look good enough to use. It has a cleaner article display, built-in search (making things much more findable), support for citation styles using citeproc-js, and display of altmetrics (e.g., Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae)).

Once all the data has been moved across and I've cleaned up a few things I plan to make biostor.org point to this new version.

Tuesday, July 28, 2015

Modelling taxonomic names in databases

Quick notes on modelling taxonomic names in databases, as part of an ongoing discussion elsewhere about this topic.

Simple model

One model that is widely used (e.g., ITIS, WoRMS) and which is explicit in Darwin Core Archive is something like this:

[Figure: the single-table model]

We have a table for taxa and we don't distinguish between taxa and their names. The taxonomic hierarchy is represented by the parentID field, which points to your parent. If you don't have a (non-NULL) value for parentID you are not an accepted taxon (i.e., you are a synonym), and the field acceptedID points to the accepted taxon. Simple, fits in a single database table (or, let's be honest, an Excel spreadsheet).
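To make this concrete, here's a minimal sketch of the model in Python. The field names follow the diagram above; the data and the little resolver function are mine:

```python
# Single-table model: every row is simultaneously a name and a taxon.
# Field names (parentID, acceptedID) follow the model above; the data is invented.
taxa = {
    1: {"name": "Fabaceae",             "parentID": None, "acceptedID": None},  # root of this fragment
    2: {"name": "Poissonia heterantha", "parentID": 1,    "acceptedID": None},
    3: {"name": "Tephrosia heterantha", "parentID": None, "acceptedID": 2},
}

def accepted(taxon_id):
    """Follow acceptedID until we reach a row with no acceptedID (an accepted taxon)."""
    row = taxa[taxon_id]
    return taxon_id if row["acceptedID"] is None else accepted(row["acceptedID"])

print(taxa[accepted(3)]["name"])  # Tephrosia heterantha resolves to "Poissonia heterantha"
```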

The tradeoff is that you conflate names and taxa, so you can't easily describe name-only relationships (e.g., homonyms, nomenclatural synonyms) without inventing "taxa" for each name.

Separating names and taxa

The next model, which I've drawn rather clunkily below as if you were doing this in a relational database, is based on the TDWG LSID vocabularies. One day someone will explain why the biodiversity informatics community basically ignored this work, despite the fact that all the key nomenclators use it.

[Figure: the separated names and taxa model]

In this model we separate out names as first-class objects with globally unique identifiers. The taxa table refers to the names table when it mentions a name. Any relationships between names are handled separately from taxa, so we can easily handle things like replacement names for homonyms, basionyms, etc. Note that we can also remove a lot of extraneous stuff from the taxa table. For example, if we decide that Poissonia heterantha is the accepted name for a taxon, we don't need to create taxa for Coursetia heterantha or Tephrosia heterantha, because by definition those names are synonyms of Poissonia heterantha.
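Here's the same example as a minimal sketch under this model. The two-table split follows the diagram; the relationship labels are informal stand-ins for the terms a vocabulary like TDWG's would define:

```python
# Names are first-class objects; purely nomenclatural facts attach to names.
names = {
    "n1": "Poissonia heterantha",
    "n2": "Coursetia heterantha",
    "n3": "Tephrosia heterantha",
}
# Name-to-name relationships, independent of any taxon.
name_relations = [
    ("n2", "nomenclaturalSynonymOf", "n1"),
    ("n3", "nomenclaturalSynonymOf", "n1"),
]
# Only the accepted name needs a taxon record; the synonyms need no "taxa" at all.
taxa = {
    "t1": {"nameID": "n1", "parentTaxonID": None},
}
print([names[a] for a, rel, b in name_relations if b == "n1"])
```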

The other great advantage of this model is that it enables us to use the work the nomenclators have done directly, without having to first shoe-horn it into the Darwin Core format, which assumes that everything is a taxon.

Monday, July 27, 2015

The Biodiversity Data Journal is not machine readable

In my previous post I discussed the potential for the Biodiversity Data Journal (BDJ) to be a venue for nano- (or near-nano-) publications. In this post I want to draw attention to what I think is a serious stumbling block, which is the lack of machine readable statements in the journal.

Given that the journal is probably the most progressive in the field (indeed, I suspect that there are few journals in any field as advanced in publishing technology as BDJ) this may seem an odd claim to make. The journal provides XML of its text, and typically provides data in Darwin Core Archive format, which is harvested by GBIF. The article XML is marked up to flag taxonomic names, localities, etc. Surely this is the very definition of machine readable?

The problem becomes apparent when you ask "what claims or assertions are the papers making?", and "how are those assertions reflected in the article XML and/or the Darwin Core Archive?".

For example, consider the following paper:

Gil-Santana, H., & Forero, D. (2015, June 16). Aristathlus imperatorius Bergroth, a newly recognized synonym of Reduvius iopterus Perty, with the new combination Aristathlus iopterus (Perty, 1834) (Hemiptera: Reduviidae: Harpactorinae). BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.3.e5152

The title gives the key findings of this paper: Aristathlus imperatorius = Reduvius iopterus, and Reduvius iopterus = Aristathlus iopterus. Yet these statements are nowhere to be found in the Darwin Core Archive for the paper, which simply lists the name Aristathlus iopterus. The XML markup flags terms as names, but says nothing about the relationships between the names.

Here is another example:

Starkevich, P., & Podenas, S. (2014, December 30). New synonym of Tipula (Vestiplex) wahlgrenana Alexander, 1968 (Diptera: Tipulidae). BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.2.e4237
Indeed, I've yet to find an example of a paper in BDJ where a synonymy asserted in the text is reflected in the Darwin Core Archive!

The issue here is that neither the XML markup nor the associated data files are capturing the semantics of the paper, in the sense of what the paper is actually saying. The XML and DwCA files capture (some of) the names and localities mentioned, but not the (arguably) most crucial new pieces of information.

There is a disconnect between what the papers are saying (which a human reader can easily parse) and what the machine-readable files are saying, and this is worrying. Surely we should be ensuring that the Darwin Core Archives and/or XML markup are capturing the key facts and/or assertions made by the paper? Otherwise databases downstream will remain none the wiser about the new information the journal is publishing.

Nanopublications and annotation: a role for the Biodiversity Data Journal?


I stumbled across this intriguing paper:

Do, L., & Mobley, W. (2015, July 17). Single Figure Publications: Towards a novel alternative format for scholarly communication. F1000Research. F1000 Research, Ltd. http://doi.org/10.12688/f1000research.6742.1
The authors are arguing that there is scope for a unit of publication between a full-blown journal article (often not machine readable, but readable) and the nanopublication (a single, machine readable statement, not intended for people to read), namely the Single Figure Publication (SFP):

The SFP, consisting of a figure, the legend, the Material and Methods section, and an optional Results/Discussion section, reduces the unit of publication to a more tractable size. Importantly, it results in a markedly decreased time from data generation to publication. As such, SFPs represent a new means by which to communicate scientific research. As with the traditional journal article, the content of the SFPs is readily understandable by the scientist. Coupled with additional tools that aid in structuring content (e.g. describing in detail the methods using pre-defined steps from protocols), the SFP represents a “bottom-up” means by which scholars can structure the content of their findings in a modular and piece-wise fashion wedded to everyday laboratory life.
It seems to me that this is something that the Biodiversity Data Journal is potentially heading towards. Some of the papers in that journal are short, reporting, say, new occurrence records for a single species, e.g.:

Ang, Y., Rohner, P., & Meier, R. (2015, June 26). Across the Baltic: a new record for an enigmatic black scavenger fly, Zuskamira inexpectata (Pont, 1987) (Sepsidae) in Finland. BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.3.e4308

Imagine if we have even shorter papers that are essentially a series of statements of fact, or assertions (linked to supporting evidence). These could potentially be papers that annotated and/or clarified data in an external database, such as GBIF. For example, let's imagine we find two names in GBIF that GBIF treats as being different taxa, but a recent publication asserts are actually synonyms. We could make that information machine readable (say, using Darwin Core Archive format), link it to the source(s) of the assertion (i.e., the DOI of the paper making the synonymy), then publish that as a paper. As the Darwin Core Archive is harvested by GBIF, GBIF then has access to that information, and when the next taxonomic indexing occurs it can make use of that information.
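As a sketch of what the payload of such a paper could look like, here's a minimal Darwin Core taxon table asserting that one name is a synonym of another, citing the source of the assertion. The terms taxonID, scientificName, taxonomicStatus, acceptedNameUsageID and nameAccordingTo are real Darwin Core terms; the names, identifiers and DOI below are placeholders:

```python
import csv
import sys

# Sketch of a machine-readable synonymy assertion as a Darwin Core taxon table.
# One row for the accepted name, one for the synonym pointing at it, both citing
# the publication that made the synonymy.
writer = csv.writer(sys.stdout)
writer.writerow(["taxonID", "scientificName", "taxonomicStatus",
                 "acceptedNameUsageID", "nameAccordingTo"])
writer.writerow(["t1", "Aus bus Smith, 1900", "accepted", "",
                 "http://doi.org/10.xxxx/placeholder"])
writer.writerow(["t2", "Xus bus Jones, 1950", "synonym", "t1",
                 "http://doi.org/10.xxxx/placeholder"])
```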

One reason for having these "micropublications" is that sometimes resolving an issue in a dataset can take some time. I've often found errors in databases and have ended up spending a couple of hours finding names, literature, etc. to figure out what is going on. As fun as that is, in a sense it's effort that is wasted if it's not made more widely available. But if I can wrap that couple of hours scholarship into a citable unit, publish it, and have it harvested and incorporated into, say, GBIF, then the whole exercise seems much more rewarding. I get credit for the work, and GBIF users get (hopefully) a tiny bit of improvement, and they can see the provenance of that improvement (i.e., it is evidence-based).

This seems like a simple mechanism for providing incentives for annotating databases. In some ways the Biodiversity Data Journal could be thought of as doing this already, however as I'll discuss in the next blog post, there's an issue that is preventing it being as useful as it could be.

Thursday, July 23, 2015

Purposeful Games and the Biodiversity Heritage Library


These are some quick thoughts on the games on the BHL site, part of the Purposeful Gaming and BHL Project. As mentioned on Twitter, I had a quick play of the Beanstalk game and got bored pretty quickly. I should stress that I'm not a gamer (although my family includes at least one very serious gamer, and a lot of casual players). Personally, if I'm going to spend a large amount of time with a computer I want to be creating something, so gaming seems like a big time sink. Hence, I may not be the best person to review the BHL games. Anyhow...

It seems to me that there are a couple of ways games like this might work:

  1. You want to complete the game so you can do something you really want to do. This is the essence of solving CAPTCHAs, I'll solve your stupid puzzle so that I can buy my tickets.
  2. The game itself is engaging, and what you are asked to do is a natural part of the game's world. When you swipe candy, bits of candy may explode, or fall down (this is a good thing, apparently), or when you pull back a slingshot and release the bird, you get to break stuff.

The BHL games are trying to get you to do one activity (type in the text shown in a fragment of a BHL book) and this means, say, a tree grows bigger. To me this feels like a huge disconnect (cf. point 2 above): there is no connection between what I'm doing and the outcome.

Worse, BHL is an amazing corpus of text and images, and this is almost entirely hidden from me. If I see a cool looking word, or some old typeface, there's no way for me to dig deeper (what text did that come from? what does that phrase mean?). I get no sense of where the words come from, or whether I'm doing anything actually useful. For things like reCAPTCHA (where you helped OCR books) this doesn't matter because I don't care about the books, I want my tickets. But for BHL I do care (and BHL should want at least some of the players to care as well).

So, remembering that I'm not a gamer, here are some quick ideas for games.

Find that species

One reason BHL is so useful is it contains original taxonomic descriptions. Sometimes the OCR is too poor for the name to be extracted from the description. Imagine a game where the player has a list of species (with cute pictures) and is told to go find them in the text. Imagine that we have a pretty good idea where they are (from bibliographic data we could, for example, know the page the name should occur on); the player hunts for the word on the page, and marks it when they find it. BHL then gets corrected text and confirmation that the name occurs on that page. Players could select taxa (e.g., birds, turtles, mosses) that they like.

Find lat/longs

BHL text is full of lat/long pairs, but often the OCR is not quite good enough to extract them. Imagine that we can process BHL to find things that look like lat/long pairs. Imagine that we can read enough of the text to get a sense of where in the world the text refers to. Now, have a game where we pick a spot on a map and find things related to that spot. Say we get presented with OCR text that may refer to that locality, we fix it, and the map starts to get populated. A bit like Yelp and Foursquare, we could imagine badges for the most articles found about a place.

Find the letter/font

There are lots of cool symbols and fonts in BHL that someone might be interested in collecting. Simple things might be diphthongs such as æ. Older BHL texts are full of these, often misinterpreted. Other examples are male and female symbols. Perhaps we could have a game where we try and guess what symbol the OCR text actually matches - in other words, show the OCR text first, the player tries to guess the actual symbol, then the image appears, and then the player types in the actual symbol. The goal is to get good at predicting OCR errors.

Games like this would really benefit if the player could see (say, on the side) the complete text. Imagine that you correct a word, then you see it comes from a gorgeous plate of a bird. Imagine you could then correct any of the other words on that page.

Word eaters

Imagine the player is presented with a page with text and, a bit like Minecraft's monsters, things appear which start to "eat" the words. You need to check as many words as possible before the text is eaten. Perhaps structure things in such a way that checked words form a barrier to the word-eating creatures and buy you some time, or, like Minecraft, fixing a bad OCR word blasts a radius free of the word eaters. As an option (again, like Minecraft) turn off the eaters and just correct the words at your leisure.

Countdown

Based on the UK game show, present a set of random letters (as images); the player makes the longest word they can, which is then checked against a dictionary. This tells you what letters they think the images represent.

Falling words

Have page fragments fall from the top of the screen, and have a key word displayed (say, "sternum", or enable player to type a word in) then display images of words whose OCR text resembles this (in other words, have a bunch of OCR text indexed using methods that allow for errors). As the word images fall, the player taps on an image that matches the word and they are collected. Maybe add features such as a timeline to show when the word was used (i.e., the date of the BHL text), give the meaning of the word, lightly chide players who enter words like "f**k" (that'd be me), etc.

Summary

Like comedy, I imagine that designing games is really, really hard. But the best games I've seen create a world that the player is immersed in and which makes sense within the rules of that world. Regardless of whether these ideas are any good, my concern is that the BHL games seem completely divorced from context, and the game play bears no relation to outcomes in the game.

Wednesday, July 22, 2015

Altmetrics, Disqus, GBIF, JSTOR, and annotating biodiversity data

Browsing JSTOR's Global Plants database I was struck by the number of comments people have made on individual plant specimens. For example, for the holotype of Scorodoxylum hartwegianum Nees (K000534285) there is a comment from Håkan Wittzell that the "Collection number should read 1269 according to Plantae Hartwegianae". In JSTOR the collection number is 1209.

Now, many (if not all) of these specimens will also be in GBIF. Indeed, K000534285 is in GBIF as http://www.gbif.org/occurrence/912442645, also with collection number 1209. A GBIF user will have no idea that there is some doubt about one item of metadata about this specimen.

So, an obvious thing to do would be to make the link between the JSTOR and GBIF records. Implementing this would need some fussing because (sigh) unlike DOIs for articles we don't have agreed-upon identifiers for specimens. So we'd need to do some mapping between the specimen barcode K000534285, the JSTOR URL http://plants.jstor.org/stable/10.5555/al.ap.specimen.k000534285, and the GBIF record http://www.gbif.org/occurrence/912442645.

In addition to providing users with more information, it might also be useful in kickstarting annotation on the GBIF site. At the moment GBIF has no mechanism for annotating data, and if it did, then it would have to start from scratch. Imagine that a person visiting occurrence 912442645 sees that it has already attracted attention elsewhere (e.g., JSTOR). They might be encouraged to take part in that conversation (because at least one person cared enough to comment already). Likewise, we could feed annotations on the GBIF site to JSTOR.

A variation on this idea is to think of annotations such as those in the JSTOR database as being analogous to the tweets, blog posts, and bookmarking that Altmetric tracks for academic papers. Imagine if we applied the same logic to GBIF and had a way to show users that a specimen has been commented on in JSTOR Plants? Thinking further down the track, we could imagine adding other sorts of "attention", such as citations by papers, vouchers for DNA sequences, etc.

It would be a fun project to see whether the Disqus API enabled us to create a tool that could match JSTOR Global Plants comments to GBIF occurrences.
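As a rough sketch of what that tool might look like (the GBIF occurrence search endpoint is real; the Disqus forum shortname, the API key, and the assumption that the specimen barcode can be pulled from the thread URL are all mine):

```python
import re
import requests

# Sketch: pull comment threads from a Disqus forum, extract the specimen barcode
# from each JSTOR thread URL, then look the barcode up in GBIF.
DISQUS_KEY = "YOUR_API_KEY"   # hypothetical placeholder
FORUM = "globalplants"        # assumed forum shortname

threads = requests.get(
    "https://disqus.com/api/3.0/threads/list.json",
    params={"forum": FORUM, "api_key": DISQUS_KEY, "limit": 100},
).json()["response"]

for thread in threads:
    # e.g. http://plants.jstor.org/stable/10.5555/al.ap.specimen.k000534285
    m = re.search(r"specimen\.([a-z0-9]+)$", thread["link"])
    if not m:
        continue
    barcode = m.group(1).upper()   # k000534285 -> K000534285
    hits = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"catalogNumber": barcode},
    ).json()["results"]
    for occ in hits:
        print(barcode, "->", "http://www.gbif.org/occurrence/%s" % occ["key"])
```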

Steve Baskauf on RDF and the "Rod Page Challenge"

Steve Baskauf has concluded a thoughtful series of blog posts on RDF and biodiversity informatics with http://baskauf.blogspot.co.uk/2015/07/confessions-of-rdf-agnostic-part-7.html. In this post he discussed the "Rod Page Challenge", which was a series of grumpy posts I wrote (starting with this one) where I claimed RDF basically sucked, and to illustrate this I issued a challenge for people to do something interesting with some RDF I provided. Since this RDF didn't have a stable home I've put it on GitHub and it has a DOI http://dx.doi.org/10.5281/zenodo.20990 courtesy of GitHub's integration with Zenodo.

I argued that the RDF typically available was basically useless because it wasn't adequately linked (see Reflections on the TDWG RDF "Challenge"). Two of the RDF files I provided were created specifically to tackle this problem (derived from my projects iPhylo Linkout http://dx.doi.org/10.1371/currents.RRN1228 and the precursor to BioNames http://dx.doi.org/10.7717/peerj.190). This marked pretty much the end of any interest I had in pursuing RDF.

Towards the end of Steve's post he writes:

At the close of my previous blog post, in addition to revisiting the Rod Page Challenge, I also promised to talk about what it would take to turn me from an RDF Agnostic into an RDF Believer. I will recap the main points about what I think it will take in order for the Rod Page Challenge to REALLY be met (i.e. for machines to make interesting inferences and provide humans with information about biodiversity that would not be obvious otherwise):

  1. Resource descriptions in RDF need to be rich in triples containing object properties that link to other IRI-identified resources.
  2. "Discovery" of IRI-identified resources is more likely to lead to interesting information when the linked IRIs are from Internet domains controlled by different providers.
  3. Materialized entailed triples do not necessarily lead to "learning" useful things. Materialized entailed triples are useful if they allow the construction of more clever or meaningful queries, or if they state relationships that would not be obvious to humans.

Steve's point 1 is essentially the point I was making with the challenge. At the time of the challenge, RDF from major biodiversity informatics projects was in silos, with few (if any) links to external resources (the kinds of things Steve refers to in his point 2). As a result, the promised benefits from RDF simply haven't materialised. The lesson I took from this is that we need rich, dense cross-links between data sources (the "biodiversity knowledge graph"), and that's one reason I've been obsessed with populating BioNames, which links animal names to the primary literature (I'm planning to extend this to plants as well). Turns out, creating lots of cross links is really hard work, much harder than simply pumping out a bunch of RDF and waiting for it to automagically coalesce into an all-connected knowledge graph.

I posed the challenge back in 2011, and since then I think the landscape has changed to the extent that I wonder if trying to "fix" RDF is really the way forward.

XML is dead

Anyone (sane) developing for the web and wanting to move data around is using JSON; XML is hideous and best avoided. Much of the early work on RDF used XML, which only made things even harder than they already were. JSON beats XML, to the extent that RDF itself now has a JSON serialisation, JSON-LD. But JSON-LD is about more than the semantic web (see JSON-LD and Why I Hate the Semantic Web), and has the great advantage that you can actually ignore all the RDF cruft (i.e., the namespaces) and simply treat the data as key-value pairs (yay!). Once you do that, then you can have fun with the data, especially with databases such as CouchDB ("fun" and "database" in the same sentence, I know!).
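A trivial example of the point: a JSON-LD document is just JSON, so you can ignore the @context entirely and treat it as key-value pairs (the document below is made up; the Darwin Core URI is real):

```python
import json

# The @context carries the RDF semantics, but nothing stops us ignoring it and
# treating the document as plain JSON.
doc = json.loads("""
{
  "@context": {"name": "http://rs.tdwg.org/dwc/terms/scientificName"},
  "@id": "urn:lsid:example:123",
  "name": "Pinnotheres atrinicola"
}
""")
print(doc["name"])  # plain key-value access, no RDF machinery needed
```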

Key-value pairs, document stores, and graph databases

The NoSQL "movement" has thrown up all sorts of new ways to handle data and to think about databases. We can think of RDF as describing a graph, but it carries the burden of all the namespaces, vocabularies, and ontologies that come with it. Compare that with the fun (there's that word again) of graph databases such as Neo4J with its graph gists. The Neo4J folks have made a great job of publicising their approach, and making it easy and attractive to play with.

So, we're in an interesting time when there are a bunch of technologies available, and I think maybe it's time to ask whether the community's allegiance to RDF and the Semantic Web has been somewhat misplaced...

Thursday, June 25, 2015

Biodiversity Data Journal data lost on the way to GBIF and EOL

Two ongoing challenges in biodiversity informatics are getting data into a form that is usable, and linking that data across different projects and platforms. A recent and interesting approach to this problem is "data journals", as exemplified by the Biodiversity Data Journal. I've been exploring some data from this journal that has been aggregated by GBIF and EOL, and have come across a few issues. In this post I'll first outline the standard format for moving data between biodiversity projects, the Darwin Core Archive, then illustrate some of the pitfalls.

Darwin Core Archive

Firstly a quick digression on the Darwin Core Archive format, which has a few gotchas for newcomers to the format (such as myself). The Darwin Core Archive supports a "star schema" like this.

[Figure: the Darwin Core Archive star schema]

At the centre of the star is a table containing data either about taxa or occurrences. We can have additional tables with other sorts of data, and we also have a meta.xml file which tells us what all the data columns are and how the different tables are related to the core table.

For example, if we have taxa as our core, then we can have a table like this where each taxon has a unique taxon_id:

| taxon_id | taxon stuff |
| 1 | stuff |
| 2 | stuff |
| 3 | stuff |

Now, imagine that we have a reference for each of these taxa (say it's the paper that originally described these species). Then we could add a unique identifier for that reference, reference_id, to the taxon table:

| taxon_id | reference_id | taxon stuff |
| 1 | a | stuff |
| 2 | a | stuff |
| 3 | a | stuff |

Now, if we were building a relational database we could have a separate table for the references, and link the two tables using the reference_id as a primary key for the references and as a foreign key in the taxon table, like this:

| reference_id | reference stuff |
| a | reference |

This means that we need only have the reference stored once, which means there's no redundancy. If we need to update the reference data, we only need to do it once.

However, this is not how Darwin Core Archive works. Because it's a star schema, we need to have a references table like this:

| reference_id | taxon_id | reference stuff |
| a | 1 | reference |
| a | 2 | reference |
| a | 3 | reference |

Note that we have added the taxon_id to link the reference to each taxon, and that the same reference occurs three times (once for each taxon it refers to), hence we have redundancy. Note also that if we don't include the taxon_id key then there's no way for a Darwin Core Archive reader to link the reference to the corresponding taxa (we'll come back to this below).
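In other words, a Darwin Core Archive reader is essentially doing the join below (a sketch; in a real archive the file and column names come from meta.xml):

```python
import csv
from collections import defaultdict

# What a Darwin Core Archive reader effectively does with the star schema: group
# extension rows (references) by the core id, then attach them to core rows (taxa).
# File and column names follow the tables above; meta.xml would normally supply them.
with open("taxon.csv") as f:
    taxa = {row["taxon_id"]: row for row in csv.DictReader(f)}

refs_by_taxon = defaultdict(list)
with open("references.csv") as f:
    for row in csv.DictReader(f):
        refs_by_taxon[row["taxon_id"]].append(row)  # drop this key and the link is lost

for taxon_id, taxon in taxa.items():
    print(taxon_id, taxon, refs_by_taxon.get(taxon_id, []))
```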

I've said that the references are in their own table. In fact, we can have everything in one big table, and use the meta.xml file to tell a Darwin Core Archive reader to process that same table but extract different data each time (the Mammal Species of the World checklist http://doi.org/10.15468/csfquc is an example of this). Hence, we could extract taxon_id and taxon stuff for the taxa, then reference_id and reference stuff for the references.

| taxon_id | reference_id | taxon stuff | reference stuff |
| 1 | a | stuff | reference |
| 2 | a | stuff | reference |
| 3 | a | stuff | reference |

The other thing to remember is that the meta.xml file is responsible for describing the data. It does this in two ways (1) it defines the type of data a given table contains (e.g., taxa, occurrence, image, etc.), and (2) it defines what each column in the data represents, using a controlled vocabulary.

The type of data each table contains is defined by a URI, and the list of these "registered extensions" is available from GBIF. The two "core" extensions are for taxa and occurrences, the two things GBIF primarily deals with, while the other extensions enable richer data to be added. Of course, a Darwin Core Archive consumer that doesn't understand these extensions can simply ignore them. Rather unfortunately, some extensions, such as the EOL media and references extensions, overlap with the GBIF multimedia and references extensions. Hence, if you have, say, images or bibliographic data, you have two extensions to choose from. If you choose EOL's then EOL will import your data, but GBIF won't. Furthermore, the extensions vary in richness. If you have bibliographic data then GBIF's vocabulary for references looks sparse, lacking many of the fields one might expect, whereas EOL's is quite rich.

Problems with Biodiversity Data Journal and GBIF

With that background, let's take a look at what happens to Biodiversity Data Journal (BDJ) data once it enters GBIF. For example, the species Eupolybothrus cavernicolus, described using "transcriptomic, DNA barcoding and micro-CT imaging data" (http://dx.doi.org/10.3897/BDJ.1.e1013). Data from this paper is in GBIF as both an occurrence dataset (http://doi.org/10.15468/zpz4ls) and checklist dataset (http://doi.org/10.15468/rpavbl).

Images

The checklist dataset includes both media and references. The images don't appear in GBIF, but are visible in EOL (e.g., http://eol.org/data_objects/26558840, shown below):

Because the media type is set to a type (http://eol.org/schema/media/Document) that only EOL recognises, GBIF doesn't harvest the images, and hence misses out on all this extra multimedia goodness.

References

The references in the BDJ dataset don't appear in either GBIF or EOL (see http://eol.org/pages/38177334/literature). Presumably they don't appear in GBIF because BDJ uses EOL's extension, but why don't they appear in EOL? Looking at the raw data, the references.csv file in the Darwin Core Archive lacks the coreid field needed to link the references to the corresponding taxon (the field is defined in the meta.xml file, but there is no corresponding column in the references.csv file). Looking at other BDJ Darwin Core Archives this seems to be a common problem.

Map

Strangely the BDJ paper shows a map with a point locality, but the same data in GBIF does not (see http://doi.org/10.15468/zpz4ls).

A look at the occurrences.csv shows that the file has verbatim latitude and longitude but not decimal versions of the coordinates, which is what GBIF uses to locate records on the map. So the BDJ data set isn't contributing any geographical data. Clearly a lot of BDJ data is georeferenced (see map), but not this example.
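For what it's worth, the missing step is not hard; here's a sketch that converts one common verbatim degrees-minutes-seconds format into the decimal value GBIF needs (real verbatim fields are, of course, much messier than this single pattern):

```python
import re

# Convert a verbatim coordinate such as 44°02'12"N into the decimalLatitude /
# decimalLongitude values GBIF maps from. Handles one DMS format only.
def dms_to_decimal(verbatim):
    m = re.match(r"""(\d+)°\s*(\d+)'\s*([\d.]+)"\s*([NSEW])""", verbatim.strip())
    if m is None:
        return None
    deg, mins, secs, hemi = m.groups()
    value = float(deg) + float(mins) / 60 + float(secs) / 3600
    return -value if hemi in "SW" else value

print(dms_to_decimal('44°02\'12"N'))  # 44.0366...
```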

Taxa

The centipede Eupolybothrus cavernicolus is not in GBIF's backbone classification. This is a common issue, especially with newly described taxa. GBIF does not have access to recent nomenclatural data, and so even though the BDJ data comes with a ZooBank LSID urn:lsid:zoobank.org:act:6F9A6F3C-687A-436A-9497-70596584678C for the name Eupolybothrus cavernicolus, GBIF itself doesn't know about it, and so if you do a default search on the name Eupolybothrus cavernicolus you get only the genus.

Summary

Here are the issues I uncovered after a little bit of messing about:

  1. BDJ Darwin Core Archives don't use the extensions recognised by GBIF.
  2. BDJ references lack the coreid for the taxa/occurrences and hence are not ingested by Darwin Core readers.
  3. BDJ does not seem to parse and interpret verbatim coordinates when generating Darwin Core Archives.
  4. GBIF doesn't support the extensions output by BDJ.
  5. GBIF's references extension is woefully inadequate for handling bibliographic metadata.
  6. GBIF's list of taxonomic names is woefully out of date.

What both puzzles and frustrates me is that a much trumpeted collaboration between these projects has significant problems which seem to have gone undetected. It seems as if it is enough to have a pipeline between a data journal and a project, without actually testing whether that pipeline loses or misrepresents the data. In some cases, very little of the data in a BDJ archive actually makes it into GBIF, which is wasteful and rather defeats the point of having a data journal to database pipeline in the first place.

Wednesday, June 24, 2015

Thoughts on ReCon 15: DOIs, GitHub, ORCID, altmetric, and transitive credit

I spent last Friday and Saturday at ReCon 15 (Research in the 21st Century: Data, Analytics and Impact, hashtag #ReCon_15) in Edinburgh. Friday 19th was conference day, followed by a hackday at CodeBase. There's a Storify archive of the tweets so you can get a sense of the meeting.

Sitting in the audience a few things struck me.

  1. No identifier wars, DOIs have won and are everywhere.
  2. GitHub is influencing the way we do science, but we've much still to learn.
  3. ORCIDs are gaining traction.
  4. Nobody really understands "impact".

GitHub

GitHub is becoming more and more important, not only as a repository of scientific code and data, but as a useful model of the sorts of things we need to be doing. Arfon Smith gave a fascinating talk on GitHub. Apart from the obvious things such as version control, Arfon discussed the tools and mindset of open source programmers, and how that could be applied to scientific data. For example, software on GitHub is often automatically tested for bugs (and GitHub displays a badge saying whether things are OK). Imagine doing this for a data set, having it automatically checked for errors and/or internal consistency. Reproducibility is a big topic in science, but open source software has to be reproducible by default in the sense that it has to be able to be downloaded and compiled on a user's computer. These are just a couple of the things Arfon covered, see his slides for more.

Transitive Credit

One idea which particularly struck me was that of "transitive credit":

Katz, D. S. (2014, February 10). Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. JORS. Ubiquity Press, Ltd. http://doi.org/10.5334/jors.be

From the above paper:

The idea of transitive credit is as follows: The credit map for product A, which is used by product B, feeds into the credit map for product B. For example, product A is a software package equally written by two authors and its credit map is that 50 percent of the credit for this should go the lead developer, 20 percent to the second developer, and 10 percent to the third developer. In addition, 5 percent should go to each of the four libraries that are needed to run the code. When this product is created and registered, this credit map is registered along with it. Product B is a paper that obtains new science results, and it depended on Product A. The person who registers the publication also registers its credit map, in this case 75 percent to her/himself, and 25 percent to the software code previous mentioned. Credit is now transitive, in that the lead software developer of the code can be given credit for 12.5 percent of the paper. If another paper is later written that extends the product B paper and gives 10% credit to that paper, the lead software package developer will also have 1.25% credit for the new paper.
The idea of being able to track credit across derived products is interesting, and is especially relevant to projects such as GBIF, where users can download large datasets that are themselves aggregations of data from numerous different providers (making it easy to calculate the relative contributions of each provider). If we then track citations of that data (and citations of those citations) we could give data providers a better estimate of the actual impact of their data.
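The credit map idea is easy to prototype. Here's a toy implementation using the numbers from the quoted example (the product and people names are placeholders):

```python
# Toy version of transitive credit: each product declares a credit map, and credit
# flows through dependencies by multiplication. Numbers follow the quoted example.
credit_maps = {
    "software_A": {"lead_dev": 0.5, "dev2": 0.2, "dev3": 0.1,
                   "lib1": 0.05, "lib2": 0.05, "lib3": 0.05, "lib4": 0.05},
    "paper_B": {"author": 0.75, "software_A": 0.25},
    "paper_C": {"author2": 0.9, "paper_B": 0.1},
}

def transitive_credit(product):
    """Flatten a credit map by recursively expanding products into people."""
    flat = {}
    for part, share in credit_maps[product].items():
        if part in credit_maps:                      # a product: recurse into it
            for person, sub in transitive_credit(part).items():
                flat[person] = flat.get(person, 0) + share * sub
        else:                                        # a person (or other leaf)
            flat[part] = flat.get(part, 0) + share
    return flat

print(transitive_credit("paper_C")["lead_dev"])  # 0.1 * 0.25 * 0.5 = 0.0125
```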

Impact

Euan Adie of Altmetric talked about "impact", and remarked on an example of a paper being cited in a policy document, this being picked up by Altmetric and seen by the authors of the paper, who had no idea that their work had influenced a policy document. This raises some intriguing possibilities, related to the idea of "transitive credit" above.

In building BioNames I've added the ability to show Altmetric "donuts" and I'm struck by examples like this one (see also the reference in BioNames):

JENKINS, P. D., & ROBINSON, M. F. (2002, June). Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae). Bulletin of The Natural History Museum. Zoology Series. Cambridge University Press (CUP) doi:10.1017/S0968047002000018

This paper has no recent "buzz" (e.g., Twitter, Facebook, Mendeley) but is cited on three Wikipedia pages. So, this paper has impact, albeit not in social media. Many papers like this will slip below the social media radar but will be used by various databases and may contribute to subsequent work. Perhaps we could expand altmetrics sources of information to include some of those databases. For example, if a paper has been aggregated/cited by a major database (such as GBIF) then it would be nice to see that on the Altmetric donut. For authors this gives them another example of the impact of their work, but for the databases it's also an opportunity to increase engagement (if people have relevant work that doesn't appear in the donut they can take steps to have that work included in the aggregation). Obviously there are issues about what databases to count as providing signal for altmetrics, but there's scope here to broaden and quantify our notion of impact.

Hackday

The ReCon hackday was a pretty informal event held at CodeBase just down from Edinburgh Castle, apparently the largest start-up incubator in the European tech scene. It was a pretty amazing place, and a great venue for a hackday. I spent the day looking at the ORCID API and seeing if I could create some mashups with Journal Map and my own BioNames. One goal was to see if we could generate a map of a researcher's study sites starting with their ORCID, using ORCID's API to retrieve a list of their publications, then talking to the Journal Map API to get point localities for those papers. The code worked, but the results were a little disappointing because Jim Caryl and I were focussing on University of Glasgow researchers, and they had few papers in Journal Map. The code, such as it is, is in GitHub.
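For the record, the first step of that pipeline looks something like this. The ORCID public API endpoint shown is the current v3.0 one (the hackday used an earlier version), and the response parsing reflects that JSON as I understand it, so treat this as a sketch:

```python
import requests

# Get the DOIs for a researcher's works from the ORCID public API.
ORCID = "0000-0002-7101-9767"   # an example ORCID iD

r = requests.get(
    "https://pub.orcid.org/v3.0/%s/works" % ORCID,
    headers={"Accept": "application/json"},
)
dois = []
for group in r.json().get("group", []):
    for summary in group.get("work-summary", []):
        for ext in summary.get("external-ids", {}).get("external-id", []):
            if ext.get("external-id-type") == "doi":
                dois.append(ext.get("external-id-value"))
print(dois)   # these DOIs would then be looked up in Journal Map
```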

My original idea was to focus on BioNames, and see how many authors of taxonomic papers had ORCIDs. Initial experiments seemed promising (see GitHub for code and data). Time was limited, so I got as far as building lists of DOIs from BioNames and discovering the associated ORCIDs. The next steps would be (a) providing ORCID login to BioNames, and (b) using ORCID to help cluster author name strings in BioNames. Still much to do.

I've not been to many hackdays/hackathons, but I find them much more rewarding than simply sitting in a lecture theatre and listening to people talk. Combining both types of meeting is great, and I look forward to similar events in the future.

Visualising Geophylogenies in Web Maps Using GeoJSON

I've published a short note on my work on geophylogenies and GeoJSON in PLoS Currents Tree of Life:

Page R. Visualising Geophylogenies in Web Maps Using GeoJSON. PLOS Currents Tree of Life. 2015 Jun 23 . Edition 1. doi:10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e.
At the time of writing the DOI hasn't been registered, so the direct link is here. There is a GitHub repository for the manuscript and code.
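To give a flavour of the approach, this is roughly the kind of GeoJSON involved: tips as Points and tree edges as LineStrings, so any web map that understands GeoJSON can draw the geophylogeny. This is a hand-made sketch; the tool's actual output may differ in detail:

```python
import json

# A geophylogeny as a GeoJSON FeatureCollection: two tips and the edge joining
# them via their ancestor. Coordinates and labels are invented.
geophylogeny = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [174.8, -36.9]},
         "properties": {"label": "tip A"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [172.6, -43.5]},
         "properties": {"label": "tip B"}},
        {"type": "Feature",  # internal edge joining the two tips
         "geometry": {"type": "LineString",
                      "coordinates": [[174.8, -36.9], [173.7, -40.2], [172.6, -43.5]]},
         "properties": {"label": "edge"}},
    ],
}
print(json.dumps(geophylogeny, indent=2))
```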

I chose PLoS Currents Tree of Life because it is (supposedly) quick and cheap. Unfortunately a perfect storm of delays in reviewing together with licensing issues resulted in the paper taking nearly three months to appear. The licensing issues were a headache. PLoS uses the Creative Commons CC-BY license for all its content. Unfortunately, the original submission included maps from Google Maps and Open Street Map (OSM), to show that the GeoJSON produced by my tool could work with either. Google Maps tile imagery is not freely available, so I had to replace that in order for PLoS to be able to publish my figures. At first I simply replaced the tiles Google Maps displays with ones from OSM, but those tiles are CC-BY-SA, which is incompatible with PLoS's use of CC-BY. Argh! I got stroppy about this on Twitter.

Eventually I discovered maps from CartoDB that have CC-BY licenses, and so could be used in the PLoS Currents article. After replacing Google's and OSM tiles with these maps (and trimming off the "Google" logo) the figures were acceptable to PLoS. Increasingly I think Creative Commons has resulted in a mess of mutually incompatible licenses that make mashing up things hard. The idea was great ("skip the intermediaries" by declaring that your content can be used), but the outcome is messy and frustrating.

But, enough grumbling. The article is out, the code is in GitHub. Now to think about how to use it.

Tuesday, May 19, 2015

Text mining for museum specimen identifiers

This post is a response to Ross Mounce's post Text mining for museum specimen identifiers. As Ross notes in that post, mining literature for specimen codes is something I've been interested in for a while (search for specimen codes on iPhylo), and Aime Rankin (formerly an undergraduate student at Glasgow) did some work on this as well. It's great to see progress in this area.

Here are some thoughts on Ross's post (I'm posting here rather than as a comment on Ross's blog because this is going to be long).

What questions to ask?

Obviously there's a lot of scope for metrics, such as numbers of citations for individual specimens, and league tables for collections (see GBIF specimens in BioStor: who are the top ten museums with citable specimens?). As Ross notes, there's also scope for updating out of date museum metadata with information from the literature (e.g., Linking data from the NHM portal with content in BHL), but even more interesting is the potential to cross-link databases in a way that permits novel queries. For example, if we have a paper on a disease that includes data we can link to a georeferenced specimen, then we can enable spatial queries for diseases (e.g., BHL and GBIF as biomedical databases).

Materials for mining

From my perspective the obvious corpus to mine is the Biodiversity Heritage Library (BHL). Ross repeats the erroneous view that BHL is just "legacy" literature. Apart from the obvious point that everything not published right now is, by definition, legacy, BHL has a lot of modern content (including papers published in the last couple of years).

Furthermore, there are journals that cite Natural History Museum specimens, including "in house" journals (e.g., Bulletin of the British Museum (Natural History) Zoology and Bulletin of the Natural History Museum. Zoology series), as well as the Bulletin of the British Ornithologists' Club which has published lots of new bird names for which the type specimen is often in the NHM.

I guess one issue is accessibility. Ross notes that:

The PMC OA subset is fantastic & really facilitates this kind of research – I wish ALL of the biodiversity literature was aggregated like (some) of the open access biomedical literature is. You can literally just download a million papers, click, and go do your research. It facilitates rigorous research by allowing full machine access to full texts.
So, how can we make BHL content as accessible? For each article I've extracted from BHL and stored in BioStor you can get full text by simply appending ".text" to the BioStor URL, but this isn't quite the same as grabbing a big dump of text.

The other source of mining is GenBank, which has a lot of sequences that have NHM vouchers, but also a weird and wonderful array of ways of recording those specimens. This is one reason I'm building "Material examined", to cope with these codes. For example sequence KF281084 has voucher "TRING 1877111743" which more traditionally would be written as "BMNH 1877.11.17.43", which is "NHMUK 1877.11.17.43" in the NHM database. This is just one example of the horrors of matching specimen codes (for more see the code for Material examined).
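To give a flavour of the kind of rewriting involved, here's a sketch that handles just this one register format (the year-then-pairs-of-digits split is a heuristic for this example, not a general solution):

```python
import re

# The GenBank voucher "TRING 1877111743" and the NHM's "NHMUK 1877.11.17.43"
# are the same specimen; rewrite the former into the latter.
def normalise_tring(voucher):
    m = re.match(r"TRING\s+(\d{4})(\d{2})(\d{2})(\d{2})$", voucher)
    if m is None:
        return None
    return "NHMUK %s.%s.%s.%s" % m.groups()

print(normalise_tring("TRING 1877111743"))  # NHMUK 1877.11.17.43
```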

One reason GenBank is useful is that the sequences are often linked to the literature, which means you get to make the link between specimen and literature without actually needing to mine the text itself (handy if access is problematic).

Bonus question: How should I publish this annotation data?

But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”. What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?
Personally I'd avoid RDF because that way lies madness (or at least endless detours haggling about ontologies).

But making the output useful is an important question. Despite the fact that it is a bit clunky, I suspect Darwin Core Archives are the way to go. The core data is a CSV table, so it's easy to generate, and also easy to use. Let's say you analysed a particular corpus (e.g., PLoS ONE); you could then output the data in Darwin Core (making sure both specimen and publication had stable identifiers), then package it up and upload to Zenodo or Figshare and get a DOI. For bonus points, it would be great to see this data on GBIF, but this would require (a) mapping NHM specimen codes to GBIF ids (the NHM has this), and (b) GBIF being able to recognise that the data you're adding is not new specimens but rather annotations of existing specimens.

Things to think about

Here are a couple of additional things to think about.

Specimen finding as a service

In the same way that we have taxonomic name-finding services, it would be great if we had a specimen code-finding service. I have code that I use in BioStor, but it would be great to have something that is robust, stable, and generalisable across multiple specimen codes. My tool Material examined focusses on parsing a single string rather than parsing a block of text, but adding that functionality is an obvious thing to do.

Markup as output

One concern I have with work that involves mining text is that we hardly ever store the intermediate step of text + located elements. Instead we get to see summary output (e.g., this page has these three scientific names, and these 10 specimen codes). As Terry Catapano (@catapanoth) once wisely pointed out "indexing is markup", in that if you find a substring in some text, you have in effect marked up the text. Can we preserve the marked-up text so that we can go back and look at it and improve our text mining methods, or make that markup available to others to build upon? There are all sorts of things which could be built upon this information; for example, imagine if the results were given to BHL so that people could search by specimen code.

Thursday, May 14, 2015

The value of ION to GBIF

This is a quick writeup of an analysis I did to make the case that the list of names held by the Index of Organism Names (ION) (part of Thomson Reuters) would be very useful for GBIF. I must declare a bias, in that I've spent a good chunk of the last 3-4 years exploring the ION database and investigating ways to link the taxonomic names it contains to the primary taxonomic literature, culminating in building BioNames.

What makes ION special is its scope (it endeavours to have all names covered by the ICZN), and that many of its names have associated citation information (i.e., details on the publication that published the name). Like any name database it has duplications and errors, and some of the older content is a bit ropey, but it's a tremendous resource and from my perspective nothing else in zoology comes close.

But rather than rely on anecdote, I decided to do a quick analysis to see what ION could potentially add to GBIF. I've been doing some work on bird names recently, so as an exercise I searched GBIF for holotype specimens for birds. The search (13 May 2015) returned 11,664 records. I then filtered those on taxonomic names that GBIF could not match exactly (TAXON_MATCH_FUZZY) or names that GBIF could only match to a higher rank (TAXON_MATCH_HIGHERRANK). The query URL is:

http://www.gbif.org/occurrence/search?TAXON_KEY=212&TYPE_STATUS=HOLOTYPE&ISSUE=TAXON_MATCH_FUZZY&ISSUE=TAXON_MATCH_HIGHERRANK

This query found 6,928 records, so over half the bird holotype specimens in GBIF do not match a taxonomic name in GBIF. What this means is that GBIF can't accurately place these names in its own taxonomic hierarchy. It also makes it hard to do meaningful analyses of things such as "how long does it take from when a bird specimen is collected to when it is described as a new species?", because if you can't match the name then you can't get the date the name was published.

To explore this further, I downloaded the results of the query (the download has DOI http://doi.org/10.15468/dl.vce3ay). I then wrote a script to parse the specimen records and extract the GBIF occurrence id, catalogue number, and scientific name. I then used the GBIF API to retrieve (where available) the verbatim record for each specimen (using the URL http://api.gbif.org/v1/occurrence/{id}/verbatim, where {id} is the occurrence id). This gives us the original name on the specimen, which I then looked up in BioNames using its API. If I got a hit I extracted the identifier of the name (the LSID in the ION database) and the corresponding publication id in BioNames (if available). If there was a publication associated with the name I then generated a human-readable citation using BioNames's citeproc API. The code for all this is on GitHub.
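Condensed, the pipeline looks something like this. The GBIF verbatim endpoint is real (though I've simplified the field access), while the BioNames search URL and response handling below are placeholders rather than the actual API:

```python
import requests

# Sketch of the matching pipeline: GBIF verbatim record -> original name -> BioNames.
def verbatim_name(occurrence_id):
    url = "http://api.gbif.org/v1/occurrence/%s/verbatim" % occurrence_id
    record = requests.get(url).json()
    return record.get("scientificName")  # key name simplified for this sketch

def lookup_in_bionames(name):
    # hypothetical endpoint standing in for the BioNames API used in the post
    r = requests.get("http://bionames.org/api/name/%s" % name)
    return r.json() if r.ok else None

name = verbatim_name(883603238)   # occurrence id from the table below
print(name, lookup_in_bionames(name))
```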

Here's a sample of the mapping:

| Occurrence | Holotype | GBIF matched name | Verbatim name | ION | BioNames | Publication |
| 883603238 | USNM PAL378357.3368464 | Porzana Vieillot, 1816 | Porzana severnsi | 879659 | 2c4f3... | Olson, S. L., & James, H. F. (1991). Descriptions of thirty-two new species of birds from the Hawaiian Islands: Part 1. Non-Passeriformes. Ornithological Monographs, 45, 1-88. doi:10.2307/40166794 |
| 858732312 | AMNH Skin-245914 | Otus choliba (Vieillot, 1817) | Otus choliba duidae | 430781 | 1b3315... | Chapman, F. M. (1929). Descriptions of new Birds from Mt. Duida, Venezuela. American Museum Novitates, 380, 1-27. http://hdl.handle.net/2246/3988 |
| 858732345 | AMNH Skin-245936 | Atlapetes Wagler, 1831 | Atlapetes duidae | 430779 | 1b3315... | Chapman, F. M. (1929). Descriptions of new Birds from Mt. Duida, Venezuela. American Museum Novitates, 380, 1-27. http://hdl.handle.net/2246/3988 |
| 858733764 | AMNH Skin-45339 | Leptotila Swainson, 1837 | Leptotila gaumeri Lawr. | | | |
| 858744126 | AMNH Skin-218110 | Zosterops Vigors & Horsfield, 1827 | Zosterops alberti ablita | | | |

The complete result of this mapping can be viewed here. Of the 6,392 holotypes with names not recognised by GBIF, nearly half (3,165, 49.5%) exactly matched a name in ION. Many of these are also linked to the publication that published that name.

So, adding ION helps us find half the missing holotype names. This is before doing anything more sophisticated, such as approximate string matching, resolving synonyms, etc. Hence, I'd argue that the names in ION would add a lot to GBIF's ability to interpret the occurrence records it receives from museums.

I've not had time for further analysis, but at first glance a lot of the missed names are subspecies, there are quite a few fossils, and many names are in the relatively older literature. However there are also some recently described taxa, such as the hawk-owl Ninox rumseyi Rasmussen et al. 2012, and a bunting subspecies from Tristan da Cunha (Nesospiza acunhae fraseri Ryan, 2008), that are missing from GBIF.

Friday, May 08, 2015

Putting some bite into the Bouchout Declaration

There are no requirements for signing up. A signature is first and foremost a statement of support for open data. Each signatory can determine how best to make progress towards the goal. Some recommendations are included in the declaration. We hope that signatories will become early adopters of the open access approach, that they will promote change in their institutions, societies and journals, and will position themselves and their institutions as leaders. (from http://www.bouchoutdeclaration.org/faqs/)
I've put off writing this post about the Bouchout Declaration for a number of reasons. I attended the meeting that launched the declaration last year, and from my perspective that was a frustrating meeting. Much talk about "Open Biodiversity Knowledge Management" with nobody seemingly willing or able to define it (see The vision thing - it's all about the links for some comments I made before attending the meeting), and as much as the signing of the Bouchout Declaration provided good theatre, it struck me as essentially an empty gesture. Public pronouncements are all well and good, but are ultimately of little value unless backed up by action. We have institutions that have signed the declaration yet have much of their intellectual output locked behind paywalls (e.g., JSTOR Global Plants). So much for being open.

So, since Donat challenged me, here's what I'd like to see happen. I'd like to see metrics of "openness" that we can use to evaluate just how open the signatories actually are. These metrics could be viewed as ways to try and persuade institutions into sharing data and other information, as a league table we can use to apply pressure, or as a way to survey the field and see what the impediments are to being open (are they financial, legal, cultural, resource, etc.).

Below are some of the things I think we could "score" the openness of biodiversity institutions.

Is the collection digitised and in GBIF?

A simple criterion that is easy to measure. If an institution has specimens or other biological material, is data and/or metadata on the collection freely available? What fraction of the collection has been digitised? How good is that digitisation (e.g., what fraction has been georeferenced)? We could define digitisation more broadly to include imaging and sequencing (both are methods of converting analogue specimens into digital objects).

Are the institutional publications digitised? Are they open access?

Some institutions have a history of digitising their in-house publications and making them freely available online (e.g., the AMNH), some even make them fully citable with CrossRef DOIs (e.g., the Australian Museum). But some institutions have, sadly, signed over their publications to commercial publishers or archives that charge for access (e.g., Kew's publications have been digitised by JSTOR, which limits their accessibility). As a footnote, I suspect that those institutions that lost confidence in their in-house publishing operations and outsourced them are the ones who have ended up losing control of their intellectual output, some of which is now closed off (e.g., some of the NHM London's journals are now the property of Cambridge University Press). Those institutions that maintained a culture of in-house publishing are the ones at the vanguard of digitising and opening up those publications.

Does the institution take part on the Biodiversity Heritage Library?

There are at least two ways to participate in the Biodiversity Heritage Library (BHL), one is by becoming a member and start scanning books from institutional libraries. The other is by granting permission to BHL to scan institutional publications. BHL is often viewed as an archive of "old" literature, but in fact it has some very recent content. Some farsighted organisations have let BHL scan their journals, contributing to BHL becoming an indispensable resource for biodiversity research.

Do institution staff publish in open access journals?

A while ago I complained about how few new species descriptions were in open access journals (The top-ten new species described in 2010 and the failure of taxonomy to embrace Open Access publication). A measure of openness is whether an institution encourages its staff to publish their work in open access journals, and to make their data freely available as well. Some prefer to chase Nature and Science papers, but I'd like to think we could prioritise openness over journal impact factor.

These are just some of the more obvious things that could be used to measure openness. At the same time, it would be useful to develop ways to show the benefits of being open. For example, I've long argued that we could develop citation tracking for specimens. This gives researchers a means to track the provenance of information (who said what about the identity of a specimen), and it also gives institutions a way to measure the impact of their collections. Doing this at scale is only going to be possible if collections are digitised, specimens have identifiers of some sort, and we can text mine the literature and associated data for those identifiers (in other words, the data and publications need to be open). So, perhaps one way to help make the case for being open is to develop metrics that are useful for the institutions themselves.

I guess I would have been much more enthusiastic about the Bouchout Declaration if these sort of things had been in place at the start. Anyone can sign a document. Ideas are cheap, execution is everything.

Tuesday, April 21, 2015

Looking up specimen codes in GBIF using Google Spreadsheet

Playing with the "material examined" tool I've been working on, I wondered whether I could make use of it in, say, a spreadsheet. Imagine that I have a spreadsheet of museum codes and want to look those up in GBIF. I could create a service for Open Refine, but Open Refine is a bit big and clunky: you have to fire up a Java application and point your browser at it, and Open Refine isn't as intuitive or as flexible as a spreadsheet.

It turns out that Google Spreadsheets supports custom functions, including importing JSON from a remote data source. Following How to import JSON data into Google Spreadsheets in less than 5 minutes, here's what to do:

  1. Create a new Google Spreadsheet.
  2. Click on Tools -> Script Editor.
  3. Click Create script for Spreadsheet.
  4. Delete the placeholder content and paste the code from this script.
  5. Rename the script to ImportJSON.gs and click the save button.
  6. Back in the spreadsheet, in a cell, you can type “=ImportJSON()” and begin filling out its parameters.

Let's imagine we have a spreadsheet with a specimen code in cell A1, e.g. "FMNH 187122".

[Screenshot: the specimen code "FMNH 187122" in cell A1]

To call the material examined service, we need a function like this:

=ImportJSON(CONCATENATE("http://bionames.org/~rpage/material-examined/service/api.php?code=",A1,"&match&extend=10"), "/hits/key,/hits/scientificName", "noHeaders")

Paste this into cell B1 (i.e., just to the right of the specimen code) and after a short delay you should see something like this:

[Screenshot: the query results appearing next to the specimen code]

The three parameters supplied to ImportJSON are the query URL (written as a spreadsheet function that grabs the specimen code from cell A1), a list of the bits of data we want to extract from the result (expressed as JSON paths), and some options (in this case, don't show the headers). ImportJSON grabs the specimen code in cell A1, adds it to the query URL, then outputs the results.

The first column is the GBIF occurrence ID, the second is the scientific name (you can add more JSON paths to get more fields).

Note that we have multiple rows as there is more than one specimen with the code "FMNH 187122" in GBIF. Now, we can ask the material examined service to return only certain taxa (such as mammals) by adding the "scientificName" parameter:

=ImportJSON(CONCATENATE("http://bionames.org/~rpage/material-examined/service/api.php?code=",A10,"&scientificName=",B10,"&match&extend=10"), "/hits/key,/hits/scientificName", "noHeaders")

If you put the specimen code in cell A10, and the higher taxon "Mammalia" in cell B10, and paste the function above into cell C10, then you should see something like this:

[Screenshot: a single row showing the mammal specimen]

Note that now we have a single row with the mammal specimen.

It's a little bit fussy (you need to install the ImportJSON script, and mess about a bit with the parameters), but it's quick and flexible, and you get all the power of a spreadsheet to help clean the data before trying to match it to GBIF. Plus you can do it all in your browser.
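If you'd rather script the lookup than use a spreadsheet, the same service call is easy to reproduce. Here's a minimal Python sketch, which assumes (based on the JSON paths above) that the service returns a "hits" array whose items have "key" and "scientificName" fields:

import requests

SERVICE = "http://bionames.org/~rpage/material-examined/service/api.php"

def lookup(code, scientific_name=None):
    # Same parameters as the spreadsheet formulas above
    params = {"code": code, "match": "", "extend": "10"}
    if scientific_name:
        # Optionally restrict hits to a higher taxon, e.g. "Mammalia"
        params["scientificName"] = scientific_name
    response = requests.get(SERVICE, params=params)
    response.raise_for_status()
    # Extract the same fields as the "/hits/key,/hits/scientificName" paths
    return [(hit.get("key"), hit.get("scientificName"))
            for hit in response.json().get("hits", [])]

print(lookup("FMNH 187122"))
print(lookup("FMNH 187122", "Mammalia"))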

Wednesday, April 15, 2015

GBIF Ebbe Nielsen Challenge finalists announced

The six finalists for the GBIF Ebbe Nielsen Challenge have been announced by GBIF:

“The creativity and ambition displayed by the finalists is inspiring,” said Roderic Page, chair of the Challenge jury and the GBIF Science Committee, who introduced the Challenge at GBIF’s 2014 Science Symposium in October.

“My biggest hope for the Challenge was that the biodiversity community would respond with innovative—even unexpected—entries,” Page said. “My expectations have been exceeded, and the Jury is eager to see what the finalists can achieve between now and the final round of judging.”

The finalists all receive a €1,000 prize, and now have the possibility to refine their work and compete for the grand prize of €20,000 (€5,000 for second place). As the rather cheesy quote above suggests, I think the challenge has been a success in terms of the interest generated and the quality of the entries. While the finalists bask in glory, it's worth thinking about the future of the challenge. If it is regarded as a success, should it be run in the same way next year? The first challenge was very open in scope (pretty much anything that used GBIF data); would it be better to target the challenge on a more focussed area? If so, which area needs the most attention? Food for thought.

Linking specimen codes to GBIF

I've put together a working demo of some code I've been working on to discover GBIF records that correspond to museum specimen codes. The live demo is at http://bionames.org/~rpage/material-examined/ and code is on GitHub.

To use the demo, simply paste in a specimen code (e.g., "MCZ 24351") and click Find, and it will do its best to parse the code, then go off to GBIF and see what it can find. Some examples that are fun include MCZ 24351, KU:IT:00312, MNHN 2003-1054, and AMS I33708-051.

[Screenshot of the material examined demo]

It's a proof of concept at this stage, and the search is "live"; I'm not (yet) storing any results. For now I simply want to explore how well it can find matches in GBIF.
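For the curious, the parsing step can be approximated with a regular expression. The sketch below splits the example codes into institution, collection, and number parts; the pattern is just my guess at some common formats, not the grammar the demo actually uses:

import re

# Guessed pattern: institution acronym, optional collection code, catalogue number
PATTERN = re.compile(
    r"^(?P<institution>[A-Z]+)"               # e.g. MCZ, KU, MNHN, AMS
    r"[\s:]*(?P<collection>[A-Z]{2,})?"       # optional collection code, e.g. IT
    r"[\s:]*(?P<number>[A-Z]?\d[\d\s.:-]*)$"  # e.g. 24351, 2003-1054, I33708-051
)

def parse_code(code):
    match = PATTERN.match(code.strip())
    return match.groupdict() if match else None

for code in ["MCZ 24351", "KU:IT:00312", "MNHN 2003-1054", "AMS I33708-051"]:
    print(code, "->", parse_code(code))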

By itself this isn't terribly exciting, but it's a key step towards some of the things I want to do. For example, the NCBI is interested in flagging sequences from type specimens (see http://dx.doi.org/10.1093/nar/gku1127), so we could imagine taking lists of type specimens from GBIF and trying to match those to voucher codes in GenBank. I've played a little with this; unfortunately, there seem to be lots of cases where GBIF doesn't know that a specimen is, in fact, a type.
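As a sketch of what the matching could look like: GBIF's occurrence search API supports a typeStatus filter, so we can pull the holotypes GBIF does know about for a taxon and compare their catalogue numbers with GenBank specimen_voucher strings. The field names are my reading of the JSON the API returns, so treat the details as assumptions:

import requests

def gbif_holotypes(name, limit=20):
    # GBIF occurrence search, filtered to records flagged as holotypes
    response = requests.get(
        "http://api.gbif.org/v1/occurrence/search",
        params={"scientificName": name, "typeStatus": "HOLOTYPE", "limit": limit})
    response.raise_for_status()
    for record in response.json().get("results", []):
        yield (record.get("institutionCode"),
               record.get("catalogNumber"),
               record.get("scientificName"))

for row in gbif_holotypes("Rhacophorus"):
    print(row)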

Another thing I’m interested in is cases where GBIF has a georeferenced specimen but GenBank doesn’t (or vice versa), as a stepping stone towards creating geophylogenies. For example, in order to create a geophylogeny for Agnotecous crickets in New Caledonia (see GeoJSON and geophylogenies) I needed to combine sequence data from NCBI with locality data from GBIF.

It’s becoming increasingly clear to me that the data supplied to GBIF is often horribly out of date compared to what is in the literature. Often all GBIF gets is what was scribbled in a collection catalogue. By linking GBIF records to specimen codes that are cited in the literature we could imagine giving GBIF users enhanced information on a given occurrence (and at the same time get citation counts for specimens; see The impact of museum collections: one collection ≈ one Nobel Prize).

Lastly, if we can link specimens to sequences and the literature, then we can populate more of the biodiversity knowledge graph.

Tuesday, March 10, 2015

GBIF Ebbe Nielsen Challenge submissions: judging begins

The GBIF Ebbe Nielsen Challenge has closed and we have 23 submissions for the jury to evaluate. There's quite a range of project types (and media, including sound and physical objects), and it's going to be fascinating to evaluate all the entries. This is the first time GBIF has run this challenge, so it's gratifying to see so much creativity in response to it. While judging itself is limited to the jury (of which I'm a member), I'd encourage anyone interested in biodiversity informatics to browse the submissions. Although you can't leave comments directly on the submissions within the GBIF Challenge pages, each submission also appears on the portfolio page of the person or organisation that created the entry, so you can leave comments there (follow the link at the bottom of the page for each submission to see it on the portfolio page).

Friday, February 20, 2015

More examples of data duplication and loss in GBIF: Australian bats in bits

Quick notes on another example of data duplication in GBIF. I'm in the process of building a tool to map specimen codes to GBIF records, and came across the following example. Consider the specimen code "AM M.22320", which is the voucher for the sequence KJ532444 (GenBank gives the voucher as M22320, but the associated paper doi:10.1016/j.ympev.2014.03.009 clarifies that this specimen comes from the Australian Museum). Locating this specimen in GBIF I found a series of records that were identical except for the catalogNumbers, which looked like this: M.22320.001, M.22320.002, etc. What gives?

Initially I thought this might be a simple case of data duplication (maybe the suffixes represent different versions of the same record?). Then I managed to locate the records on the Australian Museum web site:

  • M.22320.009 - Wet Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.008 - Skull Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.001 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.005 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.006 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.007 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.003 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.004 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.002 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990, Holotype
  • M.22320.010 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island, (8° 31' S, 157° 52' E), 25 Jun 1990

Turns out we have 10 records for "M.22320", covering various preparations and tissue samples. The holotype specimen of Pteralopex taki (originally described in doi:10.1071/AM01145, see BioNames) has generated 10 different records, all of which have ended up in GBIF. Anyone using GBIF occurrence data and interpreting the number of occurrence records as a measure of how abundant an organism is at a given locality is clearly going to be misled by data like this.

One way to tackle this problem would be for GBIF (or the data provider) to cluster the records that represent the "same" specimen, so that GBIF doesn't end up duplicating the same information (in this case, 10-fold). The Australian Museum records don't seem to specify a direct link between the 10 records. I then located the same records in OZCAM, the data provider that feeds GBIF. Here is the OZCAM record for "M.22320.001": http://ozcam.ala.org.au/occurrence/223d1549-1322-419e-8af4-649a4b145064. OZCAM doesn't have the information on whether the record is a skull, a wet preparation, or a tissue sample; that information has been lost, and hence doesn't make it as far as GBIF.
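A crude first pass at clustering would be to strip the preparation suffix and group on what's left. Here's a sketch; the ".001"-style suffix is just the convention in this example, and other collections will differ:

import re
from collections import defaultdict

SUFFIX = re.compile(r"\.\d{3}$")  # trailing ".001", ".002", ... as in M.22320.001

def cluster_by_specimen(catalog_numbers):
    clusters = defaultdict(list)
    for number in catalog_numbers:
        clusters[SUFFIX.sub("", number)].append(number)
    return dict(clusters)

parts = ["M.22320.00%d" % i for i in range(1, 10)] + ["M.22320.010"]
print(cluster_by_specimen(parts))
# All ten records collapse into a single "M.22320" cluster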

Note that OZCAM has resolvable identifiers for each specimen in the form of UUIDs that are appended to "http://ozcam.ala.org.au/occurrence/". The corresponding UUIDs are included in the Darwin Core dump that OZCAM makes available to GBIF. Here they are for the parts of M.22320:

"223d1549-1322-419e-8af4-649a4b145064","M.22320.001",...
"c40a7eea-6e04-4be6-8dcb-4473402e48c4","M.22320.002",...
"21fcaea1-c645-49d9-9753-dbd9dd2bc64a","M.22320.003",...
"34ffd935-9fb4-44a5-acb8-2cd4df5ade62","M.22320.004",...
"03635fb8-f9ac-4c4c-898b-859cd42f1e26","M.22320.005",...
"a1c4dd5a-dc03-45cc-8971-931c739df8b2","M.22320.006",...
"71c91030-405c-4390-8ec3-42a5478a2fd8","M.22320.007",...
"0f1a9326-34d0-4fb2-b89a-9856bd9082f0","M.22320.008",...
"86270ef7-07f6-4395-84c7-66d5d497cc01","M.22320.009",...

But when GBIF parses the dump it ignores these UUIDs, which means a GBIF user can't easily get to the OZCAM site, which has a bunch of other useful information (compare http://ozcam.ala.org.au/occurrence/223d1549-1322-419e-8af4-649a4b145064 with http://www.gbif.org/occurrence/774916561/verbatim). It also means that GBIF has stripped out an identifier that we could use to refer unambiguously to each record (and, presumably, this UUID doesn't change between harvests of OZCAM data).
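You can see what survives harvesting by asking the GBIF API for the verbatim (as-supplied) version of an occurrence. Here's a sketch; I'm assuming the verbatim response is keyed by full Darwin Core term URIs, and guessing that occurrenceID is where the OZCAM UUID would travel:

import requests

def verbatim_occurrence_id(gbif_key):
    # The verbatim endpoint returns the record as supplied to GBIF
    response = requests.get(
        "http://api.gbif.org/v1/occurrence/%d/verbatim" % gbif_key)
    response.raise_for_status()
    return response.json().get("http://rs.tdwg.org/dwc/terms/occurrenceID")

# None here would mean the provider's identifier didn't survive the trip
print(verbatim_occurrence_id(774916561))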

In summary, this is a bit of a mess: we have multiple records that are really just bits of the same specimen but which are not linked together by any data provider, and as the data is transmitted up the chain to GBIF the clues as to what is going on are stripped out. For a user like me who is trying to link a GenBank sequence to its voucher this is frustrating, and ultimately all rather avoidable if we took just a little more care in how we represent data about specimens, and in how we treat that data as it gets transmitted between databases.

Thursday, February 19, 2015

Post publication review on PubPeer

PubPeer is a web site where people can discuss published articles, anonymously if they prefer. I finally got a chance to play with it a few days ago, and it was a fascinating experience. You simply type in the DOI or PMID for an article and see if anyone has said anything about it. It also automatically pulls comments from PubMed Commons; for example, the article Putting GenBank data on the map has a comment that was originally published as a guest post on this blog. PubPeer knows about this blog post via Altmetric, which is another nice feature. PubPeer also has browser extensions which, if you install one, automatically flag DOIs on web pages that have comments on PubPeer. Also nice.

So, I took PubPeer for a spin. While browsing GenBank and GBIF, as you do, I came across the following paper: "Conservation genetics of Australasian sailfin lizards: Flagship species threatened by coastal development and insufficient protected area coverage" doi:10.1016/j.biocon.2013.10.014. Some of the sequences from this paper, such as KF874877, are flagged as "UNVERIFIED". Puzzled by this, I raised the issue on PubPeer (see https://pubpeer.com/publications/D1090D7AF8178B1A10C4C45AC1006E). A little further digging led to the suggestion that they were numts. After I raised the issue on Twitter, one of the authors (Cameron Siler) got in touch and reported that there had been an accidental deletion of a single nucleotide in an alignment. Cameron is updating the Dryad data (http://dx.doi.org/10.5061/dryad.1fs7c) and the GenBank sequences.

I like the idea that there is a place we can go to discuss the contents of a paper. It's not controlled by the journal, and you can either identify yourself or remain anonymous if you prefer. Not everyone is a fan of this mode of commentary, especially as it is possible for people to make all sorts of accusations while remaining anonymous. But it's a fascinating project, and well worth spending some time browsing around (what IS it with physicists?). For anyone interested in annotating data, it's also a nice example of one possible approach.

Wednesday, January 28, 2015

Annotating GBIF, from datasets to nanopublications

Below I sketch what I believe is a straightforward way GBIF could tackle the issue of annotating and cleaning its data. It continues a series of posts Annotating GBIF: some thoughts, Rethinking annotating biodiversity data, and More on annotating biodiversity data: beyond sticky notes and wikis on this topic.

Let's simplify things a little and state that GBIF at present is essentially an aggregation of Darwin Core Archive files. These are for the most part simply CSV tables (spreadsheets) with some associated administrivia (AKA metadata). GBIF consumes Darwin Core Archives, does some post-processing to clean things up a little, then indexes the contents on key fields such as catalogue number, taxon name, and geographic coordinates.

What I'm proposing is that we make use of this infrastructure, in that any annotation is itself a Darwin Core Archive file that GBIF ingests. I envisage three typical use cases:

  1. A user downloads some GBIF data, cleans it for their purposes (e.g., by updating taxonomic names, adding some georeferencing, etc.), then uploads the edited data to GBIF as a Darwin Core Archive. This edited file gets a DOI (unless the user has got one already, say by storing the data in a digital archive like Zenodo).
  2. A user takes some GBIF data and enhances it by adding links to, for example, sequences in GenBank for which the GBIF occurrences are voucher specimens, or references which cite those occurrences. The enhanced data set is uploaded to GBIF as a Darwin Core Archive and, as above, gets a DOI.
  3. A user edits an individual GBIF record, say using an interface like this. The result is stored as a Darwin Core Archive with a single row (corresponding to the edited occurrence), and gets a DOI (this is a nanopublication, of which more later); see the sketch after this list.
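To make use case 3 concrete, here's a sketch of packaging a single edited occurrence as a minimal Darwin Core Archive: one CSV row plus a meta.xml descriptor. The file layout and choice of columns are illustrative assumptions, not a specification:

import csv
import zipfile

META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
 <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
       fieldsTerminatedBy="," linesTerminatedBy="\\n" ignoreHeaderLines="1">
  <files><location>occurrence.csv</location></files>
  <id index="0"/>
  <field index="1" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
  <field index="2" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
 </core>
</archive>"""

def annotation_archive(filename, occurrence_id, latitude, longitude):
    # One row of Darwin Core: the record being annotated plus the edited fields
    with open("occurrence.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["occurrenceID", "decimalLatitude", "decimalLongitude"])
        writer.writerow([occurrence_id, latitude, longitude])
    with zipfile.ZipFile(filename, "w") as archive:
        archive.write("occurrence.csv")
        archive.writestr("meta.xml", META)

# A single-record annotation, ready to deposit somewhere that mints DOIs
annotation_archive("annotation.zip", "887386322", -17.44, 145.86301)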

Note that I'm ignoring the other type of annotation, which is to simply say "there is a problem with this record". This annotation doesn't add data, but instead flags an issue. GBIF has a mechanism for doing this already, albeit one that is deeply unsatisfactory and isn't integrated with the portal (you can't tell whether anyone has raised an issue for a record).

Note also that at this stage we've done nothing that GBIF doesn't already do, or isn't about to do (e.g., minting DOIs for datasets). Now, there is one inevitable consequence of this approach, namely that we will have more than one record for the same occurrence: the original one in GBIF, and the edited record. But we are in this situation already. GBIF has duplicate records, lots of them.

Duplication

As an example, consider the following two occurrences for Psilogramma menephron:

occurrence    taxon                                 longitude   latitude   catalogue number   sequence
887386322     Psilogramma menephron Cramer, 1780    145.86301   -17.44     BC ZSM Lep 01337   -
1009633027    Psilogramma menephron Cramer, 1780    145.86      -17.44     KJ168695           KJ168695

These two occurrences come from the Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data and the Geographically tagged INSDC sequences datasets, respectively. They are for the same occurrence (you can verify this by looking at the metadata for the sequence KJ168695, where the specimen_voucher field is "BC ZSM Lep 01337").

What do we do about this? One approach would be to group all such occurrences into clusters that represent the same thing. We are then in a position to do some interesting things, such as compare different estimates of the same values. In the example above, there is clearly a difference in precision of geographic locality between the two datasets. There are some nice techniques available for synthesising multiple estimates of the same value (e.g., Bayesian belief networks), so we could provide for each cluster a summary of the possible values for each field. We can also use these methods to build up a picture of the reliability of different sources of annotation.
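Even without the full Bayesian machinery, a simple comparison across a cluster surfaces useful signal. Here's a sketch using the two Psilogramma menephron records from the table above:

def compare_field(cluster, field):
    # Map occurrence id -> reported value, and flag whether they all agree
    values = {record["occurrence"]: record.get(field) for record in cluster}
    return len(set(values.values())) == 1, values

cluster = [
    {"occurrence": "887386322", "longitude": 145.86301, "latitude": -17.44},
    {"occurrence": "1009633027", "longitude": 145.86, "latitude": -17.44},
]

for field in ("longitude", "latitude"):
    agrees, values = compare_field(cluster, field)
    print(field, "agree" if agrees else "differ", values)
# latitude agrees; longitude differs because the two datasets report
# the same locality at different precision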

In a sense, we can regard one record (1009633027) as adding an annotation to the other (887386322), namely adding the DNA sequence KJ168695 (in Darwin Core parlance, "associatedSequences=[KJ168695]").

But the key point here is that GBIF will have to at some point address the issue of massive duplication of data, and in doing so it will create an opportunity to solve the annotation problem as well.

Github and DOIs

In terms of practicalities, it's worth noting that we could use github to manage editing GBIF data, as I've explored in GBIF and Github: fixing broken Darwin Core Archives. Although github might not be ideal (there are some very cool alternatives being developed, such as dat; see also the interview with Max Ogden), it has the nice feature that you can publish a release and get a DOI via its integration with Zenodo. So people can work on datasets and create citable identifiers at the same time.

Nanopublications

If we consider that a Darwin Core Archive is basically a set of rows of data, then the minimal unit is a single row (corresponding to a single occurrence). This is the level at which some users will operate. They will see an error in GBIF and be able to edit the record (e.g., by adding georeferencing, an identification, etc.). One challenge is how to create incentives for doing this. One approach is to think in terms of nanopublications, defined as:
A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
A nanopublication comprises three elements:
  1. The assertion: In this context the Darwin Core record would be the assertion. It might be a minimal record in that, say, it only listed the fields relevant to the annotation.
  2. The provenance: the evidence for the assertion. This might be the DOI of a publication that supports the annotation.
  3. The publication information: metadata for the nanopublication, including a way to cite the nanopublication (such as a DOI), and information on the author of the nanopublication. For example, the ORCID of the person annotating the GBIF record.

As an example, consider GBIF occurrence 668534424 for specimen FMNH 235034, which according to GBIF is a specimen of Rhacophorus reinwardtii. In a recent paper

Matsui, M., Shimada, T., & Sudin, A. (2013, August). A New Gliding Frog of the Genus Rhacophorus from Borneo . Current Herpetology. Herpetological Society of Japan. doi:10.5358/hsj.32.112
Matsui et al. assert that FMNH 235034 is actually Rhacophorus borneensis, based on a phylogenetic analysis of a sequence (GQ204713) derived from that specimen. In which case, we could have something like this:
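Here is a minimal flat sketch: the field names are Darwin Core-flavoured illustrations rather than the nanopublication standard, and the annotation's DOI and the annotator's ORCID are placeholders.

# The three elements of the nanopublication, flattened into one structure
nanopublication = {
    "assertion": {
        # A minimal Darwin Core record carrying the new identification
        "institutionCode": "FMNH",
        "catalogNumber": "235034",
        "scientificName": "Rhacophorus borneensis",
        "associatedSequences": "GQ204713",
    },
    "provenance": {
        # The evidence: the paper making the assertion
        "supportedBy": "doi:10.5358/hsj.32.112",
    },
    "publicationInfo": {
        "id": "doi:10.5072/example-nanopub",    # placeholder DOI for the annotation itself
        "author": "orcid:0000-0000-0000-0000",  # placeholder ORCID for the annotator
    },
}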

The nanopublication standard is evolving, and has a lot of RDF baggage that we'd need to simplify to fit the Darwin Core model of a flat row of data, but you could imagine a nanopublication which is a Darwin Core Archive that includes the provenance and publication information, and gets a citable identifier so that the person who created the nanopublication (in the example above, I am the author) can get credit for the work involved in creating the annotation. Using citable DOIs and ORCIDs to identify the nanopublication and its author embeds it in the wider citation graph.

Note that nanopublications are not really any different from larger datasets; indeed, we can think of a dataset of, say, 1000 rows as simply an aggregation of nanopublications. However, one difference is that I think GBIF would have to set up the infrastructure to manage the creation of nanopublications (which is basically collecting the user's input, adding their user ID, saving, and minting a DOI). Whereas users working with large datasets may well be happy to work with those on, say, github or some other data editing environment, people willing to edit single records are unlikely to want to mess with that complexity.

What about the original providers?

Under this model, the original data provider's contribution to GBIF isn't touched. A user who adds an annotation is, in effect, adding a copy of the record with some differences (corresponding to the user's edits). Now, the data provider may choose to accept those edits, in which case they can edit their own database using whatever system they have in place, and then the next time GBIF re-harvests the data, the original record in GBIF gets updated with the new data (this assumes that data providers have stable ids for their records). Under this approach we free ourselves from thinking about complicated messaging protocols between providers and aggregators, and we also free ourselves from having to wait until an edit is "approved" by a provider. Any annotation is available instantly.

Summary

My goal here is to sketch out what I think is a straightforward way to tackle annotation that makes use of what GBIF is already doing (aggregating Darwin Core Archives) or will have to do real soon now (clustering duplicates). The annotated and cleaned data can, of course, live anywhere (and I'm suggesting that it could live on github and be archived on Zenodo), so people who clean and edit data are not simply doing it for the good of GBIF, they are creating datasets that can be used and cited independently. Likewise, even if somebody goes to the trouble of fixing a single record in GBIF, they get a citable unit of work that will be linked to their academic profile (via ORCID).

Another aspect of this approach is that we don't actually need to wait for GBIF to do this. If we adopt Darwin Core Archive as the format for annotations, we can create annotations, mint DOIs, and build our own database of annotated data, with a view to being able to move that work to GBIF if and when GBIF is ready.