Thursday, October 25, 2018

Taxonomic publications as patch files and the notion of taxonomic concepts

There's a slow-burning discussion on taxonomic concepts on GitHub that I am half participating in. As seems inevitable in any discussion of taxonomy, there's a lot of floundering about: there's a lot of jargon - much of it used in different ways by different people - and people come at the problem from different perspectives.

In one sense, taxonomy is pretty straightforward. We have taxonomic names (labels), we have taxa (sets) that we apply those labels to, and a classification (typically a set of nested sets, i.e., a tree) of those taxa. So, if we download, say, GenBank, GBIF, or BOLD, we can pretty easily model names (e.g., a list of strings) and the taxonomic tree (e.g., a parent-child hierarchy), and we have a straightforward definition of the terminal taxa (leaves) of the tree: they comprise the specimens and observations (GBIF), or sequences (GenBank and BOLD), assigned to that taxon (i.e., for each specimen or sequence we have a pointer to the taxon to which it belongs).
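
To make that concrete, here's a toy sketch in Python of this data model; the names and record identifiers are invented for illustration, not taken from any real dataset:

```python
# Toy sketch of the data model described above: names as strings,
# the classification as a child -> parent hierarchy, and terminal taxa
# defined by the records (specimens, sequences) that point to them.
# All identifiers and names here are invented for illustration.

classification = {
    "Brookesia": "Chamaeleonidae",
    "Brookesia minima": "Brookesia",
    "Brookesia superciliaris": "Brookesia",
}

# each occurrence record points to the taxon it is assigned to
records = {
    "occurrence-1": "Brookesia minima",
    "occurrence-2": "Brookesia superciliaris",
}

def members(taxon):
    """Return the records that belong to a taxon or any of its descendants."""
    def is_descendant(name):
        while name is not None:
            if name == taxon:
                return True
            name = classification.get(name)
        return False
    return [r for r, t in records.items() if is_descendant(t)]

print(members("Brookesia"))  # ['occurrence-1', 'occurrence-2']
```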

Given this, one response to the taxonomic concept discussion is to simply ignore it as irrelevant, and we can demonstrably do a lot of science without it. I suspect most people dealing with GBIF and GenBank data aren't aware of the taxonomic concept issue. Which raises the question: why the ongoing discussion about concepts?

Perhaps the fundamental issue is that taxonomic classification changes over time, and hence the interpretation of a taxon can change over time. In other words, the problem is one of versioning. Once again, the simplest strategy to deal with this is simply to use the latest version. In much the same way that most of us probably just read the latest version of a Wikipedia page, and many of us are happy to have our phone apps update automatically, I suspect most are happy to just grab the latest version and do some, you know, science.

I think taxonomic concepts really become relevant when we are aggregating data from sources where the data may not be current: in other words, where the data is associated with a particular taxonomic name and the interpretation of that name has changed since the last time the data was curated. If the relationships of a taxon or specimen can be computed on the fly, e.g., if the data is a DNA barcode, then this issue is less relevant because we can simply re-cluster the sequences and discover where the specimen with that sequence belongs in a new classification. But for many specimens we don't have sufficient information to do this computation (this is one reason DNA barcodes are so useful: everything needed to determine a barcode's relationships is contained in the sequence itself).

To make this concrete, consider the genus Brookesia in GBIF (GBIF:2449310).

[Screenshot: GBIF occurrence map for Brookesia]

According to Wikipedia Brookesia is endemic to Madagascar, so why does it appear on the African mainland? There are two records from mainland Africa: Brookesia ionidesi, collected in 1957, and Brookesia temporalis, collected in 1926. Both represent taxa that were in the genus Brookesia at one point but are now in different genera. So our notion of Brookesia has changed over time, but curation of these records has yet to catch up with that.

So, what would be ideal is a timestamped series of classifications, so that we could go back in time and see what a given taxon meant at a given time, and then go forward to see the status of that taxon today. Building such a timestamped series is not a trivial task; indeed, it may only be possible for well-studied groups. Birds are one such group, where each year eBird updates the current bird classification based on taxonomic activity over the previous year. As part of the GitHub discussion I posted a visual "diff" between two bird classifications:

[Figure: part of a visual diff between two bird classifications]

You can see the complete diff here, and the blog post Visualising the difference between two taxonomic classifications for details on the method. The illustration above shows the movement of one species from Sasia to Verreauxia.

So, given two classifications we can compute the difference between them, and represent that difference as an "edit script": the operations needed to convert one tree into another. These edits are essentially what taxonomists do when they revise a group: they move species from one genus to another, merge some taxa, sink others into synonymy, and so on. So taxonomy is essentially creating a series of edit files ("patches") to a classification. At a recent workshop in Ottawa Karen Cranston pointed out that the Open Tree of Life has been accumulating amendments to their classification, and that these are essentially patch files.
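
To make the idea of an edit script concrete, here's a toy sketch (not the method actually used for the visual diff above) that compares two versions of a classification, each stored as a child-to-parent map, and reports the changes; the species epithet is used purely for illustration:

```python
# Toy illustration of computing an "edit script" between two classifications,
# each represented as a child -> parent mapping. Not the algorithm behind
# the visual diff above, just a sketch of the idea.

def diff_classifications(old, new):
    edits = []
    for child, old_parent in old.items():
        new_parent = new.get(child)
        if new_parent is None:
            edits.append(("remove", child, old_parent))
        elif new_parent != old_parent:
            edits.append(("move", child, old_parent, new_parent))
    for child, parent in new.items():
        if child not in old:
            edits.append(("add", child, parent))
    return edits

old = {"africana": "Sasia"}
new = {"africana": "Verreauxia"}
print(diff_classifications(old, new))
# [('move', 'africana', 'Sasia', 'Verreauxia')]
```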

Hence, we could have a markup language for taxonomic work that describes that work in terms of edit operations which can then be automatically applied to an existing classification. We could imagine encoding all the bird taxonomy for a year in this way, applying those patches to the previous year's tree, and out pops the new classification. The classification becomes an evolving document under version control (think GitHub for trees). Of course, we'd need something to detect whether two different papers were proposing incompatible changes, but that's essentially a tree compatibility problem.
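
Here's a rough sketch of what such a patch might look like and how it could be applied. The format is invented purely to illustrate the idea; a real markup language would also need to record authorship, evidence, and the publication proposing each change:

```python
# Sketch of applying taxonomic "patches" to a classification to produce the
# next version. The patch format here is invented to make the idea concrete.

classification_2017 = {
    "africana": "Sasia",
    "abnormis": "Sasia",
}

patch_2018 = [
    {"op": "move", "taxon": "africana", "from": "Sasia", "to": "Verreauxia"},
]

def apply_patch(classification, patch):
    result = dict(classification)
    for edit in patch:
        if edit["op"] == "move":
            # a crude check for incompatible edits: the taxon must currently
            # sit where the patch expects it to be
            assert result.get(edit["taxon"]) == edit["from"], "incompatible edit"
            result[edit["taxon"]] = edit["to"]
    return result

classification_2018 = apply_patch(classification_2017, patch_2018)
print(classification_2018)
# {'africana': 'Verreauxia', 'abnormis': 'Sasia'}
```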

One way to store version information would be to use time-based versioned graphs. Essentially, we start with each node in the classification tree having a start date (e.g., 2017) and an open-ended end date. A taxonomic work post 2017 that, say, moved a species from one genus to another would set the end date for the parent-child link between genus and species, and create a new timestamped node linking the species to its new genus. To generate the 2018 classification we simply extract all links in the tree whose date range includes 2018 (which means the old generic assignment for the species is not included). This approach gives us a mechanism for automating the updating of a classification, as well as time-based versioning.
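
A minimal sketch of this scheme (names and dates invented for illustration):

```python
# Minimal sketch of time-based versioning of parent-child links.
# Each link has a start year and an open-ended end year; a revision
# "closes" the old link and opens a new one. Names and dates invented.

INFINITY = float("inf")

# (child, parent, start_year, end_year)
links = [
    ("africana", "Sasia", 2017, INFINITY),
]

def move_species(links, child, new_parent, year):
    """Close the current parent link for `child` and open a new one."""
    updated = []
    for (c, p, start, end) in links:
        if c == child and end == INFINITY:
            updated.append((c, p, start, year))          # close the old link
        else:
            updated.append((c, p, start, end))
    updated.append((child, new_parent, year, INFINITY))  # open the new link
    return updated

def classification_at(links, year):
    """Extract the classification valid in a given year."""
    return {c: p for (c, p, start, end) in links if start <= year < end}

links = move_species(links, "africana", "Verreauxia", 2018)
print(classification_at(links, 2017))  # {'africana': 'Sasia'}
print(classification_at(links, 2018))  # {'africana': 'Verreauxia'}
```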

I think something along these lines would be genuinely useful, and would help focus the taxonomic discussion on solving a specific problem.

Wednesday, October 24, 2018

Specimens, collections, researchers, and publications: towards social and citation graphs for natural history collections

Being in Ottawa last week for a hackathon meant I could catch up with David Shorthouse (@dpsSpiders). David has been doing some neat work on linking specimens to identifiers for researchers, such as ORCIDs, and tracking citations of specimens in the literature.

David's Bloodhound tool processes lots of GBIF occurrence data, extracting the names of those who collected or identified specimens. If you have an ORCID (and if you are a researcher you really should) then you can "claim" your specimens simply by logging in with your ORCID. My modest profile lists New Zealand crabs I collected while an undergraduate at Auckland University.

[Screenshot: my Bloodhound profile]

Unlike many biodiversity projects, Bloodhound is aimed squarely at individual researchers: it provides a means for you to show your contribution to collecting and identifying the world's biodiversity. This raises the possibility of one day being able to add this information to your ORCID profile (in the same way that ORCID can currently record your publications, data sets, and other work attached to a DOI). As David explains:

A significant contributing factor for this apparent neglect is the lack of a professional reward system; one that articulates and quantifies the breadth and depth of activities and expertise required to collect and identify specimens, maintain them, digitize their labels, mobilize the data, and enhance these data as errors and omissions are identified by stakeholders. If people throughout the full value-chain in natural history collections received professional credit for their efforts, ideally recognized by their administrators and funding bodies, they would prioritize traditionally unrewarded tasks and could convincingly self-advocate. Proper methods of attribution at both the individual and institutional level are essential.

Attribution at the institutional level is an ongoing theme for natural history collections: how do they successfully demonstrate the value of their collections?

Mark Carnall's (@mark_carnall) tweet illustrates the mismatch between a modern world of interconnected data and the reality of museums trying to track usage of their collections by requesting reprints. The idea of tracking citations of specimens and/or collections has been around for a while. For example, I did some work text mining BioStor for museum specimen codes; Ross Mounce and Aime Rankin have worked on tracking citations of Natural History Museum specimens (https://github.com/rossmounce/NHM-specimens); and there is the clever use of Google Scholar by Winker and Withrow (see The impact of museum collections: one collection ≈ one Nobel Prize and https://doi.org/10.1038/493480b).

David has developed a nice tool that shows citations of specimens and/or collections from the Canadian Museum of Nature.

[Screenshot: specimen citation tracking for the Canadian Museum of Nature]

I'm sure many natural history collections would love a tool like this!

Note the altmetric.com "doughnuts" showing the attention each publication is receiving. These doughnuts are possible only because the publishing industry got together and adopted the same identifier system (DOIs). The existence of persistent identifiers enables a whole ecosystem to emerge based around those identifiers (and services to support those identifiers).

The biodiversity community has failed to achieve something similar, despite several attempts. Part of the problem is the cargo-cult obsession with "identifiers" rather than focussing on the bigger picture. So we have various attempts to create identifiers for specimens (see "Use of globally unique identifiers (GUIDs) to link herbarium specimen records to physical specimens" https://doi.org/10.1002/aps3.1027 for a review), but little thought given to how to build an ecosystem around those identifiers. We seem doomed to recreate all the painful steps publishers went through as they created a menagerie of identifiers (e.g., SICIs, PIIs) and alternative linking strategies ("just in time" versus "just in case") until they settled on managed identifiers (DOIs) with centralised discovery tools (provided by CrossRef).

Specimen-level identifiers are potentially very useful, especially for cross-linking records in GBIF, GenBank, and BOLD, as well as for tracking citations, but not every taxonomic community has a history of citing specimens individually. Hence we may also want to count citations at the collection and institutional level. Once again we run into the issue that we lack persistent, widely used identifiers. The GRBio project to assign such identifiers has died, despite appeals to the community for support (see GRBio: A Call for Community Curation - what community?). Given Wikidata's growing role as an identity broker, a sensible strategy might be to focus on having every collection and institution in Wikidata (many are already) and add the relevant identifiers there. For example, Index Herbariorum codes are now a recognised property in Wikidata, as seen in the entry for Cambridge University Herbarium (CGE).
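
As a sketch of what this enables, the query below asks Wikidata for collections that have an Index Herbariorum code, looking the property up by its English label rather than hard-coding a property ID. I'm assuming the label is exactly "Index Herbariorum code"; check Wikidata if the query returns nothing:

```python
# Sketch: query Wikidata for collections that have an Index Herbariorum code.
# The property is found by its English label rather than a hard-coded
# property ID. Requires the `requests` package.
import requests

query = """
SELECT ?collection ?collectionLabel ?code WHERE {
  ?property rdfs:label "Index Herbariorum code"@en ;
            wikibase:directClaim ?directClaim .
  ?collection ?directClaim ?code .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "collection-codes-example/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["code"]["value"], row["collectionLabel"]["value"])
```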

But we will need more than technical solutions, we will also need compelling drivers to track specimen and collection use. The success of CrossRef has been due in part to the network effects inherent in the citation graph. Each publisher has a vested interest in using DOIs because other CrossRef members will include those DOIs in the list of literature cited, which means that each publisher potentially gets traffic from other members. Companies like altmetric.com (of doughnut fame) make money by selling data on the attention papers receive to publishers and academic institutions, based on tracking mentions of identifiers. Perhaps natural history collections should follow their lead and ask how they can get an equivalent system. In other words, how do we scale tools such as the Canadian Museum of Nature citation tracker across the whole network? And in particular, what services do you want, and how much would those services be worth to you?

Ottawa Ecobiomics hackathon: graph databases and Wikidata

I spent last week in Ottawa at an "Ecobiomics" hackathon organised by Joel Sachs. Essentially we spent a week exploring the application of linked data to various topics in biodiversity, with an emphasis on looking at working examples. Topics covered included:

In addition to the above I spent some of the time working on encoding GBIF specimen data in RDF with a view to adding this to Ozymandias. Having Steve Baskauf (@baskaufs) at the workshop was a great incentive to work on this, given his work with Cam Webb on Darwin-SW: Darwin Core-based terms for expressing biodiversity data as RDF.

A report is being written up which will discuss what we got up to in more detail, but one takeaway for me is the large cognitive burden that still stands in the way of widespread adoption of linked data approaches in biodiversity. Products such as Metaphactory go some way to hiding the complexity, but the overhead of linked data is high, and the benefits are perhaps less than obvious. Update: for more on this see Dan Brickley's comments on "Semantic Web Interest Group now closed".

In this context, the rise of Wikidata is perhaps the most important development. One thing we'd hoped to do but didn't get that far was to set up our own instance of Wikibase to play with (Wikibase is the software that Wikidata runs on). This is actually pretty straightforward to do if you have Docker installed; see this great post on Medium, Wikibase for Research Infrastructure — Part 1 by Matt Miller, which I stumbled across after discovering Bob DuCharme's blog post Running and querying my own Wikibase instance. Running Wikibase on your own machine (if you follow the instructions you also get the SPARQL query interface) means that you can play around with a knowledge graph without worrying about messing up Wikidata itself, or having to negotiate with the Wikidata community if you want to add new properties. It looks like a relatively painless way to discover whether knowledge graphs are appropriate for the problem you're trying to solve. I hope to find time to play with Wikibase further in the future.
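
For example, once the containers are running you can query your local instance programmatically. The endpoint URL below is an assumption on my part, so substitute whatever SPARQL endpoint your query service container actually exposes (check your docker-compose file):

```python
# Sketch: run a SPARQL query against a local Wikibase instance.
# The endpoint URL is an assumption; use whatever your own
# query service container exposes. Requires the `requests` package.
import requests

ENDPOINT = "http://localhost:8989/bigdata/namespace/wdq/sparql"  # assumption

query = """
SELECT ?item ?itemLabel WHERE {
  ?item rdfs:label ?itemLabel .
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])
```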

I'll update this blog post as the hackathon report is written.

GBIF Ebbe Nielsen Challenge update

Quick note to express my delight and surprise that my entry for the 2018 GBIF Ebbe Nielsen Challenge came in joint first! My entry was Ozymandias - a biodiversity knowledge graph, which built upon data from sources such as ALA, AFD, BioStor, CrossRef, ORCID, Wikispecies, and BLR.

I'm still tweaking Ozymandias, for example adding data on GBIF specimens (and maybe sequences from GenBank and BOLD) so that I can explore questions such as the lag time between specimen collection and the description of a species. The bigger question I'm interested in is the extent to which knowledge graphs (aka RDF) can be used to explore biodiversity data.

For details on the other entries visit the list of winners at GBIF. The other first-place winners, Lien Reyserhove, Damiano Oldoni, and Peter Desmet, have generously donated half their prize to NumFOCUS, which supports open source data science software.

This is a great way of acknowledging the debt many of us owe to developers of open source software that underpins the work of many researchers.

I hope GBIF and the wider GBIF community found this year's Challenge to be worthwhile. I'm a big fan of anything that increases GBIF's engagement with developers and data analysts, and if the Challenge runs again next year I encourage anyone with an interest in biodiversity informatics to consider taking part.