
Friday, November 10, 2017

Exploring images in the Biodiversity Literature Repository

A post on the Plazi blog, Expanded access to images in the Biodiversity Literature Repository, has prompted me to write up a little toy I created earlier this week.

The Biodiversity Literature Repository (BLR) is a repository of taxonomic papers hosted by Zenodo. Where possible Plazi have extracted individual images and added those to the BLR, even if the article itself is not open access. The justification for being able to do this is presented here: DOI:10.1101/087015. I'm not entirely convinced by their argument (see Copyright and the Use of Images as Biodiversity Data), but rather than rehash that argument I decided it would be much more fun to get a sense of what is in the BLR. I built a tool to scrape data from Zenodo and store it in CouchDB, put a simple search engine on top (using the search functionality in Cloudant) to search within the figure captions, and wrote some code to use a cloud-based image server to generate thumbnails for the images in Zenodo (some of which are quite big). The tool is hosted on Heroku; you can try it out here: https://zenodo-blr-interface.herokuapp.com/.
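For the curious, the scraping side is straightforward. Here is a minimal sketch of querying the Zenodo REST API for BLR records (I'm assuming "biosyslit" as the community identifier for the BLR, and the search term is just an example):

```python
# Minimal sketch: search the Zenodo REST API for records in the BLR.
# "biosyslit" is assumed to be the community identifier; the query
# term is just an example.
import requests

def search_blr(query, size=10):
    r = requests.get(
        "https://zenodo.org/api/records",
        params={"q": query, "communities": "biosyslit", "size": size},
    )
    r.raise_for_status()
    return r.json()["hits"]["hits"]

for record in search_blr("phylogeny"):
    print(record["metadata"]["title"])
    # each record lists its files, which is where the image URLs live
    for f in record.get("files", []):
        print("  ", f["links"]["self"])
```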

[Screenshot of the tool]

This is not going to win any design awards; I'm simply trying to get a feel for what imagery BLR has. My initial reaction was "wow!". There's a rich range of images, including phylogenies, type specimens, habitats, and more. Searching by museum codes, e.g. NHMUK, is a quick way to discover images of specimens from various collections.

[Screenshot of a search by museum code]

Based on this experiment there are at least two things I think would be fun to do.

Adding more images

BLR already has a lot of images, but the biodiversity literature is huge, and there's a wealth of imagery elsewhere, including journals not in BLR, and of course the Biodiversity Heritage Library (BHL). Extracting images from articles in BHL would potentially add a vast number of additional images.

Machine learning

Machine learning is hot right now, and anyone using iNaturalist is probably aware of their use of computer vision to suggest identifications for images you upload. It would be fascinating to apply machine learning to images in the BLR. Even basic things, such as determining whether an image is a photo or a drawing, how many specimens are included, what the specimen orientation is, what part of the organism is being displayed, or whether the image is a map (and of what country), would be useful. There's huge scope here for doing something interesting with these images.
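As a toy example of the first task, here is a rough sketch of a colour-saturation heuristic for separating photographs from line drawings (purely illustrative; the threshold is a guess, and a trained classifier would do far better):

```python
# Rough sketch: guess whether an image is a line drawing by measuring
# mean colour saturation (line art is typically close to greyscale).
# The threshold is a guess; a trained classifier would do far better.
from PIL import Image

def looks_like_drawing(path, saturation_threshold=0.08):
    img = Image.open(path).convert("HSV")
    pixels = list(img.getdata())
    mean_saturation = sum(s for _, s, _ in pixels) / (len(pixels) * 255.0)
    return mean_saturation < saturation_threshold

print("drawing" if looks_like_drawing("figure.png") else "photo")
```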

The toy I created is very basic, and merely scratches the surface of what could be done (Plazi have also created their own tool, see http://github.com/punkish/zenodeo). But spending a few minutes browsing the images is well worthwhile, and if nothing else is a reminder of both how diverse life is, and how active taxonomists are in trying to discover and describe that diversity.

Thursday, September 17, 2015

On having multiple DOI registration agencies for the same journal

On Friday I discovered that BHL has started issuing CrossRef DOIs for articles, starting with the journal Revue Suisse de Zoologie. The metadata for these articles comes from BioStor. After a WTF and WWIC moment, I tweeted about this, and something of a Twitter storm (and email storm) ensued.

To be clear, I'm very happy that BHL is finally assigning article-level DOIs, and that it is doing this via CrossRef. Readers of this blog may recall an earlier discussion about the relative merits of different types of DOIs, especially in the context of identifiers for articles. The bulk of the academic literature has DOIs issued by CrossRef, and these come with lots of nice services that make them a joy to use if you are a data aggregator, like me. There are other DOI registration agencies minting DOIs for articles, such as Airiti Library in Taiwan (e.g., doi:10.6165/tai.1998.43(2).150) and ISTIC (中文DOI) in China (e.g., doi:10.3969/j.issn.1000-7083.2014.05.020) (pro tip, if you want to find out the registration agency for a DOI, simply append it to http://doi.crossref.org/doiRA/, e.g. http://doi.crossref.org/doiRA/10.6165/tai.1998.43(2).150). These provide stable identifiers, but not the services needed to match existing bibliographic data to the corresponding DOI (as I discovered to my cost while working with IPNI).
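As an aside, that registration-agency lookup is easy to script (a quick sketch; the service returns a small JSON array):

```python
# Sketch: look up the registration agency for a DOI via CrossRef's
# doiRA service, which returns JSON like [{"DOI": "...", "RA": "..."}].
import requests

def registration_agency(doi):
    r = requests.get("https://doi.crossref.org/doiRA/" + doi)
    r.raise_for_status()
    return r.json()[0].get("RA")

print(registration_agency("10.6165/tai.1998.43(2).150"))
```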

However, now things get a little messy. From 2015, PDFs for Revue Suisse de Zoologie are being uploaded to Zenodo, and are getting DataCite DOIs there (e.g., doi:10.5281/zenodo.30012). This means that the most recent articles for this journal will not have CrossRef DOIs. From my perspective, this is a disappointing move. It removes the journal from the CrossRef ecosystem at a time when the uptake of CrossRef DOIs for taxonomic journals is at an all-time high (both ZooKeys and Zootaxa have CrossRef DOIs), and now BHL is starting to issue CrossRef DOIs for the "legacy" literature (bear in mind that "legacy" in this context can mean articles published last year).

I've rehearsed the reasons why I think CrossRef DOIs are best elsewhere, but the key points are that articles become much easier to discover (e.g., using http://search.crossref.org), and are automatically first-class citizens of the academic literature. However, not everybody buys these arguments.

Maybe a way forward is to treat the two types of DOI as identifying two different things. The CrossRef DOI identifies the article, not a particular representation. The Zenodo DOI (or any DataCite DOI) for a PDF identifies that representation (i.e., the PDF), not the article.

Having CrossRef and Zenodo (DataCite) DOIs coexist

This would enable CrossRef and Zenodo DOIs to coexist, providing we have some way of describing the relationship between the two kinds of DOI (e.g., CrossRef DOI - hasRepresentation -> Zenodo DOI).
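For example, the link could be expressed as a single RDF triple (a sketch only: "hasRepresentation" is an illustrative predicate rather than an established vocabulary term, and the CrossRef DOI here is a made-up placeholder):

```python
# Sketch: an article-to-representation link as one RDF triple.
# "hasRepresentation" is an illustrative predicate, and the CrossRef
# DOI is a hypothetical placeholder.
from rdflib import Graph, Namespace, URIRef

EX = Namespace("http://example.org/terms/")
g = Graph()
article = URIRef("https://doi.org/10.9999/example.article")  # hypothetical CrossRef DOI
pdf = URIRef("https://doi.org/10.5281/zenodo.30012")         # the Zenodo (DataCite) DOI
g.add((article, EX.hasRepresentation, pdf))
print(g.serialize(format="nt"))
```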

This would give those who want the biodiversity literature to be part of the wider CrossRef community the freedom to mint CrossRef DOIs. It gives those articles the benefits that come with CrossRef DOIs (findability, inclusion in lists of literature cited, citation statistics, customer support when DOIs break, altmetrics, etc.).

It would also enable those who want to ensure stable access to the contents of the biodiversity literature to use archives such as Zenodo, and have the benefits of those DOIs (stability, altmetrics, free file storage and free DOIs).

Having multiple DOIs for the same thing is, I'd argue, at the very least unhelpful. But if we tease apart the notion of what we are identifying, maybe they can coexist. Otherwise I think we are in danger of making choices that, while they seem locally optimal (e.g., free storage and minting of DOIs), may in the long run cause problems and run counter to the goal of making the taxonomic literature as findable as the wider literature.

Wednesday, January 28, 2015

Annotating GBIF, from datasets to nanopublications

Below I sketch what I believe is a straightforward way GBIF could tackle the issue of annotating and cleaning its data. It continues a series of posts on this topic: "Annotating GBIF: some thoughts", "Rethinking annotating biodiversity data", and "More on annotating biodiversity data: beyond sticky notes and wikis".

Let's simplify things a little and state that GBIF at present is essentially an aggregation of Darwin Core Archive files. These are for the most part simply CSV tables (spreadsheets) with some associated administrivia (AKA metadata). GBIF consumes Darwin Core Archives, does some post-processing to clean things up a little, then indexes the contents on key fields such as catalogue number, taxon name, and geographic coordinates.

What I'm proposing is that we make use of this infrastructure, in that any annotation is itself a Darwin Core Archive file that GBIF ingests (a minimal sketch of such an archive follows the list below). I envisage three typical use cases:

  1. A user downloads some GBIF data, cleans it for their purposes (e.g., by updating taxonomic names, adding some georeferencing, etc.), then uploads the edited data to GBIF as a Darwin Core Archive. This edited file gets a DOI (unless the user has got one already, say by storing the data in a digital archive like Zenodo).
  2. A user takes some GBIF data and enhances it by adding links to, for example, sequences in GenBank for which the GBIF occurrences are voucher specimens, or references which cite those occurrences. The enhanced data set is uploaded to GBIF as a Darwin Core Archive and, as above, gets a DOI.
  3. A user edits an individual GBIF record, say using an interface like this. The result is stored as a Darwin Core Archive with a single row (corresponding to the edited occurrence), and gets a DOI (this is a nanopublication, of which more later).
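To make the third case concrete, here is a rough sketch of building a single-row Darwin Core Archive (a zip containing a CSV plus a meta.xml descriptor, following the Darwin Core text guidelines); the field choices and occurrence values are illustrative:

```python
# Rough sketch: a single-row Darwin Core Archive, i.e. a zip file
# holding a one-row CSV and a meta.xml descriptor. Field choices and
# the occurrence values are illustrative.
import csv, io, zipfile

META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
  </core>
</archive>"""

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["occurrenceID", "scientificName", "decimalLatitude", "decimalLongitude"])
writer.writerow(["urn:example:occurrence:1", "Psilogramma menephron", "-17.44", "145.86"])

with zipfile.ZipFile("annotation.zip", "w") as z:
    z.writestr("occurrence.csv", buf.getvalue())
    z.writestr("meta.xml", META)
```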

Note that I'm ignoring the other type of annotation, which is to simply say "there is a problem with this record". This annotation doesn't add data, but instead flags an issue. GBIF has a mechanism for doing this already, albeit one that is deeply unsatisfactory and isn't integrated with the portal (you can't tell whether anyone has raised an issue for a record).

Note also that at this stage we've done nothing that GBIF doesn't already do, or isn't about to do (e.g., minting DOIs for datasets). Now, there is one inevitable consequence of this approach, namely that we will have more than one record for the same occurrence, the original one in GBIF, and the edited record. But, we are in this situation already. GBIF has duplicate records, lots of them.

Duplication

As an example, consider the following two occurrences for Psilogramma menephron:

occurrence | taxon | longitude | latitude | catalogue number | sequence
887386322 | Psilogramma menephron Cramer, 1780 | 145.86301 | -17.44 | BC ZSM Lep 01337 |
1009633027 | Psilogramma menephron Cramer, 1780 | 145.86 | -17.44 | KJ168695 | KJ168695

These two occurrences come from the Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data and Geographically tagged INSDC sequences data sets, respectively. They are for the same occurrence (you can verify this by looking at the metadata for the sequence KJ168695, where the specimen_voucher field is "BC ZSM Lep 01337").

What do we do about this? One approach would be to group all such occurrences into clusters that represent the same thing. We are then in a position to do some interesting things, such as compare different estimates of the same values. In the example above, there is clearly a difference in precision of geographic locality between the two datasets. There are some nice techniques available for synthesising multiple estimates of the same value (e.g., Bayesian belief networks), so we could provide for each cluster a summary of the possible values for each field. We can also use these methods to build up a picture of the reliability of different sources of annotation.
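A crude sketch of that clustering step: treat shared identifiers (catalogue numbers, voucher codes, sequence accessions) as links between records, using the two occurrences above (the dictionary layout is illustrative):

```python
# Crude sketch: flag occurrence records that share an identifier
# (catalogue number, voucher code, or sequence accession) as likely
# duplicates. Uses the two Psilogramma records from the table above.
from collections import defaultdict

records = {
    "887386322": {"catalogNumber": "BC ZSM Lep 01337"},
    "1009633027": {"catalogNumber": "KJ168695",
                   "specimen_voucher": "BC ZSM Lep 01337",
                   "associatedSequences": "KJ168695"},
}

shared = defaultdict(set)
for occurrence_id, fields in records.items():
    for value in fields.values():
        shared[value].add(occurrence_id)

# any identifier appearing in two or more records suggests a duplicate
for value, members in shared.items():
    if len(members) > 1:
        print(value, "links", sorted(members))
```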

In a sense, we can regard one record (1009633027) as adding an annotation to the other (887386322), namely adding the DNA sequence KJ168695 (in Darwin Core parlance, "associatedSequences=[KJ168695]").

But the key point here is that GBIF will have to at some point address the issue of massive duplication of data, and in doing so it will create an opportunity to solve the annotation problem as well.

Github and DOIs

In terms of practicalities, it's worth noting that we could use GitHub to manage editing GBIF data, as I've explored in GBIF and Github: fixing broken Darwin Core Archives. Although GitHub might not be ideal (there are some very cool alternatives being developed, such as dat, see also the interview with Max Ogden), it has the nice feature that you can publish a release and get a DOI via its integration with Zenodo. So people can work on datasets and create citable identifiers at the same time.

Nanopublications

If we consider that a Darwin Core Archive is basically a set of rows of data, then the minimal unit is a single row (corresponding to a single occurrence). This is the level at which some users will operate. They will see an error in GBIF and be able to edit the record (e.g., by adding georeferencing, an identification, etc.). One challenge is how to create incentives for doing this. One approach is to think in terms of nanopublications, which are:
A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
A nanopublication comprises three elements:
  1. The assertion: In this context the Darwin Core record would be the assertion. It might be a minimal record in that, say, it only listed the fields relevant to the annotation.
  2. The provenance: the evidence for the assertion. This might be the DOI of a publication that supports the annotation.
  3. The publication information: metadata for the nanopublication, including a way to cite the nanopublication (such as a DOI), and information on the author of the nanopublication. For example, the ORCID of the person annotating the GBIF record.

As an example, consider GBIF occurrence 668534424 for specimen FMNH 235034, which according to GBIF is a specimen of Rhacophorus reinwardtii. In a recent paper

Matsui, M., Shimada, T., & Sudin, A. (2013, August). A New Gliding Frog of the Genus Rhacophorus from Borneo. Current Herpetology. Herpetological Society of Japan. doi:10.5358/hsj.32.112
Matsui et al. assert that FMNH 235034 is actually Rhacophorus borneensis based on a phylogenetic analysis of a sequence (GQ204713) derived from that specimen. In which case, we could have something like this:
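A rough sketch of that nanopublication, with the three elements flattened into key-value form (the field names are illustrative rather than a formal schema, and the nanopublication's own DOI and the ORCID are placeholders):

```python
# Rough sketch of the Rhacophorus nanopublication. Field names are
# illustrative; the nanopublication DOI and the ORCID are placeholders.
nanopublication = {
    "assertion": {        # a minimal Darwin Core record with the new identification
        "institutionCode": "FMNH",
        "catalogNumber": "235034",
        "scientificName": "Rhacophorus borneensis",
        "associatedSequences": "GQ204713",
    },
    "provenance": {       # evidence supporting the assertion
        "supportedBy": "doi:10.5358/hsj.32.112",  # Matsui et al. 2013
    },
    "publicationInfo": {  # metadata about the nanopublication itself
        "id": "doi:10.5281/zenodo.XXXXX",                   # placeholder DOI
        "author": "https://orcid.org/XXXX-XXXX-XXXX-XXXX",  # annotator's ORCID
    },
}
```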

The nanopublication standard is evolving, and has a lot of RDF baggage that we'd need to simplify to fit the Darwin Core model of a flat row of data, but you could imagine a nanopublication which is a Darwin Core Archive that includes the provenance and publication information, and gets a citable identifier so that the person who created the nanopublication (in the example above, the person annotating the record) can get credit for the work involved in creating the annotation. Using citable DOIs and ORCIDs to identify the nanopublication and its author embeds the nanopublication in the wider citation graph.

Note that nanopublications are not really any different from larger datasets; indeed, we can think of a dataset of, say, 1000 rows as simply an aggregation of nanopublications. However, one difference is that I think GBIF would have to set up the infrastructure to manage the creation of nanopublications (which is basically: collect the user's input, add a user id, save, and mint a DOI). Whereas users working with large datasets may well be happy to work with those on, say, GitHub or some other data editing environment, people willing to edit single records are unlikely to want to mess with that complexity.

What about the original providers?

Under this model, the original data provider's contribution to GBIF isn't touched. If a user adds an annotation, that amounts to adding a copy of the record with some differences (corresponding to the user's edits). Now, the data provider may choose to accept those edits, in which case they can edit their own database using whatever system they have in place, and then the next time GBIF re-harvests the data, the original record in GBIF gets updated with the new data (this assumes that data providers have stable ids for their records). Under this approach we free ourselves from thinking about complicated messaging protocols between providers and aggregators, and we also free ourselves from having to wait until an edit is "approved" by a provider. Any annotation is available instantly.

Summary

My goal here is to sketch out what I think is a straightforward way to tackle annotation that makes use of what GBIF is already doing (aggregating Darwin Core Archives) or will have to do real soon now (cluster duplicates). The annotated and cleaned data can, of course, live anywhere (and I'm suggesting that it could live on GitHub and be archived on Zenodo), so people who clean and edit data are not simply doing it for the good of GBIF, they are creating data sets that can be used independently and be cited independently. Likewise, even if somebody goes to the trouble of fixing a single record in GBIF, they get a citable unit of work that will be linked to their academic profile (via ORCID).

Another aspect of this approach is that we don't actually need to wait for GBIF to do this. If we adopt Darwin Core Archive as the format for annotations, we can create annotations, mint DOIs, and build our own database of annotated data, with a view to being able to move that work to GBIF if and when GBIF is ready.

Thursday, May 15, 2014

DOIs are not enough

I had a long Twitter conversation with Terry Catapano (@catapanoth) today, and as can happen with a distracted stream of tweets, I think we were a little at cross purposes. This blog post is an attempt to unpack the debate.

What prompted the conversation was the following paper:
Emery, Carlo et al. (1899). Formiche di Madagascar raccolte dal Sig. A. Mocquerys nei pressi della Baia di Antongil (1897-1898). Bullettino della Società Entomologica Italiana, 31, 263-290. doi:10.5281/zenodo.9785
Not the paper so much, as the fact that it is stored in the Zenodo repository (which I was only looking at because of the announcement that GitHub now supports DOIs through Zenodo). Given that the PDF for Emery's paper was uploaded by the Plazi project, I wondered what the intention was in assigning a Zenodo DOI to this paper, rather than one from CrossRef.

Not all DOIs are equal


As Geoffrey Bilder notes in his post "DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?":
...some have adopted a cargo-cult practice of seeing the mere presence of a DOI on a publication as a putative sign of “citability” or “authority.”
There is a danger that we fall into the trap of thinking that all we need to do is slap a DOI on a paper and all the good things that we associate with DOIs will magically happen. This isn't the case. Not all DOIs are the same. Zenodo DOIs are provided by DataCite, and DataCite DOIs don't have all the features that CrossRef provides for their DOIs.

CrossRef provides some key services, one of the most important being discoverability. Given a bibliographic reference, CrossRef has tools that can find whether it has a DOI (e.g., http://search.crossref.org). I use this a lot to map taxonomic papers to DOIs (by a lot I mean searching for DOIs for tens of thousands of articles). Most people don't do this, but you benefit from this service every time you read an article and see the literature cited section decorated with DOIs. Publishers use CrossRef's tools to convert citations from dumb strings to useful links. This feature, which we have come to expect from any modern article, relies on CrossRef having definitive metadata for lots (millions) of articles, all of which have DOIs. When publishers submit article metadata to register their DOIs, they usually also submit lists of literature cited (and their DOIs). This means that CrossRef is building a citation database, which you can see if you visit the web page for an article and see a "cited by" link.
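To give a flavour of that matching service, here is a sketch using CrossRef's REST API (the citation string is just an example, and I'm only taking the top hit):

```python
# Sketch: match a free-text citation to a DOI using CrossRef's REST
# API. The citation string is just an example; only the top hit is used.
import requests

def match_citation(citation):
    r = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": 1},
    )
    r.raise_for_status()
    items = r.json()["message"]["items"]
    return items[0]["DOI"] if items else None

print(match_citation("Matsui 2013 A New Gliding Frog of the Genus "
                     "Rhacophorus from Borneo Current Herpetology"))
```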

Then there are additional services. Given that CrossRef has high quality bibliographic metadata for articles, if you have a DOI there is no need to type in the details of a paper. Most bibliographic software such as Mendeley and Zotero can take a DOI and flesh out those details for you. If a DOI fails to resolve, you can contact CrossRef Support and have somebody investigate. Then there are the new services such as FundRef and Prospect, which provide information on who funded a paper, and what text and data mining rights are available for a paper.

Why use DOIs?


The rationale for using DOIs for articles is so that they can be unambiguously identified, which in turn means we can build a robust citation network. But this requires infrastructure, and that is what CrossRef provides through tools like citation to DOI matching. Other DOI registration agencies don't do this, and CrossRef isn't aware of other DOIs, so putting, say, a DataCite DOI (such as those used by Zenodo) on an article doesn't achieve the primary goal of a DOI (embedding it in the citation graph of academic literature).

Hence, I regard putting a Zenodo DOI on an article as basically a wasted opportunity. If we aren't making the primary biodiversity literature discoverable, and hence linkable, then all we are doing is keeping that literature in a ghetto (and reinforcing the impression that this literature, and taxonomy itself, really doesn't matter). It is striking that if you read a recent paper that describes a new species, the bulk of the systematic or ecological literature it cites has DOIs, but the bulk of the taxonomic literature it cites does not. If it doesn't have a CrossRef DOI, it's effectively invisible. All academic literature should get first-class DOIs. Whether it's "legacy" or not is irrelevant: the Royal Society of London has DOIs on articles going back to 1800, and these are now as accessible as any paper published today.

Eyes on the prize


So, if we are going to bring the taxonomic literature into the mainstream, make it discoverable and citable, then we should focus on bringing that literature into CrossRef's infrastructure. Archives like JSTOR do it, the Biodiversity Heritage Library (BHL) does it for some of its content (and they should be doing it at article level, right now).

One response to this is to say "but doesn't this cost money?" Of course it does. Everything does; nothing is free. What frustrates me most about this is that it's the wrong question. The first question should not be "how much does this cost?" If it is, you've already lost sight of the goal. Instead, we should be asking, "What do we want? What do we need to be able to do to progress our field?" Once we articulate that, then we figure out how to pay for it. And we figure that out because we've decided this is what we need.

I think we want discoverable, citable taxonomic literature, embedded in the rest of the scientific literature and the publishing process. We don't get that by simply buying the cheapest DOIs available and slapping them on articles. To do so is to fundamentally misunderstand why DOIs matter, and to ignore the role that infrastructure plays in their success in academic publishing.