Thursday, May 15, 2014

DOIs are not enough

I had a long Twitter conversation with Terry Catapano (@catapanoth) today, and as can happen with a distracted stream of tweets, I think we were a little at cross purposes. This blog post is an attempt to unpack the debate.

What prompted the conversation was the following paper:
Emery, Carlo et al. (1899). Formiche di Madagascar raccolte dal Sig. A. Mocquerys nei pressi della Baia di Antongil (1897-1898). Bullettino della Società Entomologica Italiana: 31 (1899) pp. 263-290. 10.5281/zenodo.9785
Not the paper so much as the fact that it is stored on the Zenodo repository (which I was only looking at because of the announcement that GitHub now supports DOIs through Zenodo). Given that the PDF for Emery's paper was uploaded by the Plazi project, I wondered what the intention was in assigning a Zenodo DOI to this paper, rather than one from CrossRef.

Not all DOIs are equal

As Geoffrey Bilder notes in his post "DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?":
...some have adopted a cargo-cult practice of seeing the mere presence of a DOI on a publication as a putative sign of “citability” or “authority.”
There is a danger that we fall into the trap of thinking that all we need to do is slap a DOI on a paper and all the good things that we associate with DOIs will magically happen. This isn't the case. Not all DOIs are the same. Zenodo DOIs are provided by DataCite, and DataCite DOIs don't have all the features that CrossRef provides for its DOIs.

CrossRef provides some key services, and one of the most important is discoverability. Given a bibliographic reference, CrossRef has tools that can find whether it has a DOI (I use this a lot to map taxonomic papers to DOIs, and by "a lot" I mean searching for DOIs for tens of thousands of articles). Most people don't do this, but you benefit from this service every time you read an article and see the literature cited section decorated with DOIs. Publishers use CrossRef's tools to convert citations from dumb strings to useful links. This feature, which we have come to expect from any modern article, relies on CrossRef having definitive metadata for lots (millions) of articles, all of which have DOIs. When publishers submit article metadata when registering their DOIs, they usually submit lists of literature cited (and the DOIs). This means that CrossRef is building a citation database, which you can see if you visit the web page for an article and see a "cited by" link.
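To make the citation matching concrete, here is a minimal sketch of how a citation string could be looked up using CrossRef's public REST API (the `works` endpoint with its `query.bibliographic` parameter). The function names and the score threshold are assumptions for illustration, not the workflow used for the mapping described above, and the sketch only builds the query and reads a response that has already been parsed from JSON, so it makes no network call itself.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=3):
    """Return the CrossRef search URL for a free-text citation string."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def best_match(response, min_score=60.0):
    """Pick the DOI of the highest-scoring item in a parsed API response.

    min_score is an arbitrary, hypothetical cut-off; a real mapping job
    would need to tune it and compare the returned metadata as well.
    """
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    top = max(items, key=lambda item: item.get("score", 0.0))
    if top.get("score", 0.0) < min_score:
        return None
    return top.get("DOI")


if __name__ == "__main__":
    print(build_query_url("Emery 1899 Formiche di Madagascar"))
```

A relevance score alone is a weak signal, so in practice you would also want to check that the year, volume and pages of the candidate agree with the citation before accepting the DOI.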

Then there are additional services. Given that CrossRef has high quality bibliographic metadata for articles, if you have a DOI there is no need to type in the details of a paper. Most bibliographic software such as Mendeley and Zotero can take a DOI and flesh out those details for you. If a DOI fails to resolve, you can contact CrossRef Support and have somebody investigate. Then there are the new services such as FundRef and Prospect, which provide information on who funded a paper, and what text and data mining rights are available for a paper.
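As a rough illustration of what a reference manager does with a DOI, the sketch below prepares a request to the DOI resolver asking for CSL JSON via content negotiation, and turns such a record into a short citation string. The helper names are hypothetical, the request is only constructed (not sent), and the formatting is deliberately simplistic.

```python
from urllib.parse import quote
from urllib.request import Request

CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi):
    """Build (but do not send) a content-negotiation request for a DOI."""
    url = "https://doi.org/" + quote(doi, safe="/")
    return Request(url, headers={"Accept": CSL_JSON})


def short_citation(record):
    """Format a CSL JSON record as 'Family (Year). Title. Journal'."""
    authors = record.get("author", [])
    first = authors[0].get("family", "Anonymous") if authors else "Anonymous"
    if len(authors) > 1:
        first += " et al."
    date_parts = record.get("issued", {}).get("date-parts", [[None]])
    year = date_parts[0][0] if date_parts and date_parts[0] else None
    title = record.get("title", "")
    journal = record.get("container-title", "")
    year_text = str(year) if year is not None else "n.d."
    return f"{first} ({year_text}). {title}. {journal}".strip()


if __name__ == "__main__":
    request = metadata_request("10.5281/zenodo.9785")
    print(request.full_url, request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` and decoding the body with `json.loads` would give a record that `short_citation` can format; how complete that record is depends on what the registration agency holds for the DOI.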

Why use DOIs?

The rationale for using DOIs for articles is so that they can be unambiguously identified, which in turn means we can build a robust citation network. But this requires infrastructure, and that is what CrossRef provides through tools like citation to DOI matching. Other DOI registration agencies don't do this, and CrossRef isn't aware of other DOIs, so putting, say, a DataCite DOI (such as those used by Zenodo) on an article doesn't achieve the primary goal of a DOI (embedding it in the citation graph of academic literature).

Hence, I regard putting a Zenodo DOI on an article as basically a wasted opportunity. If we aren't making the primary biodiversity literature discoverable, and hence linkable, then all we are doing is keeping that literature in a ghetto (and reinforcing the impression that this literature, and taxonomy itself, really doesn't matter). It is striking that if you read a recent paper that describes a new species, the bulk of the systematic or ecological literature has DOIs, but the bulk of the taxonomic literature does not. If it doesn't have a CrossRef DOI, it's effectively invisible. All academic literature should get first class DOIs. Whether it's "legacy" or not is irrelevant: the Royal Society of London has DOIs on articles going back to 1800, and these are now as accessible as any paper published today.

Eyes on the prize

So, if we are going to bring the taxonomic literature into the mainstream, and make it discoverable and citable, then we should focus on bringing that literature into CrossRef's infrastructure. Archives like JSTOR do it, and the Biodiversity Heritage Library (BHL) does it for some of its content (and they should be doing it at article level, right now).

One response to this is to say "but doesn't this cost money?" Of course it does. Everything does, nothing is free. What frustrates me most about this is that it's the wrong question. The first question should not be "how much does this cost?". If it is, you've already lost sight of the goal. Instead, we should be asking, "What do we want? What do we need to be able to do to progress our field?". Once we articulate that, then we figure out how to pay for it. And we figure that out because we've decided this is what we need.

I think we want discoverable, citable taxonomic literature, embedded in the rest of the scientific literature and the publishing process. We don't get that by simply buying the cheapest DOIs available and slapping them on articles. To do so is to fundamentally misunderstand why DOIs matter, and to ignore the role that infrastructure plays in their success in academic publishing.

Tuesday, May 06, 2014

Very large phylogeny viewer

As announced on phylobabble, I've started to revisit visualising large phylogenies, building on some work I did a couple of years ago (my, how time flies). This time there is actual code, as well as a live demo.

You can see the amphibian tree below at


You can upload or paste a tree (for now in NEXUS format), or paste in a URL to a NEXUS file (e.g., from TreeBASE). I'll add more formats when I get the chance. The viewer uses the same approach as Google Maps, breaking the image of the tree into "tiles" of a fixed size, so even if the tree image is huge, the web browser only ever displays the same number of tiles. You can zoom in to see individual taxa, or zoom out for an overview. One reason I'm building this is to display DNA barcoding trees alongside the million DNA barcode map.
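The tiling idea can be sketched in a few lines. This is not the viewer's actual code, just an illustration under assumed values (a 256 pixel tile size, and a viewport given in pixel coordinates of the full tree image at the current zoom level) of why the number of tiles the browser has to show depends on the size of the window rather than the size of the tree.

```python
import math

TILE_SIZE = 256  # assumed tile edge in pixels; the real viewer may differ


def tile_grid(image_width, image_height, tile_size=TILE_SIZE):
    """Number of tile columns and rows needed to cover the whole image."""
    return (math.ceil(image_width / tile_size),
            math.ceil(image_height / tile_size))


def visible_tiles(view_x, view_y, view_width, view_height,
                  image_width, image_height, tile_size=TILE_SIZE):
    """List the (column, row) tiles that intersect the viewport.

    All arguments are integers in pixels of the full-size tree image.
    """
    cols, rows = tile_grid(image_width, image_height, tile_size)
    first_col = max(0, view_x // tile_size)
    last_col = min(cols - 1, (view_x + view_width - 1) // tile_size)
    first_row = max(0, view_y // tile_size)
    last_row = min(rows - 1, (view_y + view_height - 1) // tile_size)
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


if __name__ == "__main__":
    # A tall, narrow tree image viewed through an 800 x 600 window.
    tiles = visible_tiles(0, 500000, 800, 600, 1000, 1000000)
    print(len(tiles), "tiles needed out of", tile_grid(1000, 1000000))
```

Scrolling changes which tiles are requested, but the count stays bounded by the window size (it can vary by a row or column depending on how the window lines up with tile boundaries).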

As ever this is a crude first attempt, but feel free to try it and let me know how you get on.