Friday, July 17, 2020

Persistent Identifiers: A demo and a rant

This morning, as part of a webinar on persistent identifiers, I gave a live demo of a little toy that links museum and herbarium specimens with the publications that use those specimens. A video of an earlier run-through of the demo appears below. For background on this demo see Diddling with semantic data: linking natural history collections to the scientific literature. The slides I used in this demo are available here: http://pid-demonstrator.herokuapp.com/demo/.

One thing which struck me during this webinar is that discussions about persistent identifiers (PIDs, also called "GUIDs") seem to endlessly cycle through the same topics. It is as if each community has to reinvent everything and rehash the same debates before it reaches a solution. An alternative is to try to learn from systems that work.

In this respect CrossRef is a great example:

  1. They have (mostly) brand neutral, actionable identifiers that resolve to something of value (DOIs resolve to an article you can read).
  2. They are persistent by virtue of redirection (a DOI is a pointer to a pointer to the thing, a bit like a P.O. Box number). The identifiers are managed.
  3. They have machine readable identifiers that can be used to support an ecosystem of services (e.g., most reference managers just need you to type in a DOI and they do the rest, altmetric “donuts”, etc.).
  4. They have tools for discoverability, in other words, if you have the metadata for an article they can tell you what the corresponding DOI is.
  5. The identifiers deliver something of value to publishers that use them, such as the citation graph of interconnected articles (which means your article is automatically linked to other articles) and real time metrics of use (e.g., DOIs being cited in Wikipedia articles).
  6. There is a strong incentive for publishers to use other publishers' identifiers (DOIs) in their own content because if that is reciprocated then you get web traffic (and citation counts). If publishers use their own local identifiers for external content they lose out on these benefits.
  7. There are services to handle cases when things break. If a DOI doesn’t resolve, you can talk to a human who will attempt to fix it.
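
To make points 1, 2 and 4 concrete, here is a minimal Python sketch. It is not how CrossRef or the DOI resolver is actually implemented. The helper names (doi_to_url, crossref_lookup_url, ToyResolver), the DOI 10.9999/example.1 and the two publisher URLs are all hypothetical. The lookup function only builds a query URL for the public Crossref REST API and does not send it; the parameter names are used on the assumption that they match that API's documentation.

```python
from urllib.parse import quote, urlencode

# Prefixes people commonly paste in front of a bare DOI.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(raw: str) -> str:
    """Turn a DOI in any common written form into an actionable HTTPS URL."""
    doi = raw.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {raw!r}")
    return "https://doi.org/" + quote(doi, safe="/")


def crossref_lookup_url(citation: str) -> str:
    """Build (but do not send) a metadata-to-DOI query for a free-text citation."""
    # Assumption: the Crossref REST API accepts these parameter names.
    return "https://api.crossref.org/works?" + urlencode(
        {"query.bibliographic": citation, "rows": 1}
    )


class ToyResolver:
    """Toy model of redirection: the identifier stays fixed, its target can move."""

    def __init__(self) -> None:
        self._targets = {}

    def register(self, doi: str, target: str) -> None:
        # Registering the same DOI again simply updates where it points.
        self._targets[doi] = target

    def resolve(self, doi: str) -> str:
        return self._targets[doi]


resolver = ToyResolver()
resolver.register("10.9999/example.1", "https://old-publisher.example/article/1")
# The article moves; only the resolver's record changes, not the identifier.
resolver.register("10.9999/example.1", "https://new-publisher.example/papers/1")

print(doi_to_url("doi:10.9999/example.1"))
print(resolver.resolve("10.9999/example.1"))
```

The point of the toy resolver is that anyone who stored the DOI never has to update anything when the target moves. That is the "managed" part of point 2, and it is someone's job to keep that record current.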

My point is that there is an entire ecosystem built around DOIs, and it works. Typically every community considering persistent identifiers attempts to build something themselves, ideally for “free”, and ends up with a mess, or a system that doesn’t provide the expected benefits, because the message they got was “we need to get PIDs” rather than “build an infrastructure to enable these things that we need”.

I think we can also learn from systems that failed. In biology, LSIDs failed for a bunch of reasons, mainly because they didn’t resolve to anything useful (they were only machine readable). Also they were free, which seems great, except it means there is no cost to giving them up (which is exactly what people did). Every time someone advocates a PID that is free they are advocating a system that is designed to fail. Minting a UUID for everything costs nothing and is worth nothing. If you think a URL to a website in your organisation's domain is a persistent identifier, just think what happened to all those HTTP URLs stored in databases when, post Edward Snowden, the web switched to HTTPS.
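
The HTTPS switch is easy to illustrate. In the Python sketch below, the specimen URLs and the scheme_free_key helper are hypothetical. It shows why a URL stored as a plain string stops matching once the scheme changes, and one possible way to compare URLs while ignoring the scheme. This is a workaround for matching old records, not a substitute for a managed identifier.

```python
from urllib.parse import urlsplit


def scheme_free_key(url: str) -> tuple:
    """Comparison key that ignores http vs https (and host letter case)."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path, parts.query)


# Hypothetical example: the same specimen page before and after the switch.
stored = "http://museum.example/specimen/123"
current = "https://museum.example/specimen/123"

print(stored == current)  # False
print(scheme_free_key(stored) == scheme_free_key(current))  # True
```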

One issue which came up in the webinar was the status of ISBNs, which aren't actionable PIDs (in the sense that there's no obvious way to stick one in a web browser and get something back). ISBNs have failed to migrate to the web because, I suspect, they are commercially valuable, which means a fight over who exploits that value. Whoever provides the global resolver for ISBNs then gets to influence where you buy the corresponding book. Book publishers were slow to sell online, so Amazon gobbled up the online market, and in effect the de facto global resolver (Amazon) makes all the money. The British Library got into trouble for exactly this reason when they provided links to Amazon (see British Library sparks Amazon row). Furthermore, unlike DOIs for the scholarly literature, there aren’t really any network effects for ISBNs: publishers don’t benefit from other publishers having their content online. So I think ISBNs are an interesting example of the economics of identifiers, and the challenging problem of identifiers for things that are not specific to one organisation. It's much easier to think about identifiers for your stuff because you control that stuff and how it is represented. But who gets to decide on the PID for, say, Homo sapiens?

So, while we navigate the identifier acronym soup (DOI, LSID, ORCID, URL, URI, IRI, PURL, ARK, UUID, ISBN, Handles) and rehash arguments that multiple communities have been having for decades, maybe it's a good time to pause, take a look at other communities, and see what has worked and what hasn't, and why. It may well be that the kinds of drivers that make CrossRef a success (identifiers return something of value, network effects as the citation graph grows, etc.) do not exist in many heritage situations, but that in itself would be useful to know, and might help explain why we have been sluggish in adopting persistent identifiers.