Monday, September 12, 2011

Duplicate DOIs for the same article: alias DOIs, who knew?

As part of a project to map taxonomic citations to bibliographic identifiers I'm tackling strings like this (from the ION record urn:lsid:organismnames.com:name:1405511 for Pseudomyrmex crudelis):

<tdwg_co:PublishedIn>
Systematics, biogeography and host plant associations of the Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and Tachigali-inhabiting ants. Zoological Journal of the Linnean Society, 126(4), August 1999: 451-540. 516 [Zoological Record Volume 136]
</tdwg_co:PublishedIn>

I parse the string into its components (e.g., journal, volume, issue, pagination) and use scripts to locate identifiers such as DOIs. I regard DOIs as the gold standard for bibliographic identifiers. They are (usually) unique, and CrossRef provides some really useful services to support them (DOIs now also support linked data, if you are into that sort of thing). Occasionally there are problems, such as duplicate DOIs when material moves from a publisher's site to, say, JSTOR. And some publishers are really, really bad about releasing DOIs that don't resolve. For example, Taylor & Francis Online have at least 18,000 DOIs for the Annals and Magazine of Natural History that don't resolve (e.g., doi:10.1080/00222933809512318 for this paper).
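To give a flavour of the parsing step, here is a minimal sketch in Python that pulls the components out of the example string above. The regular expression, the function name parse_citation and the field names are all assumptions made for illustration; they are not the scripts actually used in this project, and a single pattern like this will certainly not cope with every citation format found in ION.

```python
import re

# Hypothetical pattern for citations shaped like:
#   Title. Journal, volume(issue), Month year: start-end. ...
CITATION = re.compile(
    r"^(?P<title>.+?)\.\s+"                               # title ends at the first full stop
    r"(?P<journal>[^.]+?),\s+"                            # journal name, up to the comma before the volume
    r"(?P<volume>\d+)\((?P<issue>[^)]+)\),\s+"            # e.g. 126(4),
    r"(?:(?P<month>[A-Za-z]+)\s+)?(?P<year>\d{4}):\s+"    # optional month, then year
    r"(?P<spage>\d+)-(?P<epage>\d+)"                      # page range
)


def parse_citation(text):
    """Return a dict of citation components, or None if the pattern does not match."""
    match = CITATION.match(text.strip())
    return match.groupdict() if match else None


example = (
    "Systematics, biogeography and host plant associations of the "
    "Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and "
    "Tachigali-inhabiting ants. Zoological Journal of the Linnean Society, "
    "126(4), August 1999: 451-540. 516 [Zoological Record Volume 136]"
)

parts = parse_citation(example)
print(parts["journal"], parts["volume"], parts["issue"], parts["spage"])
```

The journal, volume and starting page are typically what a DOI lookup needs, so those are the fields worth getting right first.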

Sometimes my automated scripts for finding DOIs fail and I have to resort to Googling. To my surprise, I found two versions of the paper "Systematics, biogeography and host plant associations of the Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and Tachigali-inhabiting ants", each with a different DOI: doi:10.1006/zjls.1998.0158 (Elsevier) and doi:10.1111/j.1096-3642.1999.tb00157.x (Wiley).


Now, this isn't supposed to happen. Interestingly, if you resolve doi:10.1006/zjls.1998.0158, either on the web or using CrossRef's OpenURL resolver, you get the page/metadata for doi:10.1111/j.1096-3642.1999.tb00157.x.

To see what was going on I fired up my local installation of Tony Hammond's OpenHandle tool (see http://bioguid.info/openhandle/), entered the Elsevier DOI (10.1006/zjls.1998.0158), and got this:


{
    "comment": "OpenHandle (JSON) - see http://code.google.com/p/openhandle/",
    "handle": "hdl:10.1006/zjls.1998.0158",
    "handleStatus": {
        "code": "1",
        "message": "SUCCESS"
    },
    "handleValues": [
        {
            "index": "100",
            "type": "HS_ADMIN",
            "data": {
                "adminRef": "hdl:10.1006/zjls.1998.0158?index=100",
                "adminPermission": "111111110111"
            },
            "permission": "1110",
            "ttl": "+86400",
            "timestamp": "Thu Apr 13 19:09:03 BST 2000",
            "reference": []
        },
        {
            "index": "1",
            "type": "URL",
            "data": "http://linkinghub.elsevier.com/retrieve/pii/S0024408298901583",
            "permission": "1110",
            "ttl": "+86400",
            "timestamp": "Tue Aug 12 16:43:12 BST 2003",
            "reference": []
        },
        {
            "index": "700050",
            "type": "700050",
            "data": "20030811104844000",
            "permission": "1110",
            "ttl": "+86400",
            "timestamp": "Tue Aug 12 16:43:16 BST 2003",
            "reference": []
        },
        {
            "index": "1970",
            "type": "HS_ALIAS",
            "data": "10.1111/j.1096-3642.1999.tb00157.x",
            "permission": "1110",
            "ttl": "+86400",
            "timestamp": "Mon Aug 25 21:06:50 BST 2008",
            "reference": []
        }
    ]
}

The interesting bit is the "HS_ALIAS" at the bottom. I'd not come across this before, although it's in the spec (RFC 3651) for all to see (yeah, but who reads those?). The handle system that underlies DOIs has a mechanism to support aliases, so that a DOI that originally pointed to a web page (say, for an article) can be redirected to point to another DOI. In this case, the Elsevier DOI redirects to the Wiley DOI ("10.1111/j.1096-3642.1999.tb00157.x" in the HS_ALIAS section), so the user ends up at Wiley's page for this article, not Elsevier's. This provides a way to accommodate changes in article ownership, without requiring the new publisher to reuse the previous publisher's DOI.
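Spotting the alias in a record like the one above is easy to automate. The sketch below assumes the record has already been fetched and decoded into a Python dict with the same shape as the OpenHandle JSON shown here ("handleValues" holding a list of entries with "type" and "data"); the function name find_alias is made up for this example, and other handle clients may well return a different structure.

```python
import json


def find_alias(handle_record):
    """Return the target of the first HS_ALIAS value in the record, or None."""
    for value in handle_record.get("handleValues", []):
        if value.get("type") == "HS_ALIAS":
            return value.get("data")
    return None


# Trimmed copy of the record above, keeping only the fields the function reads.
record = json.loads("""
{
    "handle": "hdl:10.1006/zjls.1998.0158",
    "handleValues": [
        {"index": "1", "type": "URL",
         "data": "http://linkinghub.elsevier.com/retrieve/pii/S0024408298901583"},
        {"index": "1970", "type": "HS_ALIAS",
         "data": "10.1111/j.1096-3642.1999.tb00157.x"}
    ]
}
""")

alias = find_alias(record)
print(alias)
```

A record with no HS_ALIAS entry simply yields None, which is the common case.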

In one sense this seems to defeat the point of DOIs, namely that they are effectively opaque identifiers that any publisher should be able to host. Perhaps in this case the issue is that the DOI prefix ("10.1006" and "10.1111" for Elsevier and Wiley, respectively) corresponds to a publisher, and when something goes wrong with a DOI it's easier to identify who is responsible based on this prefix, rather than the individual DOI.

In any event, next time I come across a duplicate DOI I'll need to check whether it is an alias of another DOI before launching into another rant about the (occasional) failings of DOIs.
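For what it's worth, that check could be sketched along the following lines. Everything here is hypothetical: lookup_alias stands in for whatever function fetches a handle record and returns its HS_ALIAS target (or None), and the hop limit of 5 is an arbitrary guard against alias loops rather than anything taken from the handle spec. The one thing I am relying on is that DOIs are case-insensitive, hence the lower-casing.

```python
def normalise(doi):
    """Lower-case a DOI and strip a leading 'doi:' or 'hdl:' prefix."""
    doi = doi.strip().lower()
    for prefix in ("doi:", "hdl:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def canonical(doi, lookup_alias, max_hops=5):
    """Follow aliases from doi until a DOI with no alias is reached."""
    current = normalise(doi)
    seen = {current}
    for _ in range(max_hops):
        target = lookup_alias(current)
        if not target:
            return current          # no alias: this is the canonical DOI
        target = normalise(target)
        if target in seen:
            return current          # alias loop: give up where we are
        seen.add(target)
        current = target
    return current                  # hop limit reached


def same_article(doi_a, doi_b, lookup_alias):
    """True if both DOIs end up at the same DOI once aliases are followed."""
    return canonical(doi_a, lookup_alias) == canonical(doi_b, lookup_alias)


# Stand-in for a real handle lookup, using the one alias found in this post.
known_aliases = {"10.1006/zjls.1998.0158": "10.1111/j.1096-3642.1999.tb00157.x"}
lookup = known_aliases.get

print(same_article("doi:10.1006/zjls.1998.0158",
                   "doi:10.1111/j.1096-3642.1999.tb00157.x", lookup))
```

Passing the lookup in as a function keeps the comparison logic separate from however the handle records are actually retrieved.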