I'm not sure who said it first, but there's a librarianly spin on the old Perl paradigm I think I heard at code4libcon or Access in the past year: instead of "making simple things simple, and complex things possible," we librarians and those of us librarians who write standards tend, in writing our standards, to "make complex things possible, and make simple things complex."
That approach just won't cut it anymore.
-Dan Chudnov, Rethinking OpenURL
Time to bring some threads together. I've been working on a tool to parse references and find existing identifiers. The tool is at http://bioguid.info/references (for more on my bioGUID project see the blog). Basically, you paste in one or more references, and it tries to figure out what they are, using ParaTools and CrossRef's OpenURL resolver. For example, if you paste in this reference:
Vogel, B. R. 2004. A review of the spider genera Pardosa and Acantholycosa (Araneae, Lycosidae) of the 48 contiguous United States. J. Arachnol. 32: 55-108.
the service tells you that there is a DOI (doi:10.1636/H03-8).
OK, but what if there is no DOI? Every issue of the Journal of Arachnology is online, but only issues from 2000 onwards have DOIs (hosted by my favourite DOI breaker, BioOne). How do I link to the other articles?
One way is using OpenURL. What I've done is add an OpenURL service to bioGUID. If you send it a DOI, it simply redirects you to dx.doi.org to resolve it. But I've started to expand it to handle papers that I know have no DOI. First up is the Journal of Arachnology. I used SiteSucker to pull all the HTML files listing the PDFs from the journal's web site. Then I ran a Perl script that read each HTML file and pulled out the links. They weren't terribly consistent (there are at least five or six different ways the links are written), but they are consistent enough to parse. What is especially nice is that the URLs include information on volume and starting page number, which greatly simplifies my task. So, this gives me a list of over 1000 papers, each with a URL, and for each paper I have the journal, year, volume, and starting page. These four things are enough for me to uniquely identify the article. I then store all this information in a MySQL database, and when a user clicks on the OpenURL link in the list of results from the reference parser, if the journal is the Journal of Arachnology, they go straight to the PDF. Here's one to try.
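To make the lookup concrete, here is a minimal sketch in Python (not the code bioGUID actually runs). It assumes OpenURL-style query keys `id`, `issn`, `date`, `volume` and `spage`, uses an in-memory dictionary in place of the MySQL table, and the PDF URL in it is a made-up placeholder.

```python
from urllib.parse import parse_qs

# Stand-in for the database table: (ISSN, year, volume, start page) -> PDF URL.
# The URL below is a hypothetical placeholder, not a real Journal of Arachnology link.
ARTICLES = {
    ("0161-8202", "1988", "16", "47"): "http://example.org/pdfs/v16_p47.pdf",
}


def resolve(query_string):
    """Return the URL to redirect to, or None if the article is unknown."""
    # Keep only the first value for each query key.
    params = {key: values[0] for key, values in parse_qs(query_string).items()}

    # If a DOI was supplied, hand it straight to the DOI resolver.
    identifier = params.get("id", "")
    if identifier.startswith("doi:"):
        return "http://dx.doi.org/" + identifier[len("doi:"):]

    # Otherwise identify the article by journal, year, volume and start page.
    key = (
        params.get("issn"),
        params.get("date", "")[:4],  # the year is the first four characters
        params.get("volume"),
        params.get("spage"),
    )
    return ARTICLES.get(key)


print(resolve("id=doi:10.1636/H03-8"))
print(resolve("issn=0161-8202&date=1988&volume=16&spage=47"))
```

A request that matches neither case returns `None`, at which point a real resolver would presumably fall back to something like CrossRef.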
Yeah, but what else can we do with this? Well, for one thing, you can use the bioGUID OpenURL service in Connotea. On the Advanced settings page you can set an OpenURL resolver. By default I use CrossRef, but if you put "http://bioguid.info/openurl.php" as the Resolver URL, you will be able to get full text for the Journal of Arachnology (providing that you've entered sufficient bibliographic details when saving the reference).
But I think the next step is to have a GUID for each paper, and in the absence of a DOI I'm in favour of SICIs (see my del.icio.us bookmarks for some background). For example, the paper above has the SICI 0161-8202(1988)16<47>2.0.CO;2-0. If this were a resolvable identifier, then we would have unique, stable identifiers for Journal of Arachnology papers that resolve to PDFs. Anybody making links between, say, a scientific name and the publication where it first appeared (e.g., the Catalogue of Life) could use the SICI as the GUID for the publication.
I need to play more with SICIs (and when I get the chance I'll write a post about what the different bits of a SICI mean). For now, before I forget, I'll note that while writing code to generate SICIs for the Journal of Arachnology I found a bug in the Perl module Algorithm-CheckDigits-0.44, which I use to compute the checksum for a SICI (the final character in the SICI). The checksum is based on a sum of the characters in the SICI modulo 37, but the code barfs if the sum is exactly divisible by 37 (i.e., the remainder is zero).
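Here is a rough Python sketch of the checksum, including the zero-remainder case. It is my reading of the scheme rather than a transcription of the standard or of the Perl module. It assumes that digits count as 0-9, letters A-Z as 10-35 and every other character as 36, and that characters are weighted 3 and 1 alternately, starting from the right. The "2.0.CO;2-" suffix is simply copied from the example SICI above, and the function names are my own.

```python
import string

# Characters for values 0..36: digits, then A-Z, then '#' for 36.
ALPHABET = string.digits + string.ascii_uppercase + "#"


def char_value(ch):
    """Map a character to its numeric value; anything unrecognised counts as 36."""
    index = ALPHABET.find(ch.upper())
    return index if index != -1 else 36


def sici_check_char(body):
    """Compute the check character for a SICI given without its final character."""
    total = 0
    # Assumed weighting: 3 for the 1st, 3rd, 5th ... character counted from the right.
    for position, ch in enumerate(reversed(body), start=1):
        weight = 3 if position % 2 == 1 else 1
        total += weight * char_value(ch)
    remainder = total % 37
    # The outer "% 37" handles a remainder of zero: 37 - 0 = 37 has no
    # character of its own, so it wraps around to '0'.
    return ALPHABET[(37 - remainder) % 37]


def build_sici(issn, year, volume, start_page):
    """Assemble a SICI in the same shape as the example in this post."""
    body = f"{issn}({year}){volume}<{start_page}>2.0.CO;2-"
    return body + sici_check_char(body)


print(build_sici("0161-8202", 1988, 16, 47))  # 0161-8202(1988)16<47>2.0.CO;2-0
```

Under these assumptions the example SICI is itself a zero-remainder case: the weighted sum comes to 814, which is 22 × 37, so the check character wraps around to 0.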