Monday, May 28, 2007

OpenURL and spiders

I'm not sure who said it first, but there's a librarianly spin on the old Perl paradigm I think I heard at code4libcon or Access in the past year: instead of "making simple things simple, and complex things possible," we librarians and those of us librarians who write standards tend, in writing our standards, to "make complex things possible, and make simple things complex."
That approach just won't cut it anymore.
-Dan Chudnov, Rethinking OpenURL

Time to bring some threads together. I've been working on a tool to parse references and find existing identifiers. The tool is at (for more on my bioGUID project see the blog). Basically, you paste in one or more references, and it tries to figure out what they are, using ParaTools and CrossRef's OpenURL resolver. For example, if you paste in this reference:
Vogel, B. R. 2004. A review of the spider genera Pardosa and Acantholycosa (Araneae, Lycosidae) of the 48 contiguous United States. J. Arachnol. 32: 55-108.

the service tells you that there is a DOI (doi:10.1636/H03-8).

OK, but what if there is no DOI? Every issue of the Journal of Arachnology is online, but only issues from 2000 onwards have DOIs (hosted by my favourite DOI breaker, BioOne). How do I link to the other articles?

One way is using OpenURL. What I've done is add an OpenURL service to bioGUID. If you send it a DOI, it simply redirects you so that the DOI gets resolved. But I've started to expand it to handle papers that I know have no DOI. First up is the Journal of Arachnology. I used SiteSucker to pull all the HTML files listing the PDFs from the journal's web site. Then I ran a Perl script that read each HTML file and pulled out the links. These weren't terribly consistent (there are at least five or six different ways the links are written), but they are consistent enough to parse. What is especially nice is that the URLs include information on volume and starting page number, which greatly simplifies my task. So, this gives me a list of over 1000 papers, each with a URL, and for each paper I have the journal, year, volume, and starting page. These four things are enough for me to uniquely identify the article. I then store all this information in a MySQL database, and when a user clicks on the OpenURL link in the list of results from the reference parser, if the journal is the Journal of Arachnology, they go straight to the PDF. Here's one to try.
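
A minimal sketch of that lookup, in Python with an in-memory SQLite table standing in for the Perl and MySQL used here. The table layout, the function names, and the example.org URL are all made up for illustration; the query keys (title, date, volume, spage) are the usual OpenURL ones, and the sample row uses the year, volume, and starting page from the SICI mentioned further down.

```python
import sqlite3
from urllib.parse import parse_qs


def build_index(rows):
    """Load (journal, year, volume, spage, url) rows into an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE article (journal TEXT, year INTEGER, volume TEXT, "
        "spage TEXT, url TEXT, PRIMARY KEY (journal, year, volume, spage))"
    )
    db.executemany("INSERT INTO article VALUES (?, ?, ?, ?, ?)", rows)
    return db


def resolve(db, query_string):
    """Return the stored URL for an OpenURL-style query string, or None."""
    params = {key: values[0] for key, values in parse_qs(query_string).items()}
    try:
        # Journal, year, volume, and starting page together identify the article.
        lookup = (
            params["title"],
            int(params["date"][:4]),
            params["volume"],
            params["spage"],
        )
    except (KeyError, ValueError):
        return None  # not enough bibliographic detail to identify a paper
    row = db.execute(
        "SELECT url FROM article "
        "WHERE journal = ? AND year = ? AND volume = ? AND spage = ?",
        lookup,
    ).fetchone()
    return row[0] if row else None


# Placeholder row: the URL is hypothetical, not the journal's real link.
db = build_index(
    [("Journal of Arachnology", 1988, "16", "47", "http://example.org/JoA_v16_p47.pdf")]
)
print(resolve(db, "genre=article&title=Journal+of+Arachnology&date=1988&volume=16&spage=47"))
```

A real service would redirect to the returned URL, and fall back to another resolver when the lookup returns nothing.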

Yeah, but what else can we do with this? Well, for one thing, you can use the bioGUID OpenURL service in Connotea. On the Advanced settings page you can set an OpenURL resolver. By default I use CrossRef, but if you put "" as the Resolver URL, you will be able to get full text for the Journal of Arachnology (provided that you've entered sufficient bibliographic details when saving the reference).

But I think the next step is to have a GUID for each paper, and in the absence of a DOI I'm in favour of SICIs (see my bookmarks for some background). For example, the paper above has the SICI 0161-8202(1988)16<47>2.0.CO;2-0. If this were a resolvable identifier, then we would have unique, stable identifiers for Journal of Arachnology papers that resolve to PDFs. Anybody making links between, say, a scientific name and when it was published (e.g., Catalogue of Life) could use the SICI as the GUID for the publication.

I need to play more with SICIs (and when I get the chance I'll write a post about what the different bits of a SICI mean), but for now, before I forget, I'll note that while writing code to generate SICIs for the Journal of Arachnology I found a bug in the Perl module Algorithm-CheckDigits-0.44 that I use to compute the checksum for a SICI (the final character in the SICI). The checksum is based on a sum of the characters in the SICI modulo 37, but the code barfs if the sum is exactly divisible by 37 (i.e., the remainder is zero).
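
The edge case is easy to reproduce in miniature. The sketch below is Python rather than Perl, and it is neither the Algorithm::CheckDigits code nor a full SICI checksum. It assumes, for illustration only, that the check character is found by subtracting the remainder from 37 and indexing into a 37-symbol alphabet (0-9, A-Z, #). The function names are made up, and the input is a sum that has already been computed.

```python
# Assumed 37-symbol alphabet: digits, upper-case letters, then '#'.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#"


def naive_check_index(total):
    """Subtract the remainder from 37; gives 37 (out of range) when total % 37 == 0."""
    return 37 - (total % 37)


def safe_check_index(total):
    """Same calculation, wrapped back into the range 0..36."""
    return (37 - (total % 37)) % 37


def check_char(total):
    """Map a precomputed character sum to a check character."""
    return ALPHABET[safe_check_index(total)]


total = 74  # exactly divisible by 37
print(naive_check_index(total))  # 37, one past the end of ALPHABET
print(check_char(total))         # "0"
```

The naive version only goes wrong for sums that are exact multiples of 37, which is why a bug of this kind can go unnoticed for a long time.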


D. Eppstein said...

I tried copying and pasting in some doi-less references from Wikipedia (all consistently formatted using WP's citation templates), but it couldn't parse them. Maybe it should? Here's the list I tried:

Chandrasekaran, R.; Tamir, A. (1989). "Open questions concerning Weiszfeld's algorithm for the Fermat-Weber location problem".

Cockayne, E. J.; Melzak, Z. A. (1969). "Euclidean constructability in graph minimization problems.". Mathematics Magazine 42: 206–208.

Kuhn, H. W. (1973). "A note on Fermat's problem". Mathematical Programming 4: 98–107.

Wesolowsky, G. (1993). "The Weber problem: History and perspective". Location Science 1: 5–23.

Weiszfeld, E. (1937). "Sur le point pour lequel la somme des distances de n points donnes est minimum". Tohoku Math. Journal 43: 355–386.

Rod Page said...

David, sorry about this, there were two problems. The first is that the regular expressions I use didn't match the Wikipedia style. I've fixed this. The original code I use (ParaTools) comes with loads of templates, but I've commented these out with the aim of being able to easily debug just the ones I need.

The second issue is the presence of an "en dash" (–, Unicode U+2013, UTF-8 E2 80 93) separating the pages, rather than a plain old hyphen (-). The Perl tools I use don't handle anything other than a hyphen. After some hair pulling I found a way to replace the en dash using a regular expression ( s/\xe2\x80\x93/\-/g, courtesy of shell-monkey).
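
For comparison, here is roughly the same substitution sketched in Python (not the Perl I actually use). The function names are made up. The first version works on decoded text, the second on raw UTF-8 bytes, which is closer to what the Perl regular expression does.

```python
import re

EN_DASH = "\u2013"  # UTF-8 bytes E2 80 93


def normalise_dashes(text):
    """Replace en dashes in a decoded string with plain hyphens."""
    return text.replace(EN_DASH, "-")


def normalise_dashes_bytes(raw):
    """Byte-level equivalent of the Perl substitution s/\\xe2\\x80\\x93/\\-/g."""
    return re.sub(rb"\xe2\x80\x93", b"-", raw)


print(normalise_dashes("Mathematical Programming 4: 98\u2013107."))
```

Running this before the reference parser sees the text means the page-range patterns only ever have to cope with a hyphen.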

Now things should work better. The Kuhn paper is the only one with a DOI (doi:10.1007/BF01584648).

D. Eppstein said...

Thanks! It was a bit of an unfair test, as I omitted the refs that were listed as having DOIs when I grabbed them from the article I used, so I'm not surprised so few had them.

Issue numbers are still confusing this, e.g. in

Stolarsky, Kenneth (1976). "Beatty sequences, continued fractions, and certain shift operators". Canadian Mathematical Bulletin 19 (4): 473–482.

but that's easily worked around. I can see your little script being quite useful...

Rod Page said...

OK, I've added issue handling, it should parse these OK now (fingers crossed). Thanks for the feedback.
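
To give a flavour of what a template with optional issue handling looks like, here is a sketch in Python. It is not the actual ParaTools template, and the pattern and function name are made up for illustration. It assumes the Wikipedia layout seen in the references above: authors, year in parentheses, quoted title, journal, volume, optional issue in parentheses, then the page range.

```python
import re

# Hypothetical pattern for: Authors (Year). "Title". Journal Vol (Issue): first-last.
CITATION = re.compile(
    r"""^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+
        "(?P<title>.+?)"\.\s+
        (?P<journal>.+?)\s+
        (?P<volume>\d+)\s*
        (?:\((?P<issue>[^)]+)\))?\s*:\s*
        (?P<spage>\d+)\s*[-\u2013]\s*(?P<epage>\d+)\.?$""",
    re.VERBOSE,
)


def parse_citation(text):
    """Return a dict of citation fields, or None if the text does not match."""
    match = CITATION.match(text.strip())
    return match.groupdict() if match else None


ref = (
    'Stolarsky, Kenneth (1976). "Beatty sequences, continued fractions, and '
    'certain shift operators". Canadian Mathematical Bulletin 19 (4): 473\u2013482.'
)
fields = parse_citation(ref)
print(fields["journal"], fields["volume"], fields["issue"], fields["spage"])
```

The issue group is optional, so a reference without an issue number still matches and simply reports the issue as missing. The character class accepts either a hyphen or an en dash between the pages.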
