Monday, May 28, 2007

OpenURL and spiders

I'm not sure who said it first, but there's a librarianly spin on the old Perl paradigm I think I heard at code4libcon or Access in the past year: instead of "making simple things simple, and complex things possible," we librarians and those of us librarians who write standards tend, in writing our standards, to "make complex things possible, and make simple things complex."
That approach just won't cut it anymore.
-Dan Chudnov, Rethinking OpenURL


Time to bring some threads together. I've been working on a tool to parse references and find existing identifiers. The tool is at http://bioguid.info/references (for more on my bioGUID project see the blog). Basically, you paste in one or more references, and it tries to figure out what they are, using ParaTools and CrossRef's OpenURL resolver. For example, if you paste in this reference:
Vogel, B. R. 2004. A review of the spider genera Pardosa and Acantholycosa (Araneae, Lycosidae) of the 48 contiguous United States. J. Arachnol. 32: 55-108.

the service tells you that there is a DOI (doi:10.1636/H03-8).
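Under the hood that lookup is straightforward. Here's a minimal sketch (not the actual bioGUID code) of asking CrossRef's OpenURL resolver for a DOI given some parsed metadata; the parameter names, the pid account email, and the shape of the XML response are from memory, so treat them as assumptions to check against CrossRef's documentation.

  use strict;
  use warnings;
  use LWP::Simple qw(get);
  use URI;

  # Metadata pulled out of the Vogel (2004) reference above
  my %metadata = (
      issn   => '0161-8202',   # Journal of Arachnology
      volume => '32',
      spage  => '55',
      date   => '2004',
  );

  # Build an OpenURL query against CrossRef's resolver.
  # 'pid' is a CrossRef query account (placeholder here), and
  # redirect=false asks for XML containing the DOI rather than a redirect.
  my $uri = URI->new('http://www.crossref.org/openurl/');
  $uri->query_form(
      pid      => 'you@example.org',
      %metadata,
      redirect => 'false',
  );

  my $xml = get($uri->as_string);

  # Crude match for a <doi> element in the response
  if (defined $xml && $xml =~ m{<doi[^>]*>([^<]+)</doi>}) {
      print "Found DOI: $1\n";    # e.g. 10.1636/H03-8
  }
  else {
      print "No DOI found\n";
  }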



OK, but what if there is no DOI? Every issue of the Journal of Arachnology is online, but only issues from 2000 onwards have DOIs (hosted by my favourite DOI breaker, BioOne). How do I link to the other articles?

One way is using OpenURL. What I've done is add an OpenURL service to bioGUID. If you send it a DOI, it simply redirects you to dx.doi.org to resolve it. But I've started to expand it to handle papers that I know have no DOI.

First up is the Journal of Arachnology. I used SiteSucker to pull all the HTML files listing the PDFs from the journal's web site, then ran a Perl script that read each HTML file and pulled out the links. The links weren't terribly consistent (there are at least five or six different ways they are written), but they are consistent enough to parse. What is especially nice is that the URLs include the volume and starting page number, which greatly simplifies my task. This gives me a list of over 1000 papers, each with a URL, and for each paper I have the journal, year, volume, and starting page. These four things are enough to uniquely identify the article. I store all this information in a MySQL database, so when a user clicks the OpenURL link in the reference parser's results and the journal is the Journal of Arachnology, they go straight to the PDF. Here's one to try.
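If you're curious what the harvesting step looks like, it boils down to a regular expression and a database insert. A rough sketch follows; the table layout, file names, and link pattern are all invented for illustration (the real pages need several different patterns, and this is not the actual bioGUID code).

  use strict;
  use warnings;
  use DBI;

  # Store (journal, year, volume, spage) -> URL so the OpenURL resolver
  # can find the PDF later. Table and column names are made up for this sketch.
  my $dbh = DBI->connect('DBI:mysql:database=bioguid', 'username', 'password',
      { RaiseError => 1 });
  my $sth = $dbh->prepare(
      'INSERT INTO article (journal, year, volume, spage, url) VALUES (?, ?, ?, ?, ?)');

  # Walk the pages saved by SiteSucker and pull out the PDF links.
  foreach my $file (glob 'journal_of_arachnology/*.html') {
      open my $fh, '<', $file or die "Can't open $file: $!";
      local $/;                  # slurp the whole file
      my $html = <$fh>;
      close $fh;

      # The year appears somewhere in the listing page (pattern invented)
      my ($year) = $html =~ /\((\d{4})\)/;

      # Hypothetical link format, e.g. href="JoA_v16_p47.pdf";
      # prefix the captured URL with the journal's base URL as needed.
      while ($html =~ /href="([^"]*_v(\d+)_p(\d+)\.pdf)"/gi) {
          my ($url, $volume, $spage) = ($1, $2, $3);
          $sth->execute('Journal of Arachnology', $year, $volume, $spage + 0, $url);
      }
  }

The resolver side is then just a SELECT on those four values.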




Yeah, but what else can we do with this? Well, for one thing, you can use the bioGUID OpenURL service in Connotea. On the Advanced settings page you can set an OpenURL resolver. By default I use CrossRef, but if you put "http://bioguid.info/openurl.php" as the Resolver URL, you will be able to get full text for the Journal of Arachnology (provided you've entered sufficient bibliographic details when saving the reference).
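For the curious, the kind of request a resolver like this receives looks something like the following (the exact parameters Connotea sends may differ; this just follows the usual genre/issn/volume/spage convention, using the Vogel paper as an example):

  http://bioguid.info/openurl.php?genre=article&issn=0161-8202&date=2004&volume=32&spage=55

The resolver looks those values up and either redirects to dx.doi.org (if it knows of a DOI) or straight to the PDF.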

But I think the next step is to have a GUID for each paper, and in the absence of a DOI I'm in favour of SICIs (see my del.icio.us bookmarks for some background). For example, the paper above has the SICI 0161-8202(1988)16<47>2.0.CO;2-0. If this were a resolvable identifier, then we would have unique, stable identifiers for Journal of Arachnology papers that resolve to PDFs. Anybody making links between, say, a scientific name and the publication in which it appeared (e.g., Catalogue of Life) could use the SICI as the GUID for the publication.

I need to play more with SICIs (and when I get the chance I'll write a post about what the different bits of a SICI mean), but for now, before I forget, I'll note that while writing code to generate SICIs for the Journal of Arachnology I found a bug in the Perl module Algorithm-CheckDigits-0.44 that I use to compute the checksum for a SICI (the final character in the SICI). The checksum is based on a sum of the characters in the SICI modulo 37, but the code barfs if the sum is exactly divisible by 37 (i.e., the remainder is zero).
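For anyone wanting to play along, using the module is a one-liner or two. The sketch below uses the documented Algorithm::CheckDigits interface, but I'm assuming 'sici' is the right key to pass to CheckDigits() and haven't double-checked exactly what form complete() expects, so take it as illustrative rather than gospel.

  use strict;
  use warnings;
  use Algorithm::CheckDigits;

  # Get a checker for the SICI algorithm
  my $sici = CheckDigits('sici');

  # Validate the SICI for the 1988 paper mentioned above
  my $id = '0161-8202(1988)16<47>2.0.CO;2-0';
  print "$id is ", ($sici->is_valid($id) ? 'valid' : 'not valid'), "\n";

  # checkdigit() returns the check character, basenumber() everything before it
  print 'check character: ', $sici->checkdigit($id), "\n";

  # complete() appends the check character to a SICI that lacks one, which is
  # what I use when generating identifiers. This is the call that dies in
  # version 0.44 whenever the character sum is an exact multiple of 37
  # (the remainder-zero case), so for now that case has to be caught by hand.
  my $generated = $sici->complete('0161-8202(1988)16<47>2.0.CO;2-');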

5 comments:

D. Eppstein said...

I tried copying and pasting in some doi-less references from Wikipedia (all consistently formatted using WP's citation templates), but it couldn't parse them. Maybe it should? Here's the list I tried:

Chandrasekaran, R.; Tamir, A. (1989). "Open questions concerning Weiszfeld's algorithm for the Fermat-Weber location problem".

Cockayne, E. J.; Melzak, Z. A. (1969). "Euclidean constructability in graph minimization problems.". Mathematics Magazine 42: 206–208.

Kuhn, H. W. (1973). "A note on Fermat's problem". Mathematical Programming 4: 98–107.

Wesolowsky, G. (1993). "The Weber problem: History and perspective". Location Science 1: 5–23.

Weiszfeld, E. (1937). "Sur le point pour lequel la somme des distances de n points donnes est minimum". Tohoku Math. Journal 43: 355–386.

Rod Page said...

David, sorry about this, there were two problems. The first is that the regular expressions I use didn't match the Wikipedia style. I've fixed this. The original code I use (ParaTools) comes with loads of templates, but I've commented these out with the aim of being able to easily debug just the ones I need.

The second issue is the presence of an "en dash" (–, Unicode U+2013, UTF-8 E2 80 93) separating the pages, rather than a plain old hyphen (-). The Perl tools I use don't handle anything other than a hyphen. After some hair pulling I found a way to replace the en dash using a regular expression ( s/\xe2\x80\x93/\-/g, courtesy of shell-monkey).
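In context the fix is just a normalisation pass over the raw reference before it reaches the citation parser, something along these lines (sketch only):

  # Replace en dashes (UTF-8 bytes E2 80 93) in page ranges with plain hyphens
  my $reference = 'Kuhn, H. W. (1973). "A note on Fermat\'s problem". Mathematical Programming 4: 98–107.';
  $reference =~ s/\xe2\x80\x93/-/g;
  # ...then hand $reference to the citation parser as before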

Now things should work better. The Kuhn paper is the only one with a DOI (doi:10.1007/BF01584648).

D. Eppstein said...

Thanks! It was a bit of an unfair test, as I omitted the refs that were listed as having DOIs when I grabbed them from the article I used, so I'm not surprised so few had them.

Issue numbers are still confusing this, e.g. in

Stolarsky, Kenneth (1976). "Beatty sequences, continued fractions, and certain shift operators". Canadian Mathematical Bulletin 19 (4): 473–482.

but that's easily worked around. I can see your little script being quite useful...

Rod Page said...

OK, I've added issue handling; it should parse these OK now (fingers crossed). Thanks for the feedback.
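For what it's worth, handling the issue number is just a matter of making it an optional group in the template. Something along these lines (not the actual ParaTools template; field names and pattern are purely illustrative, and it assumes the en dash has already been normalised to a hyphen):

  # Authors (Year). "Title". Journal Volume (Issue): Start-End.
  my $citation = 'Stolarsky, Kenneth (1976). "Beatty sequences, continued fractions, and certain shift operators". Canadian Mathematical Bulletin 19 (4): 473-482.';
  if ($citation =~ /^(.+?)\s+\((\d{4})\)\.\s+"(.+?)"\.\s+(.+?)\s+(\d+)\s*(?:\((\d+)\))?\s*:\s*(\d+)\s*-\s*(\d+)\.?\s*$/) {
      my ($authors, $year, $title, $journal, $volume, $issue, $spage, $epage)
          = ($1, $2, $3, $4, $5, $6, $7, $8);
  }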
