Wednesday, May 30, 2007

AMNH, DSpace, and OpenURL

Hate my tribe. Hate them for even asking why nobody uses library standards in the larger world, when “brain-dead inflexibility in practice” is one obvious and compelling reason, and “incomprehensibility” is another.

... $DEITY have mercy, OpenURL is a stupid spec. Great idea, and useful in spite of itself. But astoundingly stupid. Ranganathan preserve us from librarians writing specs! - Caveat Lector

OK, we're on a roll. After adding Journal of Arachnology and Psyche to my OpenURL resolver, I've now added the American Museum of Natural History's Bulletins and Novitates.

In an act of great generosity, the AMNH has placed its publications on a freely accessible DSpace server. This is a wonderful resource from one of the world's premier natural history museums (and an example others should follow). It is especially valuable because volumes of the Bulletins and Novitates published after 1999 are also hosted by BioOne (and hence have DOIs), but those versions are not free.

As blogged earlier on SemAnt, getting metadata from DSpace in an actually usable form is a real pain. I ended up writing a script to pull everything off via the OAI interface, extract metadata from the resulting XML, do a DOI look-up for post-1999 material, then dump this into the MySQL server so my OpenURL service can find it.
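For anyone wanting to try the same thing, here is a minimal sketch of the harvesting step, in Python rather than the Perl used here. It is not the actual script: the function names, the example.org base URL and the sample record are all made up for illustration. Only the `ListRecords` verb, the `oai_dc` metadata prefix, the `resumptionToken` element and the two namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"   # OAI-PMH 2.0 namespace
DC_NS = "http://purl.org/dc/elements/1.1/"        # unqualified Dublin Core


def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response.

    Each record is a dict mapping a Dublin Core element name to the list
    of values found, because elements such as dc:date may repeat.
    """
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("{%s}record" % OAI_NS):
        fields = {}
        for el in rec.iter():
            if el.tag.startswith("{%s}" % DC_NS) and el.text:
                name = el.tag.split("}", 1)[1]
                fields.setdefault(name, []).append(el.text.strip())
        records.append(fields)
    token = root.find(".//{%s}resumptionToken" % OAI_NS)
    has_token = token is not None and token.text
    return records, (token.text.strip() if has_token else None)


# Invented sample response, NOT real AMNH output.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1234/5678</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
          <dc:date>2005-10-06T14:30:12Z</dc:date>
          <dc:date>1951</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_records(SAMPLE)
```

A real harvest would fetch `list_records_url(base_url)`, parse the response, and keep requesting with the returned token until none comes back. The DOI look-up and the MySQL load are left out.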

Apart from the tedium of having to find the OAI interface (why oh why do people make this harder than it needs to be?), the metadata served up by the AMNH is, um, a little ropey. They use Dublin Core, which is great, but the AMNH makes a hash of using it. Dublin Core provides quite a rich set of terms for describing a reference, and guidelines on how to use them. Yet the AMNH uses the same tag for different things. Take date, for example:


Now, one of these dates is the date of publication, the others are dates the metadata was uploaded (or so I suspect). So, why not use the appropriate terms? Like, for instance, <dcterms:created>. Why do I have to parse three fields, and intuit that the third one is the date of publication? Likewise, why have up to three <dc:title> fields, and why include an abbreviated citation in the title? And why, for the love of God, format that citation differently for different articles!? Why have multiple <dc:description> fields, one of which is the abstract (and for which <dcterms:abstract> is available)? It's just a mess, and it's very annoying (as you can probably tell). I can see why some hate library standards.
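To give a flavour of the guesswork involved, here is a small Python sketch of one way to pick the publication year out of several unqualified <dc:date> values. It rests on an assumption that may not hold for every record: that upload dates appear as full timestamps (containing a "T") while the publication date is a bare year or date. The function name and the sample values in the usage note are invented, not taken from the AMNH records.

```python
import re


def guess_publication_year(dates):
    """Guess a publication year from repeated, unqualified dc:date values.

    Heuristic (an assumption, not a rule from any spec): values with a
    time component are treated as repository housekeeping dates and are
    ignored if anything else is available; the earliest remaining year wins.
    """
    plain = [d for d in dates if "T" not in d]
    candidates = plain or dates  # fall back to timestamps if nothing else
    years = []
    for value in candidates:
        match = re.match(r"\s*(\d{4})", value)
        if match:
            years.append(int(match.group(1)))
    return min(years) if years else None
```

Called with invented values such as `["2005-10-06T14:30:12Z", "2005-10-06T14:30:12Z", "1951"]` it returns `1951`. With a qualified term marking the publication date, none of this would be needed.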

Anyway, after much use of Perl regular expressions, and some last-minute finessing with Excel, I think we now have the AMNH journals available through OpenURL.

For a demo, go to David Shorthouse's list of references for spiders, pick a letter (say, P), and click on the bioGUID symbol next to a paper by Norm Platnick in American Museum Novitates.


Eric Muzzy said...

Good points Rob.
I wouldn't characterize the problems you're describing as arising out of library standards. Dublin Core was developed outside the library community, by computer science types who must have feared that if the specification was too difficult it wouldn't be adopted. Well, now it has, and of course it would be great if we could make all our data qualified.
As to the OAI records our server is sending out, we are using code supplied by MIT's DSpace from a 2005-6 release. I'm not sure if there's more qualified DC in the current code. It is curious that the metadata you see in a publication's page has fully qualified Dublin Core, which reflects the fully qualified Dublin Core I gave it during the publications ingest into the repository. Hmmm.

arhutch said...

I would only add that the Smithsonian Institution has also posted its scholarly series on the open web. See:

Rod Page said...

It's good to see that the Smithsonian is making some of its series available (being a zoologist by background, I'd overlooked the Smithsonian Contributions to Botany). I hope the Contributions to Zoology will also appear.

One worry, though, is that the Smithsonian doesn't have persistent identifiers for these works. They use URLs, and rather fragile ones at that. For instance, this one which ends in "scb_RecordSingle.cfm?filename=sctb-0043". So, what if the Smithsonian decides to junk ColdFusion (source of the ".cfm" extension) in favour of something else? All the URLs break. This is a Bad Thing ™ (see Cool URIs don't change).

The AMNH, in contrast, uses handles to identify their digital publications (the same technology that underlies DOIs). Hence, I can use the identifier hdl:2246/4228 to refer to, in this case, Archer, A. F. 1951. Studies in the orbweaving spiders (Argiopidae). 1. Am. Mus. Novit. 1487: 1-52. The handle makes no reference to the underlying technology, so if the AMNH switch from using DSpace to some other serving technology, the identifier remains unchanged.
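As a small illustration of how little a client needs to know, here is a Python sketch that turns a handle into a link via the public Handle System proxy at hdl.handle.net. The function name is made up, and the sketch assumes the identifier is written with an optional "hdl:" prefix as above.

```python
from urllib.parse import quote


def handle_to_url(identifier, proxy="https://hdl.handle.net/"):
    """Turn 'hdl:2246/4228' (or a bare '2246/4228') into a proxy URL."""
    handle = identifier.strip()
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    # Keep the slash between prefix and suffix; escape anything unusual.
    return proxy + quote(handle, safe="/")
```

So `handle_to_url("hdl:2246/4228")` gives `https://hdl.handle.net/2246/4228`, and nothing in that URL depends on which software the AMNH happens to run.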

This is one reason why we need GUIDs for literature.

Dorothea said...
This comment has been removed by the author.
Dorothea said...

Bleh, let me try that again.

DSpace actually knows the difference between those dates, and could conceivably emit qualified Dublin Core that specifies the difference. The problem here is the OAI spec, which REQUIRES that the base metadata (which nobody I know goes beyond) be unqualified Dublin Core.

Stupid. But there you are.