iPhylo: May 2007

Roderic D. M. Page

Wednesday, May 30, 2007

AMNH, DSpace, and OpenURL

Hate my tribe. Hate them for even asking why nobody uses library standards in the larger world, when “brain-dead inflexibility in practice” is one obvious and compelling reason, and “incomprehensibility” is another.

... $DEITY have mercy, OpenURL is a stupid spec. Great idea, and useful in spite of itself. But astoundingly stupid. Ranganathan preserve us from librarians writing specs! - Caveat Lector

OK, we're on a roll. After adding Journal of Arachnology and Psyche to my OpenURL resolver, I've now added the American Museum of Natural History's Bulletins and Novitates.

In an act of great generosity, the AMNH has placed its publications on a freely accessible DSpace server. This is a wonderful resource provided by one of the world's premier natural history museums (and one others should follow), and is especially valuable given that volumes of the Bulletins and Novitates post 1999 are also hosted by BioOne (and hence have DOIs), but these versions of the publications are not free.

As blogged earlier on SemAnt, getting metadata from DSpace in an actually usable form is a real pain. I ended up writing a script to pull everything off via the OAI interface, extract metadata from the resulting XML, do a DOI look-up for post-1999 material, then dump this into the MySQL server so my OpenURL service can find it.
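
The harvesting step can be sketched roughly as follows. This is a Python sketch, not the script described above. The only things it relies on are the standard OAI-PMH `ListRecords` verb, the `oai_dc` metadata prefix, and the `resumptionToken` element; the base URL in the usage note is a placeholder, and the function names are made up for illustration.

```python
from typing import Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if token:
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        # The first request asks for unqualified Dublin Core records.
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumptionToken in a response, or None if there is none."""
    root = ET.fromstring(response_xml)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()
```

A harvester would fetch `list_records_url("http://example.org/oai")`, hand each response to `next_token`, and keep requesting until it returns `None`.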

Apart from the tedium of having to find the OAI interface (why oh why do people make this harder than it needs to be?), the metadata served up by the AMNH is, um, a little ropey. They use Dublin Core, which is great, but the AMNH makes a hash of using it. Dublin Core provides quite a rich set of terms for describing a reference, and guidelines on how to use it. The AMNH uses the same tag for different things. Take date, for example:


<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>1946</dc:date>

Now, one of these dates is the date of publication, the others are dates the metadata was uploaded (or so I suspect). So, why not use the appropriate terms? Like, for instance, <dcterms:created>. Why do I have to parse three fields, and intuit that the third one is the date of publication? Likewise, why have up to three <dc:title> fields, and why include an abbreviated citation in the title? And why, for the love of God, format that citation differently for different articles!? Why have multiple <dc:description> fields, one of which is the abstract (for which <dcterms:abstract> is available)? It's just a mess, and it's very annoying (as you can probably tell). I can see why some hate library standards.
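
To make the guesswork concrete, here is a small Python sketch of the kind of heuristic this forces on a harvester. It assumes (and it is only an assumption) that full timestamps are repository events and that a bare year or plain date is the publication date; the function name and the wrapping `<record>` element are invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Fully qualified name of <dc:date> in unqualified Dublin Core.
DC_DATE = "{http://purl.org/dc/elements/1.1/}date"

# A year, optionally followed by -MM or -MM-DD, with no time component.
PLAIN_DATE = re.compile(r"^\d{4}(-\d{2}){0,2}$")


def publication_year(record_xml: str) -> Optional[int]:
    """Guess the publication year from a record with several <dc:date> fields."""
    root = ET.fromstring(record_xml)
    for element in root.iter(DC_DATE):
        text = (element.text or "").strip()
        # Skip timestamps such as 2005-10-05T22:02:08Z (assumed upload dates).
        if PLAIN_DATE.match(text):
            return int(text[:4])
    return None


record = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>1946</dc:date>
</record>"""
print(publication_year(record))  # 1946
```

If the repository used a dedicated term for the publication date, none of this pattern matching would be needed.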

Anyway, after much use of Perl regular expressions, and some last minute finessing with Excel, I think we now have the AMNH journals available through OpenURL.

For a demo, go to David Shorthouse's list of references for spiders, say the letter P and click on the bioGUID symbol by a paper by Norm Platnick in the American Museum novitates.

Monday, May 28, 2007

OpenURL and spiders

I'm not sure who said it first, but there's a librarianly spin on the old Perl paradigm I think I heard at code4libcon or Access in the past year: instead of "making simple things simple, and complex things possible," we librarians and those of us librarians who write standards tend, in writing our standards, to "make complex things possible, and make simple things complex."
That approach just won't cut it anymore.
-Dan Chudnov, Rethinking OpenURL

Time to bring some threads together. I've been working on a tool to parse references and find existing identifiers. The tool is at http://bioguid.info/references (for more on my bioGUID project see the blog). Basically, you paste in one or more references, and it tries to figure out what they are, using ParaTools and CrossRef's OpenURL resolver. For example, if you paste in this reference:

Vogel, B. R. 2004. A review of the spider genera Pardosa and Acantholycosa (Araneae, Lycosidae) of the 48 contiguous United States. J. Arachnol. 32: 55-108.

the service tells you that there is a DOI (doi:10.1636/H03-8).

OK, but what if there is no DOI? Every issue of the Journal of Arachnology is online, but only issues from 2000 onwards have DOIs (hosted by my favourite DOI breaker, BioOne). How do I link to the other articles?

One way is using OpenURL. What I've done is add an OpenURL service to bioGUID. If you send it a DOI, it simply redirects you to dx.doi.org to resolve it. But I've started to expand it to handle papers that I know have no DOI. First up is the Journal of Arachnology. I used SiteSucker to pull all the HTML files listing the PDFs from the journal's web site. Then I ran a Perl script that read each HTML file and pulled out the links. They weren't terribly consistent (there are at least five or six different ways the links are written), but they are consistent enough to parse. What is especially nice is that the URLs include information on volume and starting page number, which greatly simplifies my task. So, this gives me a list of over 1000 papers, each with a URL, and for each paper I have the journal, year, volume, and starting page. These four things are enough for me to uniquely identify the article. I then store all this information in a MySQL database, and when a user clicks on the OpenURL link in the list of results from the reference parser, if the journal is the Journal of Arachnology, you go straight to the PDF. Here's one to try.
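
The lookup itself is simple enough to sketch in a few lines of Python. This is an illustration only: it uses an in-memory SQLite database as a stand-in for MySQL, the table and column names are invented, and the PDF URL is a placeholder rather than a real link.

```python
import sqlite3
from typing import Optional


def make_index(rows):
    """Create a table keyed on the four fields that identify an article."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE article (journal TEXT, year INTEGER, volume TEXT, "
        "spage TEXT, url TEXT, PRIMARY KEY (journal, year, volume, spage))"
    )
    db.executemany("INSERT INTO article VALUES (?, ?, ?, ?, ?)", rows)
    return db


def find_pdf(db, journal, year, volume, spage) -> Optional[str]:
    """Return the stored URL for an article, or None if it is not indexed."""
    row = db.execute(
        "SELECT url FROM article WHERE journal = ? AND year = ? "
        "AND volume = ? AND spage = ?",
        (journal, year, volume, spage),
    ).fetchone()
    return row[0] if row else None


# One made-up row; the URL is a placeholder, not the journal's real address.
db = make_index([
    ("Journal of Arachnology", 1988, "16", "47", "http://example.org/JoA_v16_p47.pdf"),
])
print(find_pdf(db, "Journal of Arachnology", 1988, "16", "47"))
```

Making the four fields the primary key also guards against loading the same article twice.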

Yeah, but what else can we do with this? Well, for one thing, you can use the bioGUID OpenURL service in Connotea. On the Advanced settings page you can set an OpenURL resolver. By default I use CrossRef, but if you put "http://bioguid.info/openurl.php" as the Resolver URL, you will be able to get full text for the Journal of Arachnology (providing that you've entered sufficient bibliographic details when saving the reference).
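
For the curious, a link of the kind a client sends to a resolver can be put together like this. It is a sketch in Python using OpenURL 0.1-style keys (`genre`, `title`, `volume`, `spage`, `date`); exactly which keys the bioGUID resolver accepts is an assumption here, and the function name is invented.

```python
from urllib.parse import urlencode

# Resolver address as given in the post.
RESOLVER = "http://bioguid.info/openurl.php"


def openurl_link(journal, volume, spage, year, resolver=RESOLVER):
    """Build an OpenURL-style query for a journal article (hypothetical helper)."""
    params = {
        "genre": "article",
        "title": journal,   # journal title in OpenURL 0.1
        "volume": volume,
        "spage": spage,     # starting page
        "date": year,
    }
    return resolver + "?" + urlencode(params)


# The Vogel (2004) reference: J. Arachnol. 32: 55-108.
print(openurl_link("Journal of Arachnology", 32, 55, 2004))
# http://bioguid.info/openurl.php?genre=article&title=Journal+of+Arachnology&volume=32&spage=55&date=2004
```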

But I think the next step is to have a GUID for each paper, and in the absence of a DOI I'm in favour of SICIs (see my del.icio.us bookmarks for some background). For example, the paper above has the SICI 0161-8202(1988)16<47>2.0.CO;2-0. If this was a resolvable identifier, then we would have unique, stable identifiers for Journal of Arachnology papers that resolve to PDFs. Anybody making links between, say, a scientific name and when it was published (e.g., Catalogue of Life) could use the SICI as the GUID for the publication.
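
As a rough illustration of how little is needed to assemble one, here is a Python sketch that builds the body of a SICI of the same shape as the example above and splits one back into its parts. The control segment "2.0.CO;2" is simply copied from that example, the check character is deliberately not computed, and the function names and regular expression are my own guesses at the layout rather than anything taken from the standard.

```python
import re


def sici_without_check(issn, year, volume, spage, control="2.0.CO;2"):
    """Assemble ISSN(year)volume<page>control, leaving off the check character."""
    return f"{issn}({year}){volume}<{spage}>{control}"


# Loose pattern for the layout seen in the examples; not a full validator.
SICI_PATTERN = re.compile(
    r"^(?P<issn>\d{4}-\d{3}[\dX])\((?P<chronology>\d{4,8})\)"
    r"(?P<enumeration>[^<]+)<(?P<location>[^>]*)>(?P<control>.+)-(?P<check>.)$"
)


def split_sici(sici):
    """Split a SICI into named parts, or return None if it does not match."""
    match = SICI_PATTERN.match(sici)
    return match.groupdict() if match else None


print(sici_without_check("0161-8202", 1988, 16, 47))
# 0161-8202(1988)16<47>2.0.CO;2
print(split_sici("0161-8202(1988)16<47>2.0.CO;2-0")["enumeration"])
# 16
```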

I need to play more with SICIs (and when I get the chance I'll write a post about what the different bits of a SICI mean), but for now, before I forget, I'll note that while writing code to generate SICIs for Journal of Arachnology I found a bug in the Perl module Algorithm-CheckDigits-0.44 that I use to compute the checksum for a SICI (the final character in the SICI). The checksum is based on a sum of the characters in the SICI modulo 37, but the code barfs if the sum is exactly divisible by 37 (i.e., the remainder is zero).

Biodiversity Heritage Library blog - look but don't touch

Added the Biodiversity Heritage Library blog to my links on my blog, then noticed that BHL have disabled comments. So, we can view their progress, but can't leave comments. Sigh, I wonder whether BHL has quite grasped that one of the best uses of a blog is to interact with the people who leave comments, in other words, have a conversation. If nothing else, at least it gives you some idea of whether people are actually reading it.
That said, the stuff BHL are putting up on the Internet Archive looks cool. However, how do we find this stuff, and link to it (identifiers anyone?). Wouldn't the BHL blog be a great place to have this conversation...?

Wednesday, May 23, 2007

iTunes, embedded metadata, and DNA barcoding

Continuing on this theme of embedded metadata, this is one reason why DNA barcoding is so appealing. A DNA barcode is rather like embedded metadata -- once we extract it we can look up the sequence and determine the organism's identity (or, at least whether we've seen it before). It's very like identifying a CD based on a hash computed from the track lengths. Traditional identification is more complicated, involves more nebulous data (let's see, my frog has two bumps on the head, gee, are those things in that picture of a frog bumps?), much of which is not online.

Tuesday, May 22, 2007

XMP

Following on from the previous post, as Howison and Goodrum note, Adobe provides XMP as a way to store metadata in files, such as PDFs. XMP supports RDF and namespaces, which means widely used bibliographic standards such as Dublin Core and PRISM can be embedded in a PDF, so the article doesn't become separated from its metadata. Adobe provides a developers kit under a BSD license.

The vision of managing digital papers being as easy as managing digital music is really compelling. Imagine auto populating bibliographic management software simply by adding the PDF! Note also that the Creative Commons supports XMP.

Monday, May 21, 2007

iTunes and citation metadata.

Stumbled across a really nice paper (Why can't I manage academic papers like MP3s?) while reading commentary on Tim O'Reilly's post about FreeBase.

In response to Danny Ayers's post, Tim O'Reilly wrote

I think you miss my point. I wasn't using centralized vs. decentralized as the point of contrast. I was using pre-specified vs. post-specified.

Now, I know you are much closer to all the sem web discussions than I am, and I probably have mischaracterized them somewhat. But you have to ask why are they so widely mischaracterized? There's some fire to go with that smoke.

In a very different context, on a mailing list, Bill Janssen wrote something very apposite:

"Let me recommend the paper "Why can't I manage academic papers like MP3s?", (yes, I realize accepted standards say I should put that comma inside the quotation marks) by James Howison and Abby Goodrum, at http://www.freelancepropaganda.com/archives/MP3vPDF.pdf. The basic thesis is that our common document formats weren't designed for use with digital repositories, and that metadata standards are often designed for the use of librarians and publishers, who have different metadata concerns than end-users have."

That's the distinction between the Semantic Web and Web 2.0 that I was trying to get at.

Howison and Goodrum make some interesting points, especially about how easy it is to get (or create) metadata for a CD, especially when compared to handling academic literature. Charlie Rapple on All My Eye suggests that

While the authors' [Howison and Goodrum] pains have to an extent been resolved since, by online reference management/bookmarking tools such as Connotea or CiteULike (which both launched later that year), and by the increase in XML as a format for online articles (which unites the full text and metadata in one file), their issues with full text availability remain.

I think the pain is still there, especially as Connotea relies on articles having a DOI (or some other URI or identifier). Many articles don't have DOIs. Furthermore, often a paper will have a DOI but that DOI is not printed on the article (either the hard copy or the PDF). This is obviously true for articles that were published before DOIs came into existence but which now have DOIs; however, it is also the case for some recent articles. This means we need to use metadata about the article to try and find a DOI. In contrast, programs like iTunes use databases such as Gracenote CDDB to retrieve metadata for a CD, where the CD's identity is computed based on information on the CD itself (i.e., the track lengths). The identifier is computed from the object at hand.
This is one reason why I like SICIs (Serial Item and Contribution Identifier, see my del.icio.us bookmarks for sici for some background). These can be computed from metadata about an individual article, often using just information printed with the article (although the ISSN might not be). This, coupled with the collaborative nature of CD databases such as CDDB and freedb (users supply missing metadata), makes them a useful illustration of how we might construct a database of taxonomic literature. Users could contribute metadata about papers, with identifiers computed from the papers themselves.
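
The idea of an identifier computed from the object itself is easy to show in miniature. The Python sketch below hashes a list of track lengths into a short fingerprint. To be clear, this is not the actual CDDB or freedb disc ID algorithm; it is a generic SHA-1 digest used purely to illustrate the principle, and the function name and track lengths are invented.

```python
import hashlib


def disc_fingerprint(track_seconds):
    """Derive a short identifier from track lengths alone (illustrative only)."""
    # Canonical text form, so the same disc always yields the same bytes.
    canonical = ",".join(str(int(seconds)) for seconds in track_seconds)
    return hashlib.sha1(canonical.encode("ascii")).hexdigest()[:8]


# Two people with the same disc compute the same key without coordinating.
mine = disc_fingerprint([241, 198, 305, 187])
yours = disc_fingerprint([241, 198, 305, 187])
print(mine == yours)  # True
```

A SICI plays the same role for an article: anyone holding the paper can derive the same key, so contributed metadata has somewhere to attach.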

Saturday, May 19, 2007

EoL in the blogsphere

postgenomic is a great way to keep up with science blogs. For example, searching for encyclopedia of life pulls up all sorts of interesting posts. A sampling:

Island of doubt

There is simply no way around this taxonomic deficit. While the EOL won't by itself answer too many questions, by drawing attention to how much work remains before we begin to get a grip on the ecosystems we are already manipulating beyond recognition, maybe, just maybe, we can re-distribute some of our research resources to that less glamorous pursuit known as inventory control.

SciGuy

PRO: Jonathan Fanton, president of the MacArthur Foundation. This is certainly going to advance the science of identification, and the science behind biodiversity.
CON: Dan Graur, a University of Houston professor of biology. I'm skeptical. Some of this knowledge goes back to the 18th century. It's all very nice, but this is not a scientific endeavor, it's an editorial effort. I'm a scientist, I like new knowledge.

My Biotech Life

That flash phylogenetic tree just blew my mind. And the level bar that hides/shows information depending your knowledge level, also cool!

Pharyngula

I don't mean to sound so negative, since I think it's an eminently laudable goal, but I get very, very suspicious when I see all the initial efforts loaded towards building a pretty front end while the complicated core of the project is kept out of focus. I'd be more impressed with something like NCBI Entrez, which, while not as attractive as the EOL mockups, at least starts with the complicated business of integrating multiple databases. I want to see unlovely functionality first, before they try to entice me with a pretty face.

These are not the only blogs, and as always the comments left by others on these blogs are also fascinating. My sense is there is a "wow" factor based on the publicity, coupled with not inconsiderable skepticism about content.

Friday, May 18, 2007

TBMap paper out

My paper on mapping TreeBASE names to other databases is out as provisional PDF on the BMC Bioinformatics web site (doi:10.1186/1471-2105-8-158 -- not working yet).

The abstract:

TreeBASE is currently the only available large-scale database of published organismal phylogenies. Its utility is hampered by a lack of taxonomic consistency, both within the database, and with names of organisms in external genomic, specimen, and taxonomic databases. The extent to which the phylogenetic knowledge in TreeBASE becomes integrated with these other sources is limited by this lack of consistency.
Taxonomic names in TreeBASE were mapped onto names in the external taxonomic databases IPNI, ITIS, NCBI, and uBio, and graph G of these mappings was constructed. Additional edges representing taxonomic synonymies were added to G, then all components of G were extracted. These components correspond to "name clusters", and group together names in TreeBASE that are inferred to refer to the same taxon. The mapping to NCBI enables hierarchical queries to be performed, which can improve TreeBASE information retrieval by an order of magnitude.
TBMap database provides a mapping of the bulk of the names in TreeBASE to names in external taxonomic databases, and a clustering of those mappings into sets of names that can be regarded as equivalent. This mapping enables queries and visualisations that cannot otherwise be constructed. A simple query interface to the mapping and names clusters is available at: http://linnaeus.zoology.gla.ac.uk/~rpage/tbmap
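
The clustering step in the abstract amounts to finding the connected components of a graph. A minimal Python sketch of that idea, using union-find, is below. It is not the code behind TBMap, and the names in the example ("Aus bus", "Cus dus") are made-up placeholders rather than real mappings.

```python
from collections import defaultdict


def name_clusters(edges):
    """Group names into clusters: the connected components of the mapping graph."""
    parent = {}

    def find(name):
        # Register unseen names, then walk up to the root, halving paths as we go.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    clusters = defaultdict(set)
    for name in parent:
        clusters[find(name)].add(name)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Each edge is either a name mapping or a synonymy (all names hypothetical).
edges = [
    ("TreeBASE:Aus bus", "NCBI:Aus bus"),
    ("TreeBASE:Aus bus Smith", "NCBI:Aus bus"),
    ("TreeBASE:Cus dus", "ITIS:Cus dus"),
]
for cluster in name_clusters(edges):
    print(cluster)
```

Here the two TreeBASE spellings of "Aus bus" end up in one cluster because both map to the same external name.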

The TBMap web site needs some work; it's really only intended to document the mapping. Once I've tweaked and updated the mapping, I hope to use it in my forthcoming all-singing, all-dancing, phylogeny database...

Monday, May 14, 2007

More EoL commentary

Lucy Odling-Smee has a short piece on EoL in Nature (doi:10.1038/news070508-7), quoting a certain Page chap as saying

"If it's done well it could be fabulous"

Not the most insightful thing I've ever said. One of the issues Lucy's piece highlights is the long term sustainability of electronic resources like EoL. The whole issue of digital curation is worrying, given the transient nature of many electronic resources.

Friday, May 11, 2007

EoL commentary in Science

Mitch Leslie has written an article on EoL (doi:10.1126/science.316.5826.818). It starts:

Hands up if you've heard this before: An ambitious new project promises to create an online compendium of all 1.8 million or so described species. It can already claim participation by premier institutions, a wad of start-up cash, and huzzahs from biodiversity guru Edward O. Wilson. Although some confess to a wary sense of déjà vu, taxonomists hope that the Encyclopedia of Life (EOL) can provide the long-awaited comprehensive species catalog. Even enthusiasts agree that it faces some tall hurdles, however, such as signing up curators and getting permission to use copyrighted material.

Déjà vu because the defunct All-Species Foundation -- also covered in Science (doi:10.1126/science.294.5543.769) -- had much the same ambitions six years ago. It is easy to be sceptical, but I think it was Rudy Giuliani who said "under promise, over deliver." Wise words.

Thursday, May 10, 2007

David Shorthouse enters the blogsphere

David Shorthouse has entered the blogsphere with his iSpiders blog. As David describes it:

This blog will include bits that have fallen off the wagon as it were while developing The Canadian Arachnologist, The Nearctic Spider Database, The Nearctic Arachnologists' Forum and Spider WebWatch. The latter is a citizen science initiative that accepts observation data on 9 ambassador species in North America. I have a strong interest in federating biological data so there will undoubtedly be posts about nomenclatural management, species concepts, data aggregation techniques and the like.

Already some interesting commentary on EoL

ITIS and DOIs

Following on from my earlier grumble about how the Catalogue of Life handles literature, I've spent an afternoon mapping publications in the "itis".publications table in a copy of ITIS to external GUIDs, such as DOIs, Handles, and SICIs in JSTOR. The mapping is not complete by any means, but gives an idea of how many publications have GUIDs. You can view the mapping here. Many of the publications in ITIS are books, which don't have DOIs. A lot of the literature is also old (although this doesn't always mean it won't have a DOI).

Of 4296 records, 324 have DOIs (around 7.5%). Not a lot, but still a reasonable chunk. At least 700 of the ITIS publications are books (based on having an ISBN), so the percentage among the remaining publications is a little higher (roughly 9%).

The point of this exercise (following on from my comments on the design flaw in the Catalogue of Life) is that I think taxonomic databases need to use GUIDs internally to maximise their stability and utility.

Indeed, this is another reason to be disappointed with ZooBank. In addition to a poor way to navigate trees (which prompted me to explore tools such as PygmyBrowse), ZooBank does exactly what ITIS and the Catalogue of Life do when it comes to displaying literature -- it displays a text citation (albeit with an invitation to view that record in Zoological Record, a subscription-based service).

For example, the copepod Nitocrellopsis texana was described in ITIS publication 3072, which I've discovered has the DOI doi:10.1023/A:1003892200897. Given a DOI we have a GUID for the publication, and a direct link to it. In contrast, ZooBank merely gives us:

Nitocrellopsis texana n. sp. from central TX (U.S.A.) and N. ahaggarensis n. sp. from the central Algerian Sahara (Copepoda, Harpacticoida). Hydrobiologia 418 (1-3) 15 January: 82

and a link to Zoological Record. Interestingly, even with the resources of ISI behind it, the Zoological Record result doesn't have the DOI.

This for me is one reason ZooBank was so disappointing, it actually provided little of value.

What next? Well, with the 300 or so references mapped to DOIs, one could link those to the ITIS records for the corresponding taxonomic names, and serve these up through something like iSpecies, for example. These would be links to the literature, in many cases original descriptions, to supplement the other literature found by iSpecies.

Wednesday, May 09, 2007

Catalogue of Life design flaw

A bit more browsing of the Catalogue of Life annual checklist for 2007 reveals a rather annoying feature that, I think, cripples the Catalogue's utility. With each release the checklist grows in size. From their web site:

The Species 2000 & ITIS Catalogue of Life is planned to become a comprehensive catalogue of all known species of organisms on Earth by the year 2011. Rapid progress has been made recently and this, the seventh edition of the Annual Checklist, contains 1,008,965 species.

However, with each release the identifiers for each taxon change. For example, if I were to link to the record for the peacrab Pinnotheres pisum this year (2007), I would link to record 3803555, but last year I would have linked to 872170. Record 872170 no longer exists in the 2007 edition.

So, what would a user who based their taxonomic database on the Catalogue of Life do? All their links would break (not just because the URL interface has changed, but the underlying identifiers have changed as well). It's as if the authors of the catalogue have been oblivious to the discussion on globally unique identifiers (GUIDs) and the need for stable, persistent identifiers.

Anybody building a database that gets updated, and possibly rebuilt, needs to think about how their identifiers will change. If identifiers are simply the primary keys in a table, then they will likely be unstable, unless great care is taken. Alternatively, databases that are essentially aggregations of data available elsewhere could use GUIDs as the primary keys. This means that even if the database is restructured, the keys (and hence the identifiers) don't change. For the user, everything still works.
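
A toy example shows the difference. The Python sketch below rebuilds a tiny table twice, once with auto-generated integer keys and once with GUIDs carried over from the source data. It uses SQLite for convenience, and the table layout, the `urn:example:` identifiers, and the second species name are all invented for the illustration; it says nothing about how the Catalogue itself is built.

```python
import sqlite3


def build(records, use_guid):
    """Build one 'release' of a taxon table from (guid, name) records."""
    db = sqlite3.connect(":memory:")
    if use_guid:
        # The identifier travels with the record, so a rebuild cannot change it.
        db.execute("CREATE TABLE taxon (id TEXT PRIMARY KEY, name TEXT)")
        db.executemany("INSERT INTO taxon VALUES (?, ?)", records)
    else:
        # The identifier is whatever row number the load happens to assign.
        db.execute("CREATE TABLE taxon (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
        db.executemany("INSERT INTO taxon (name) VALUES (?)", [(name,) for _, name in records])
    return db


def id_for(db, name):
    """Look up the identifier a user would have linked to."""
    return db.execute("SELECT id FROM taxon WHERE name = ?", (name,)).fetchone()[0]


release_2006 = [("urn:example:pinnotheres-pisum", "Pinnotheres pisum")]
release_2007 = [("urn:example:aus-bus", "Aus bus")] + release_2006

for use_guid in (False, True):
    old = id_for(build(release_2006, use_guid), "Pinnotheres pisum")
    new = id_for(build(release_2007, use_guid), "Pinnotheres pisum")
    print(use_guid, old, new, "stable" if old == new else "broken")
```

With integer keys the same species gets a different identifier as soon as a new record is loaded ahead of it; with GUIDs as keys the link survives the rebuild.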

Despite the favourable press about its progress (e.g., doi:10.1038/news050314-6, Environmental Research Web, and CNN), I think the catalogue needs some serious rethinking if it is to be genuinely useful. For more on this, see my earlier posting on how the catalogue handles literature.

Image of Pinnotheres pisum by Hans Hillewaert obtained from Wikimedia Commons.

Encyclopedia of Life Launch

The Encyclopedia of Life web site is up, together with some rather breathless publicity and this cool movie. Of course, it's all vapourware just now. I'm involved in some of the informatics in an advisory role. It will be interesting to see what happens. Let's hope that the fate of EoL will be different to that of the similarly ambitious All Species. Oh, and then there's SpeciesBase...

For some reaction see Slashdot.

Tuesday, May 08, 2007

Duplicate DOIs

I think this isn't supposed to happen, but here's a paper with two DOIs.

The first DOI is doi:10.1651/0278-0372(1997)17[253:MPAOTC]2.0.CO;2, which links to a record served by BIOONE. The second is doi:10.2307/1549275, which links to a JSTOR record (sici:0278-0372(199705)17:2<253:MPAOTC>2.0.CO;2-O).

The paper is:

Morphology-Based Phylogenetic Analysis of the Clawed Lobsters (Family Nephropidae and the New Family Chilenophoberidae)
Dale Tshudy, Loren E. Babcock
Journal of Crustacean Biology, Vol. 17, No. 2 (May, 1997), pp. 253-263

Now, the digital versions served by BIOONE and JSTOR are different, in that BIOONE serves full text HTML and a PDF, whereas JSTOR serves scanned images, but this is the same article.

As an added "feature", the BIOONE DOI doesn't work, which in my experience is often the case with BIOONE DOIs.

Monday, May 07, 2007

Catalogue of Life, OpenURL, and taxonomic literature

Playing with the recently released "Catalogue of Life" CD, and pondering Charles Hussey's recent post to TAXACOM about the "European Virtual Library of Taxonomic Literature (E-ViTL)" (part of EDIT) has got me thinking more and more about how primitive our handling of taxonomic literature is, and how it cripples the utility of taxonomic databases such as the Catalogue of Life. For example, none of the literature listed in the Catalogue of Life is associated with any digital identifier (such as a DOI, Handle, SICI, or even a URL). In the digital age, this renders the literature section nearly useless -- a user has to search for the reference in Google. Surely we want identifiers, not (often poorly formed) bibliographic citations? For example, I think hdl:2246/4613 is more useful than

Schmidt, K. P. 1921. New species of North American lizards of the genera Holbrookia and Uta. American Museum Novitates (22)

Given the Handle hdl:2246/4613, we get straight to the bibliographic resource, and in this case, a PDF of the paper. In the digital age this is what we need.

So, how to get there? Well, I think we need to focus on developing services to associate references with identifiers. Imagine a service that takes a bibliographic record and returns a globally unique identifier for that reference. This, of course, is part of what CrossRef provides through its OpenURL resolver.

OpenURL has been around a while, and despite the fact that it is probably over complicated (see I hate library standards for more on the seeming desire of librarians to make things harder than they need to be), I think it is a useful way to think about the task of linking taxonomic names to literature, especially if we keep things simple (see Rethinking OpenURL). In particular, drop the obsession with local context -- I don't care what my local library has, my library is the cloud.

So, what if we had an OpenURL service that took a bibliographic citation and queried local and remote sources for a digital identifier, such as a DOI or a Handle, for that citation? If there is no such identifier, then the next step is to create one. For example, the service could create a SICI (see my del.icio.us bookmarks for sici) for that item. Ideally, for those items that were digitised, we could have a database that associated SICIs with the resource location. For example, most of the journal Psyche is available free online as PDFs, and has XML files for each volume providing full bibliographic details (including URLs). It would be trivial to harvest these and add this information to an OpenURL service.

These ideas need a little more fleshing out, but I think it's time the taxonomic community started thinking seriously about digital identifiers for literature, and how they would be used. CrossRef is a great example of what can be done with some simple services (Handles + OpenURL), and it's a tragedy that every time DOIs come up people get blinded by cost, and don't spend time trying to understand how CrossRef works. If you want a good demonstration of what can be done with CrossRef, just look at Connotea, which builds much of its functionality on top of CrossRef web services.

It is also interesting that CrossRef is much simpler to use than repositories such as DSpace (used by the AMNH's digital library) -- each DSpace installation has its own hooks to retrieve metadata (in some cases, such as the AMNH, appallingly badly formed), and as a result there is no easy way to discover what metadata is associated with a given handle, nor, given a citation, whether a handle exists for that citation.

So, when projects such as EDIT start talking about taxonomic libraries, I think they need to think in terms of simple web services that will serve as the building blocks for other tools. An OpenURL service would be a major boon, and would speed us towards the day when databases such as the Catalogue of Life would not contain (often inconsistently formed) text records of bibliographic works, but actionable identifiers. Anything less and we remain in the dark ages.

Wednesday, May 02, 2007

Google, the Ken Burns Effect, and Fractal Cognitive Engagement

From One Big Library, a nice screencast of why Google's interface works, in contrast to the clunky interfaces favoured by libraries (and databases such as TreeBASE). (If you don't see the video here, visit http://www.youtube.com/watch?v=ijLDxgALc2c.)