Thursday, October 25, 2007

BHL and DOIs

In a series of emails Chris Freeland, David Shorthouse, and I have been discussing DOIs in the context of the Biodiversity Heritage Library (BHL). I thought it worthwhile to capture some thoughts here.
In an email Chris wrote:
Sure, DOIs have been around for a while, but how many nomenclators or species databases record them? Few, from what I've seen - instead they record citations in traditional text form. I'm trying to find the middle ground between guys like the two of you, who want machine-readable lit (RDF), and most everyone else I talk with, including regular users of Botanicus & BHL, who want human-readable lit (PDF). I'm not overstating - it really does break down into these 2 camps (for now), with much more weight over on the PDF side (again, for now).

I think the perception that there are two "camps" is unfortunate. I guess for a working taxonomist, it would be great if for a given taxonomic name there was a way to see the original publication of the name, even if it is simply a bitmap image (such as a JPEG). Hence, a database that links names to images of text would be a useful resource. If this is what BHL is aiming for, then I agree, DOIs may seem to be of little use, apart from being one way to address the issue of persistent identifiers.

But it seems to me that there are lots of tasks for which DOIs (or more precisely, the infrastructure underlying them) can help. For example, given a bibliographic citation such as

Fiers, F. and T. M. Iliffe (2000) Nitocrellopsis texana n. sp. from central TX (U.S.A.) and N. ahaggarensis n. sp. from the central Algerian Sahara (Copepoda, Harpacticoida). Hydrobiologia, 418:81-97.

how do I find a digital version of this article? Given this citation

Fiers, F. & T. M. Iliffe (2000). Hydrobiologia, 418:81.

how do I decide that this is the same article? If I want to see whether somebody has cited this paper (and perhaps changed the name of the copepod), how do I do that? If I want to follow up the references in this paper, how do I do that?

These are the kinds of thing that DOIs address. This article has the DOI doi:10.1023/A:1003892200897. This gives me a globally unique identifier for the article. The DOI foundation provides a resolver whereby I can go to a site that will provide me with access (albeit possibly for a fee) to the article. CrossRef provides an OpenURL service whereby I can
  • Retrieve metadata about the article given the DOI

  • Search for a DOI given metadata about the article
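As a rough sketch of what these lookups might look like in practice, the Python below only builds the request URLs (it makes no network calls). The resolver base URL, the OpenURL endpoint, the parameter names (pid, id, title, volume, spage, date, noredirect), and the function names are my assumptions for illustration; check the current CrossRef documentation before relying on them. Note that the short and the long citation above share the same journal, volume, start page, and year, so both would produce the same search query.

```python
from urllib.parse import urlencode

# Assumed base URLs; the real services may use different addresses.
DOI_RESOLVER = "https://doi.org/"
CROSSREF_OPENURL = "https://www.crossref.org/openurl"


def resolver_url(doi: str) -> str:
    """URL that asks the DOI resolver to redirect to the article."""
    return DOI_RESOLVER + doi


def metadata_query(doi: str, pid: str) -> str:
    """OpenURL-style query asking for metadata about a known DOI.

    `pid` is assumed to be the account e-mail the service asks for.
    """
    params = {"pid": pid, "id": "doi:" + doi, "noredirect": "true"}
    return CROSSREF_OPENURL + "?" + urlencode(params)


def doi_search_query(journal: str, volume: str, start_page: str,
                     year: str, pid: str) -> str:
    """OpenURL-style query that searches for a DOI from citation metadata."""
    params = {
        "pid": pid,
        "title": journal,      # journal title
        "volume": volume,
        "spage": start_page,   # first page of the article
        "date": year,
        "noredirect": "true",
    }
    return CROSSREF_OPENURL + "?" + urlencode(params)


if __name__ == "__main__":
    doi = "10.1023/A:1003892200897"
    print(resolver_url(doi))
    print(metadata_query(doi, "you@example.org"))
    # Fields shared by both the full and the abbreviated citation.
    print(doi_search_query("Hydrobiologia", "418", "81", "2000",
                           "you@example.org"))
```

The point of the sketch is that the abbreviated citation carries enough information (journal, volume, start page, year) to ask the same question as the full one.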

To an end user much of this is irrelevant, but to people building the links between taxonomic names and taxonomic literature, these are pressing issues. I've previously given some examples where taxonomic databases such as Catalogue of Life and ITIS store only text citations, not identifiers (such as DOIs or Handles). As a result, the user has to search for each paper "by hand". Surely in an ideal world there would be a link to the publication? If so, how do we get there? How do IPNI, Index Fungorum, ITIS, Sp2000, ZooBank, and so on link their names and references to digitised content? This is where a CrossRef-style infrastructure comes in.

Publishers "get this". Given the nature of the web, where users expect to be able to follow links, CrossRef deals with the issue of converting the literature cited section of a paper into a set of clickable links. Don't we want the same thing for our databases of taxonomic names? And don't we want this for our taxonomic literature?

It is worth noting that the perception that DOIs only cover modern literature is erroneous. For example, here's the description of Megalania prisca Owen (doi:10.1098/rstl.1859.0002), which was published in 1859. The Royal Society of London has DOIs for articles published in the 18th century.

If the Royal Society can do this, why can't BHL?


Chris Freeland said...

Rod - I'd like to clarify some points. BHL is not against DOIs - far from it. We recognize the importance of them, but the cost for assigning them is not currently included in any budget line item.

Problems are also presented in the actual assignment: 1) Determining article boundaries within a scanned volume, and 2) Determining article title and authors. BHL is working on a project with PennState to develop automated processes that address these problems, but without full rekeying (again, not in our current budget), the best metadata we can pull from is dirty OCR. This is not going to be sufficient for assigning a DOI at CrossRef, so we would need human intervention to clean up text before submission. Even more expensive.

It seems the services CrossRef provide are what entices you. I'm not suggesting that BHL should reinvent these, but aren't you concerned about building your services on top of other proprietary services? It's the closed and fee-based aspects of DOIs that seem to sit least favorably within the scientific community. As I've said before, if this were open and free it would be a moot point. But it's not, and that's what gives me pause.

Rod Page said...

Regarding the services, this is indeed what I find enticing. It's the services built on top of DOIs and metadata that make CrossRef so useful (both to users and to publishers).

Closed and fee-based can be problematic, as the recent scare over capped queries to CrossRef's OpenURL resolver showed (this issue has since been resolved). However, one could argue that it's because there is money invested in CrossRef that it succeeds -- there's no such thing as a free lunch. In the end the only services that persist are likely to be those that people value in (either directly or indirectly) financial terms.

I'm not aware of the community objecting to DOIs because they're closed; the issue always seems to be money. Yes, CrossRef is closed. We could try and create our own version along the lines of citebase. The tools are pretty much in place.

So, if I step back from DOIs for a moment, what I really want are the services that CrossRef offers. I suspect the best way to do this (in terms of integrating with existing literature) is to use DOIs, but I can live without them if we have the equivalent services.

Lastly, is anybody at BHL talking to JSTOR? They seem to have dealt with a lot of these issues (scanning articles, metadata, DOIs, etc.). Indeed, one wonders whether BHL should contract out the scanning to JSTOR...

Chris Freeland said...

BHL has talked with JSTOR, but it's another closed, fee-based system. Great if your institution participates, but effectively shuts off content to everyone else, which goes against our goal to make public domain materials available without restriction.

Rod Page said...

I was meaning more in terms of using their technology, but making the product openly available. In other words, not asking them to serve the content under their model, but getting them to do actual work.

David Shorthouse said...

Let's not forget that CrossRef is not-for-profit. There are fees because there is overhead involved in maintaining GUIDs like DOIs. Fees, though initially seen as distasteful, are exactly why DOIs work as Rod suggested. I see these fees as tokens of confidence and I'd like to see the same approach be used for museum specimens (Argument and creative approach outlined here).

Chris Freeland said...

Another obstacle is the requirement to have an ISSN when assigning a DOI - as you've previously discussed, many historic titles have no ISSN or ISBN. We've been in several meetings with the ISBN registration agency, and have started digging into the issues surrounding the assignment of ISBNs to historic monographs, but ISSNs are handled by a different agency. The ISBN issue got really ugly, really quickly, by the way, when trying to determine publisher, contributors, distributors, etc. It's an evolving model, and projects like BHL are pushing the boundaries.

Rod Page said...

Do you need ISSNs to get a DOI for a journal article? My reading of the CrossRef spec is that the ISSN is mandatory if it exists. If it doesn't, I'm not sure this precludes getting a DOI. I guess somebody from CrossRef would be able to clarify this.

I'm less worried about ISBNs, but that may be simply my ignorance about books. I've devoted more energy to ISSNs because of their possible role in generating identifiers for articles.

Ed Pentz said...

Hi - an ISSN *is* required for a CrossRef journal article DOI deposit. ISSNs are regularly retrospectively assigned to older serials when they get digitized. There is no charge for an ISSN. In addition, you don't need to know who the original publisher was to assign an ISSN - seems more flexible than ISBN.

DOIs might not be right for BHL but David got it exactly right when he said that CrossRef fees are "tokens of confidence" that ensure persistence.

In terms of CrossRef and DOI being "closed" - if charging fees means we're closed, I guess we are, but we have a pretty long list of participating "publishers".

Rod Page said...

Ed - Thanks for clarifying the requirement for an ISSN. Given that they are free to assign, I'm all in favour of using them. It also helps generate unique identifiers for journal articles, regardless of whether DOIs end up being the identifier or not.
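
To illustrate that last point, here is a minimal Python sketch that validates an ISSN using the standard check digit (weights 8 down to 2 over the first seven digits, modulo 11, with "X" standing for 10) and then combines it with volume and start page into an article key. The key format and the function names are made up for illustration; they are not a CrossRef or ISSN agency convention.

```python
def issn_check_digit(first_seven: str) -> str:
    """Compute the ISSN check character from the first seven digits."""
    # Weights run 8, 7, 6, 5, 4, 3, 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Return True if the ISSN has eight characters and a correct check digit."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    return issn_check_digit(chars[:7]) == chars[7]


def article_key(issn: str, volume: str, start_page: str) -> str:
    """Hypothetical article identifier: ISSN, volume, and start page joined by colons."""
    if not is_valid_issn(issn):
        raise ValueError(f"invalid ISSN: {issn}")
    return f"{issn}:{volume}:{start_page}"


if __name__ == "__main__":
    # 0018-8158 is the print ISSN of Hydrobiologia.
    print(article_key("0018-8158", "418", "81"))
```

A key like this would let two differently formatted citations of the same article collapse to one identifier, whether or not a DOI exists for it.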