Wednesday, April 16, 2014

Breaking the Biodiversity Heritage Library

The Biodiversity Heritage Library (BHL) has recently introduced a feature that I strongly dislike. The post describing this feature (Inspiring discovery through free access to biodiversity knowledge... states:

Now BHL is expanding the data model for its portal to be able to accommodate references to content in other well-known repositories. This is highly beneficial to end users as it allows them to search for articles, alongside books and journals, within a single search interface instead of having to search each of these siloes separately.

What this means is that, whereas in the past a search in BHL would only turn up content actually in BHL, now that search may return results from other sources. What's not to like? Well, for me this breaks the fundamental BHL experience that I've come to rely on, namely:

If I find something in BHL I can read it there and then

With the new feature, the search results may include links to other sources. Sometimes these are useful, but sometimes they are anything but. Once you start including external links in your search results, you have limited control over what those links point to. For example, if I search BHL for the journal Revista Chilena de Historia Natural I get two hits. Cool! So I click on one hit and I can read a fairly limited set of scanned volumes in BHL, if I click on the other hit I'm taken to a page at the Digital Library of the Real Jardín Botánico of Madrid. This is a great resource, but the experience is a little jarring. Worse, for this journal the Real Jardín Botánico doesn't actually have any content, instead the "View Book" link takes me to SciElo in Chile, where I can see a list of recent volumes of this journal.

In this case, BHL is basically a link farm that doesn't give me direct access to content, but instead sends me on a series of hops around the Internet until I find something (and I could have gotten there more quickly via Google).

What is wrong with this?


There are two reasons I dislike what BHL have done. The first is that it breaks the experience of search then read within a consistent user interface. Now I am presented with different reading experiences, or, indeed, no reading at all, just links to where I might find something to read.

More subtlety, it undermines a nice feature of BHL, namely searching by taxonomic names. The content BHL has scanned has also been indexed by taxonomic name, so often I find what I'm looking for not by using bibliographic details (journal name, volume, etc.), which are often a bit messy, but by searching on a name. External content has not been indexed by name, so it can't be found in this way. Whereas before, if I search by name I would be reasonably confident that if BHL had something on that name I could find it (barring OCR errors), now BHL may well have what I looking for (in an external source) but can't show me that because it hasn't been indexed.

From my perpsective, the things I've come to rely on have been broken by this new feature (and I haven't even begun to talk about how this breaks things I rely on to harvest BHL for article metadata, which I then put into BioStor, which in turn gets fed back into BHL).

What should BHL have done?


To be clear, I'm not arguing against BHL being "able to accommodate references to content in other well-known repositories". Indeed, I'd wish they'd go further and incorporate content from BHL-Europe, whose portal is, frankly, a mess. Rather, my argument is that they should not have done this within the existing BHL portal. Doing so dilutes the fundamental experience of that portal ("if I find it I can read it").

Here's what I would do instead:
  1. Keep the current BHL portal as it was, with only content actually scanned and indexed by BHL.
  2. Create a new site that indexes all relevant content (e.g., BHL, BHL-Europe, and other repositories.
  3. Model this new portal on something like CrossRef's wonderful metadata search. That is, throw all the metadata into a NoSQL database, add a decent search engine, and provide users with a simple, fast tool.
  4. The portal should clearly distinguish hits that are to BHL content (e.g. by showing thumbnails) and hits that are to external links (and please filter links to links!).
  5. Add taxonomic names to the index (you have these for BHL content, adding them for external content is pretty easy).
  6. Even more useful, start indexing full text content, maybe starting at articles ("parts"). At the moment Google is doing a better job of indexing BHL content (indirectly via indexing Internet Archive) than BHL does.

Creating a new tool would also give BHL the freedom to explore some new approaches without annoying users like me who have come to rely on the currently portal working in a certain way. Otherwise BHL risks "feature creep", however well motivated.