Tuesday, July 02, 2024

A future for the Biodiversity Heritage Library

Following the 2024 BHL meeting, and with the departure of Martin Kalfatovic and the uncertainty that losing such a pivotal person brings, perhaps it’s time to think about the future of BHL. Below I sketch some thoughts, which are hazy at best. I should say at the outset that I think BHL is an extraordinary project. My goal is to think about ways to enhance its utility and impact.

Three facets

I think BHL, in common with other projects such as GBIF, has three main facets: providers, users, and developers. These communities have different needs, and what works for one community need not work for the others.


Providers

Any project that mobilises data depends on people and organisations that have that data being willing to share it. That community needs a rationale for sharing, tools to share, and a means to demonstrate the value of sharing. The few BHL meetings I’ve been to have been dominated by libraries (it is a library project, after all). BHL meetings typically feature a tour of physical libraries where we gaze at ancient books, many of which are now accessible via the BHL website. There is value in being a member of a club that shares similar goals (making biodiversity literature accessible to a wider audience). From my perspective, a lot of BHL effort and infrastructure is focussed on libraries and library-related tasks. This is natural given its origins, but it means other aspects have been neglected.

Users (readers and more)

BHL users are likely diverse, and range from people like me who want the “hard core” technical literature (e.g., species descriptions) to people who revel in the wealth of imagery available in BHL (AKA “the pretty”) (see the BHL Flickr pages).

The current BHL portal provides a way for people to browse the scanned content, but feels designed primarily for librarians. It is organised by title and scanned volumes, hence it is driven by bibliographic metadata. For a long time, it didn’t support the notion of an “article”, which is why I ended up building BioStor to extract and display individual articles (the unit most academics work with). BHL is now actively adding articles and minting DOIs for articles, which helps embed its content in the wider scholarly landscape. To date these new DOIs have been cited 56,000 times.

But the current BHL interface is not ideal for viewing articles. We need something simpler and cleaner, and more like the experience offered by modern journal websites.

Developers and data wranglers

I’m lumping developers and data wranglers together because, even though these people may have different goals, they share the desire to get past the web interface to the underlying data. BHL has some great APIs that I and others make extensive use of. But this is different from providing a clean interface to the data. BHL has a wealth of information linked to taxonomic names, people, places, and more. Taxonomic indexing by Global Names has made BHL content much more findable, but there is huge scope for indexing on other features. For example, BioStor extracts latitude and longitude pairs from BHL text. These are shown on the map below, indicating the scope for geographic search in BHL.
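To give a flavour of what extracting coordinates from text involves, here is a minimal sketch in Python. It is not BioStor's actual code: the regular expression, the function names, and the sample sentence are all assumptions made for illustration, and it only handles one simple degrees-and-minutes style. Real OCR text is far messier.

```python
import re

# Hypothetical pattern (an assumption, not BioStor's): degrees, optional
# minutes, then a hemisphere letter, e.g. 12°30'S, 45°15'E
COORD = re.compile(
    r"(?P<lat_deg>\d{1,2})°\s*(?:(?P<lat_min>\d{1,2}(?:\.\d+)?)['′]\s*)?(?P<lat_hem>[NS])"
    r"[,;\s]+"
    r"(?P<lon_deg>\d{1,3})°\s*(?:(?P<lon_min>\d{1,2}(?:\.\d+)?)['′]\s*)?(?P<lon_hem>[EW])"
)


def to_decimal(deg, minutes, hem):
    """Convert degrees, optional minutes, and hemisphere to decimal degrees."""
    value = int(deg) + (float(minutes) / 60 if minutes else 0.0)
    # South and west are negative by convention
    return -value if hem in "SW" else value


def extract_coordinates(text):
    """Return a list of (latitude, longitude) pairs found in text."""
    pairs = []
    for m in COORD.finditer(text):
        lat = to_decimal(m["lat_deg"], m["lat_min"], m["lat_hem"])
        lon = to_decimal(m["lon_deg"], m["lon_min"], m["lon_hem"])
        # Discard values that cannot be valid coordinates (likely OCR noise)
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            pairs.append((round(lat, 4), round(lon, 4)))
    return pairs


# Made-up sentence in the style of a locality description
sample = "Collected at 12°30'S, 45°15'E near the coast."
print(extract_coordinates(sample))
```

Pairs like these, attached to a page or article identifier, are the kind of thing a geographic search could be built on.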

What’s next?

I think there’s a case to be made to provide three separate interfaces to BHL.

The first would be for the providers (e.g., libraries), which includes all the behind the scenes infrastructure to do with cataloging, etc., and would also include the current portal. The existing BHL interface is important both to show the complete corpus, and also as a place for serendipitous discovery.

The second interface would be for readers. The obvious candidate here is Open Journal Systems (OJS) which powers many journal sites, including Zootaxa, by far the largest taxonomic journal. Indeed I would argue that BHL should adopt OJS and offer it as a service to existing biodiversity journals that may be struggling to manage their existing publishing. Taxonomic publishing has a very long tail of small journals, as the figure below shows (taken from DNA barcoding and taxonomy: dark taxa and dark texts).

This long tail is often hosted on all manner of custom web sites, including WordPress blogs, none of which are ideal. There is an opportunity here for BHL to offer hosting as, for example, an affordable service, using the same OJS infrastructure it would use to display BHL articles.

The final interface would be a data portal. The goal here is to enable people to retrieve data in ways that they find useful, for example by taxon, geographic location, etc. In an ideal world this might be a knowledge graph, but the gap between what knowledge graphs promise and what they deliver is still significant. As a first pass, probably the way forward is to define a series of simple data objects in JSON, load these into Elasticsearch and provide an API on top. This is essentially what GBIF does, where the data is in Darwin Core and the queries are searches over that data. This same infrastructure could also power searches over the articles in OJS, so that users could easily find the content they want.
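As a rough illustration of "simple data objects in JSON" plus search, here is a sketch in Python. Everything specific in it is an assumption: the field names, the index name `bhl-pages`, and the placeholder identifier and taxon are invented, and BHL may well model its data differently. The code only builds the JSON payloads (a bulk-indexing body and a query body in Elasticsearch's query DSL) and does not talk to a server.

```python
import json


def make_page_record(page_id, title, year, taxon_names, points):
    """Build one simple JSON-serialisable record (hypothetical schema)."""
    return {
        "id": page_id,
        "title": title,
        "year": year,
        "taxon_names": taxon_names,
        # Would need to be mapped as a geo_point field in the index
        "locations": [{"lat": lat, "lon": lon} for lat, lon in points],
    }


def to_bulk_ndjson(records, index_name="bhl-pages"):
    """Serialise records as newline-delimited JSON for the bulk endpoint:
    one action line followed by one source line per record."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": rec["id"]}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"


def taxon_near_query(taxon, lat, lon, distance="100km"):
    """Query body: records mentioning a taxon within a distance of a point."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"taxon_names": taxon}}],
                "filter": [
                    {
                        "geo_distance": {
                            "distance": distance,
                            "locations": {"lat": lat, "lon": lon},
                        }
                    }
                ],
            }
        }
    }


# Placeholder values only, not real BHL data
record = make_page_record("page-1", "Example journal", 1905, ["Aus bus"], [(-12.5, 45.25)])
print(to_bulk_ndjson([record]))
print(json.dumps(taxon_near_query("Aus bus", -12.0, 45.0), indent=2))
```

The point of keeping the objects this flat is that the same records could be served directly by an API and reused for search over the articles.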

This is all pretty arm-wavy at this point, but I think BHL needs to be more outward-facing than it currently is, and needs to think about how best to serve the biodiversity community (many of whom are already huge fans of BHL), as well as think of ways to enhance its long-term sustainability.
