Friday, April 11, 2025

Future interfaces for the Biodiversity Heritage Library

On Wednesday this week (April 9th, 2025) I gave a talk entitled “Future interface(s) for BHL” (the slides are on FigShare) at BHL Day 2025. My goal was to introduce “BHL-Light”, an exploration of an alternative interface to the Biodiversity Heritage Library (BHL). As some readers may already know, BHL is approaching a crossroads, so this presentation felt a bit more urgent than my usual “here’s yet another web site I made”.

BHL-Light

BHL-Light is my attempt to explore other ways of navigating BHL. The current interface is somewhat dated, and I wanted to start from scratch and see what might be possible to create, even for someone with my somewhat limited skills. BHL-Light has only a very small subset of BHL’s content; I’m putting scalability issues to one side so that I can have some fun.

The tech (TL;DR BHL was not harmed in the making of this)

Under the hood, BHL-Light stores BHL metadata, OCR text, and layout information as JSON documents in CouchDB (one of my favourite databases for exploring new ideas).
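To make this concrete, here is a minimal sketch (in PHP, like the rest of the site) of pulling a page document out of CouchDB. The database name, document id, and document shape are my own invented examples, not a real BHL schema:

```php
<?php
// Minimal sketch: fetch a page document from a local CouchDB instance.
// The database name ("bhl-light"), document id, and document shape below
// are invented for illustration, not a real BHL schema.
$couch  = 'http://localhost:5984/bhl-light';
$doc_id = 'page-12345';

$page = json_decode(file_get_contents($couch . '/' . rawurlencode($doc_id)), true);

// A document might bundle everything needed to render one page, e.g.:
// {
//   "_id": "page-12345",
//   "ItemID": 100000,
//   "OcrText": "...",
//   "blocks": [ { "type": "figure", "bbox": [x, y, w, h] } ]
// }
echo $page['OcrText'] ?? '';
```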

BHL serves its images from Internet Archive, which is not always available. BHL recently uploaded images to AWS, but the images there are not currently viewable on the web. So I ended up creating my own image server. I used Hetzner’s S3-compatible object storage for the image files, added imgproxy to resize images as needed, and finally put all this behind a Cloudflare CDN to speed up image delivery (and reduce traffic to the S3 store, which becomes a real consideration when one is paying for all of this).
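For the curious, this is roughly what serving an image involves: imgproxy expects a signed URL, so the PHP glue computes an HMAC over the processing path. A minimal sketch, with placeholder keys, hostname, and bucket path:

```php
<?php
// Minimal sketch of building a signed imgproxy URL. imgproxy expects
// base64url(HMAC-SHA256(key, salt . path)) as the first path segment.
// The key, salt, hostname, and bucket path are all placeholders.
$key  = hex2bin('943b421c9eb07c830af81030552c86009268de4e532ba2ee2eab8247c6da0881');
$salt = hex2bin('520f986b998545b4785e0defbc4f3c1203f22de2374a3d53cb7a7fe9fea309c5');

// Resize the page image to 800px wide, preserving the aspect ratio.
$path = '/rs:fit:800:0/plain/s3://bhl-light/pages/12345.jpg';

$signature = rtrim(strtr(base64_encode(
    hash_hmac('sha256', $salt . $path, $key, true)
), '+/', '-_'), '=');

echo 'https://images.example.org/' . $signature . $path;
```

Cloudflare then caches the resulting URLs, so repeat requests never touch imgproxy or the S3 store.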

To view BHL content (e.g., books, journal volumes) I wrote my own viewer, modelled loosely on Google Books. I expressly wanted to avoid IIIF because I find IIIF viewers a terrible way to view documents, and for me BHL is all about the text.

The web site itself is a few PHP scripts to glue everything together, and I’ve tried to avoid using JavaScript unless absolutely necessary. HTML + CSS is really powerful these days, so you can do a lot without resorting to JavaScript.

Tour

In building BHL-Light I wanted a simple, clean interface that concentrates as much as possible on displaying content, and one that is responsive (AKA "mobile friendly").

BHL has some extraordinary content. It has works both old and new.

Text

The viewer I built makes it easy to scroll through an item, and also makes text selectable (something you currently can’t do in BHL). This means you can interact with the text in the browser, for example using Google Chrome to translate part of it.
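I’m glossing over details here, but one common trick for selectable text is to lay transparent OCR’d words absolutely over the page image, so the browser’s native selection just works. A sketch, with invented coordinates and URLs:

```php
<?php
// Sketch of the overlay trick: render each OCR'd word as transparent text
// absolutely positioned over the page image, so native browser selection
// works. $page_image_url and $words are hypothetical inputs, e.g. taken
// from the layout information stored in CouchDB.
$page_image_url = 'https://images.example.org/pages/12345.jpg';
$words = [
    ['text' => 'Carcharodon', 'x' => 120, 'y' => 80, 'w' => 180, 'h' => 24],
    ['text' => 'carcharias',  'x' => 310, 'y' => 80, 'w' => 150, 'h' => 24],
];

echo '<div style="position:relative">';
echo '<img src="' . htmlspecialchars($page_image_url) . '" alt="page">';
foreach ($words as $w) {
    printf(
        '<span style="position:absolute;left:%dpx;top:%dpx;width:%dpx;height:%dpx;color:transparent">%s</span>',
        $w['x'], $w['y'], $w['w'], $w['h'],
        htmlspecialchars($w['text'])
    );
}
echo '</div>';
```

Because the text layer is real DOM text, things like Chrome’s translation and Hypothes.is annotation come for free.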

It also opens up the possibility of annotation using Hypothes.is.

Geotagging and maps

I also demonstrated pages that had been geotagged. These tags can be extracted and used to create an interactive map.

I still haven’t decided on the best way to interact with the map. For example, should we use the map to search for content geographically, or should we search for content and display those results on a map, or both? I ran out of time to resolve this, so for now if you click on the map you see an H3 hexagon that encloses where you clicked. The idea is that the page would then display BHL content within that area. Other ideas include something like Frankenplace or JournalMap.
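As a sketch of the first option (map as search), the click could yield an H3 cell id (computed in the browser, e.g. with h3-js), which the server then uses to look up geotagged pages. The CouchDB view name and keys here are hypothetical:

```php
<?php
// Sketch: given an H3 cell id (computed from the map click in the browser,
// e.g. with h3-js), ask CouchDB for pages geotagged within that cell.
// The view name ("geo/by_h3") is hypothetical; the idea is a map function
// that emits each geotagged page's H3 cell index as its key.
$cell = $_GET['h3'] ?? '872a1072bffffff'; // example cell id

$url = 'http://localhost:5984/bhl-light/_design/geo/_view/by_h3'
     . '?key=' . rawurlencode(json_encode($cell));

$result = json_decode(file_get_contents($url), true);
foreach ($result['rows'] as $row) {
    echo $row['id'], "\n"; // pages to display for this hexagon
}
```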

Document layout

For me one of the most exciting areas for the future is adding document layout information to BHL content, such that we can identify not only articles, but also figures, tables, references, etc. In this way BHL could finally offer something akin to what Plazi can deliver: structured text about species. This has seemed a challenging task, but recent AI developments have been a game changer. In particular, Datalab have released powerful and simple-to-use tools that do a very good job of retrieving document structure from scanned pages. I have started to use this on BHL content and display the results on BHL-Light. For example, Datalab makes it almost trivial to identify and extract figures from scanned pages. Below is a comparison of document layout for a page as inferred from a born-digital PDF by Plazi, and the same page in BHL where it is simply an image, but Datalab’s methods have inferred which bits are text, figures, captions, etc.

One unexpected consequence of building my own image server (see above) is that the task of displaying figures by cropping page images becomes almost trivial. This idea was inspired in part by Smits et al.’s approach of cropping Internet Archive images.
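Concretely, once the layout step yields a figure’s bounding box, extracting the figure is just a crop expressed in the image URL; nothing needs to be re-stored. A sketch, with made-up box values, paths, and keys:

```php
<?php
// Sketch: crop a figure out of a page image via imgproxy's crop option,
// driven by a bounding box (in pixels) from the layout analysis.
// The box values, bucket path, and key/salt are all placeholders;
// signing works exactly as in the earlier image-server snippet.
function sign_path(string $path, string $key, string $salt): string {
    return rtrim(strtr(base64_encode(
        hash_hmac('sha256', $salt . $path, $key, true)
    ), '+/', '-_'), '=');
}

$key  = hex2bin('943b421c9eb07c830af81030552c86009268de4e532ba2ee2eab8247c6da0881');
$salt = hex2bin('520f986b998545b4785e0defbc4f3c1203f22de2374a3d53cb7a7fe9fea309c5');

$bbox = ['x' => 150, 'y' => 620, 'w' => 700, 'h' => 480]; // hypothetical figure box

// crop:<width>:<height>:nowe:<x>:<y> crops relative to the top-left corner.
$path = sprintf(
    '/crop:%d:%d:nowe:%d:%d/plain/s3://bhl-light/pages/12345.jpg',
    $bbox['w'], $bbox['h'], $bbox['x'], $bbox['y']
);

echo 'https://images.example.org/' . sign_path($path, $key, $salt) . $path;
```

Because the crop is just a URL, the CDN caches each extracted figure like any other image.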

What’s next?

There is much to do. BHL-Light is missing many features. It doesn’t make it easy to find content such as articles found by BioStor, the project I started over a decade ago to find articles in BHL. Search is rudimentary at best, and I haven’t tackled taxonomic names yet (but have ideas for this).

For me BHL-Light is a fun way to explore BHL, and its development has made me even more aware of all the work done to create and maintain the current BHL portal. Apart from being a plaything for me, I am curious as to whether BHL-Light might be a way to have “BHL-mini” portals, rather like GBIF hosted portals. In this way, we could have views of BHL focused on a particular taxon, institution, person, etc., or localised by language and/or country. Perhaps we could de-extinct past projects such as BHL-Europe?

References

Page, R. D. M. (2011). Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics, 12, 187. https://doi.org/10.1186/1471-2105-12-187

Page, R. D. M. (2025). Future interface(s) for BHL. figshare. Presentation. https://doi.org/10.6084/m9.figshare.28777868.v1

Smits, T., Warner, B., Fyfe, P., & Lee, B. C. G. (2025). A Fully-Searchable Multimodal Dataset of the Illustrated London News, 1842–1890. Journal of Open Humanities Data, 11(1), 10. https://doi.org/10.5334/johd.284


Wednesday, February 26, 2025

BOLD View: exploring DNA barcodes

For a while now I’ve been exploring ways to navigate through DNA barcodes. Over the years I’ve built various “toys” to explore barcodes, such as Displaying a million DNA barcodes on Google Maps using CouchDB; I also built a small-scale browser using Elasticsearch that had some success, and discovered that Postgres can search for DNA sequences and it’s really fast. At the same time, I’ve bemoaned the challenges of getting barcode data into GBIF, and the current state of BOLD’s data exports.
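On the Postgres point: one way to make substring search over sequences fast is a trigram (pg_trgm) index, which accelerates `LIKE '%...%'` queries. This sketch uses invented table and column names:

```php
<?php
// Sketch: fast DNA substring search in Postgres via a trigram index.
// Table and column names ("barcodes", "nucleotides") are hypothetical.
// One-off setup (run once in psql):
//   CREATE EXTENSION IF NOT EXISTS pg_trgm;
//   CREATE INDEX barcodes_trgm ON barcodes USING gin (nucleotides gin_trgm_ops);
$pdo = new PDO('pgsql:host=localhost;dbname=bold');

$fragment = 'TTTATACTTTATTTTTGG'; // query fragment from the user
$stmt = $pdo->prepare(
    'SELECT processid FROM barcodes WHERE nucleotides LIKE :pattern LIMIT 20'
);
$stmt->execute([':pattern' => '%' . $fragment . '%']);

foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $processid) {
    echo $processid, "\n";
}
```

The GIN trigram index is what makes the leading wildcard affordable; a plain B-tree index cannot help with that kind of query.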

Over the last few months I’ve been getting a project to the point where it’s usable, and today I’ve released a live version called BOLD view. Why make a portal to DNA barcodes when BOLD themselves have recently released a new version of their own portal, you might ask? There are two reasons. First, making my own forces me to explore the barcode data in some detail, which is eye-opening in places. Second, I want to be able to explore the barcode data at various levels and in different ways. For example, I want an interactive global map of barcodes.

I want to see a DNA barcode in context, including a phylogeny that includes barcodes both within and outside the BIN the barcode belongs to.

I want to make the imagery more visible.

I want to be able to navigate the taxonomy underlying the barcodes using tools such as summary trees.

I want to be able to input a DNA search and quickly search for matches.

I also want to be able to connect the barcodes to the science behind them (who created the barcodes and what questions were they addressing?).

Above all, I just want to be able to explore the data. I don’t want donut charts and dashboards. I want to be able to see the data and the connections. There is still much to be done; in particular, I want to visualise sequence alignments. We can have a global map and a global taxonomy, so where is the global alignment?

I hope to work on BOLD view further, but for now it is out the door and my spotlight will inevitably turn elsewhere.
