Saturday, October 24, 2020

Visualising article coverage in the Biodiversity Heritage Library

It's funny how some images stick in the mind. A few years ago Chris Freeland (@chrisfreeland), then working for the Biodiversity Heritage Library (BHL), created a visualisation of BHL content relevant to the African continent. It's a nice example of small multiples.

For more than a decade (gulp) I've been extracting articles from the BHL and storing them in BioStor. My approach is to locate articles based on metadata (e.g., information on title, volume, and pagination) and store the corresponding set of BHL pages in BioStor. BHL in turn regularly harvests this information and displays these articles as "parts" on their web site. Most of this process is semi-automated, but still requires a lot of manual work. One thing I've struggled with is getting a clear sense of how much progress has been made, and how much remains to be done. This has become more pressing given work I'm doing with Nicole Kearney (@nicolekearney) on Australian content. Nicole has a team of volunteers feeding me lists of article metadata for journals relevant to Australian biodiversity, and it would be nice to see where the remaining gaps are.
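To make the "metadata to pages" step concrete, here is a minimal sketch of how pagination might be mapped onto the scanned pages of an item. This is not BioStor's actual code: the function name, the `(page_id, printed_number)` tuples, and the example IDs are all hypothetical, and it assumes the printed page numbers for each scan are already known.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a scanned item: one (page_id, printed_number)
# tuple per scan, in scan order. printed_number is None for unnumbered scans
# such as plates or blank pages.
ItemPage = Tuple[int, Optional[str]]


def pages_for_article(item_pages: List[ItemPage], first: str, last: str) -> List[int]:
    """Return the page IDs spanning an article's first to last printed page.

    Unnumbered scans that fall between the two are included, on the
    assumption that they belong to the article. Returns an empty list if
    either page number cannot be found.
    """
    start = next(
        (i for i, (_, number) in enumerate(item_pages) if number == first), None
    )
    if start is None:
        return []
    end = next(
        (i for i in range(start, len(item_pages)) if item_pages[i][1] == last), None
    )
    if end is None:
        return []
    return [page_id for page_id, _ in item_pages[start : end + 1]]


# Made-up example item: six scans, two of them unnumbered.
example_item = [(101, None), (102, "1"), (103, "2"), (104, None), (105, "3"), (106, "4")]
article_pages = pages_for_article(example_item, "2", "3")
```

In practice page numbering is messier than this (repeated numbers, roman numerals, missing numbers), which is a large part of why the process still needs manual work.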

So, motivated by this, and also a recent comment by Nicky Nicolson (@nickynicolson) about the lack of such visualisations, I've put together a crude tool to try and capture the current state of coverage. You can see the results here:

As an example, here is v.25:pt.2 (1988) of Memoirs of the Queensland Museum. Each contiguous block of colour highlights an article in this volume:

This item-level view is constructed for each scanned item (typically a volume or an issue). I then generate a PNG thumbnail of this display for each item, and display the thumbnails together on a page for the corresponding journal (e.g., Memoirs of the Queensland Museum):

So at a glance we can see the coverage for a journal. Grey represents pages that have not been assigned to an article, so if you want to add articles to BHL, the volumes with the most grey are the ones to focus on.
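The logic behind these displays is simple enough to sketch. What follows is an illustration rather than the code behind the tool: the palette, the function names, and the representation of an article as a pair of zero-based scan positions are all assumptions made for the example.

```python
from typing import List, Tuple

GREY = "#cccccc"  # pages not assigned to any article
# Arbitrary palette; cycling through it means neighbouring articles differ.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def page_colours(n_pages: int, articles: List[Tuple[int, int]]) -> List[str]:
    """Assign a colour to every page of an item.

    `articles` holds (first, last) zero-based scan positions, inclusive.
    Positions outside the item are ignored.
    """
    colours = [GREY] * n_pages
    for i, (first, last) in enumerate(articles):
        colour = PALETTE[i % len(PALETTE)]
        for position in range(max(first, 0), min(last, n_pages - 1) + 1):
            colours[position] = colour
    return colours


def coverage(colours: List[str]) -> float:
    """Fraction of pages that belong to some article."""
    if not colours:
        return 0.0
    return sum(colour != GREY for colour in colours) / len(colours)


# A made-up ten-page item with two articles and four unassigned pages.
example_colours = page_colours(10, [(0, 2), (3, 5)])
example_coverage = coverage(example_colours)
```

Drawing one small rectangle per page in its colour gives the item-level view, and the coverage fraction offers a quick way to rank volumes by how much work remains.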

There's an argument to be made that it is crazy to spend a decade extracting some 226,000 articles semi-automatically. Ideally we could use something like machine learning to identify articles in BHL. It would be a huge time saver if we could simply run a scanned BHL item through a tool that could (a) extract article-level metadata and (b) associate that with the corresponding set of scanned pages. One possible approach would be to develop a classifier that assigns each page in a scanned volume to a category such as "article start", "article end", "article", "plate", etc. In effect, we want a tool that could segment scans into articles (and hence reproduce the coverage diagrams shown above) based simply on attributes of the page images. This doesn't solve the entire problem, because we would still need to extract metadata (e.g., article titles), but it would be a start. However, it poses the classic dilemma: do I keep doing this manually, or do I stop adding articles and take the time to learn a new technology in the hope that eventually I will end up adding more articles than if I'd persisted with the manual approach?
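The classifier itself is the hard part, but the step after it is easy to sketch. Assuming some model has already produced one label per scanned page, the labels can be turned into article page ranges along these lines. The function name and the "other" label are made up for the example, and the rule that a new "article start" implicitly closes the previous article is my assumption, not something a real classifier would guarantee.

```python
from typing import List, Tuple


def segment_articles(labels: List[str]) -> List[Tuple[int, int]]:
    """Turn per-page labels into (first, last) zero-based page ranges.

    An article opens at "article start" and closes at "article end". If a
    new "article start" appears while an article is still open, the open
    article is closed on the preceding page. Any other label ("article",
    "plate", ...) leaves the current state unchanged, so plates inside an
    open article stay with that article.
    """
    articles = []
    start = None
    for position, label in enumerate(labels):
        if label == "article start":
            if start is not None:
                articles.append((start, position - 1))
            start = position
        elif label == "article end" and start is not None:
            articles.append((start, position))
            start = None
    if start is not None:
        # The item ended while an article was still open.
        articles.append((start, len(labels) - 1))
    return articles


# Made-up labels for an eight-page item.
example_labels = [
    "other", "article start", "article", "article end",
    "article start", "plate", "article start", "article",
]
example_ranges = segment_articles(example_labels)
```

The resulting ranges are exactly what is needed to colour a coverage diagram, so predicted and manually curated diagrams could be compared side by side to judge how well such a classifier performs.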