Friday, April 15, 2011

BHL, DjVu, and reading the f*cking manual

One of the many biggest challenges I've faced with the BioStor project, apart from dealing with messy metadata, has been handling page images. At present I get these from the Biodiversity Heritage Library. They are big (typically 1 Mb in size), and have the caramel colour of old paper. Nothing fills up a server quicker than thousands of images.

A while ago started playing with ImageMagick to resize the images, making them smaller, as well as ways to remove the background colour, leaving just black text and lines on white background.

Before and after converting BHL image

I think this makes the page image clearer, as well as removing the impression that this is some ancient document, rather than a scientific article. Yes, it's the Biodiversity Heritage Library, but the whole point of the taxonomic literature is that it lasts forever. Why not make it look as fresh as when it was first printed?

Working out how to best remove the background colour takes some effort, and running ImageMagick on every image that's downloaded starts putting a lot of stress on the poor little Mac Mini that powers BioStor.

Then there's the issue of having an iPad viewer for BHL, and making it interactive. So, I started looking at the DjVu files generated by the Internet Archive, and thinking whether it would make more sense to download those and extract images from them, rather than go via the BHL API. I'll need the DjVu files for the text layout anyway (see Towards an interactive DjVu file viewer for the BHL).

I couldn't remember the command to extract images from DjVu, but I did remember that Google is my friend, which led me to this question on Stack Overflow: Using the DjVu tools to for background / foreground seperation?.

OMG! DjVu tools can remove the background? A quick look at the documentation confirmed it. So I did a quick test. The page on the left is the default page image, the page on the right was extracted using ddjvu with the option -mode=foreground.


Much, much nicer. But why didn't I know this? Why did I waste time playing with ImageMagick when it's a trivial option in a DjVu tool? And why does BHL serve the discoloured page images when it could serve crisp, clean versions?

So, I felt like an idiot. But the other good thing that's come out of this is that I've taken a closer look at the Internet Archive's BHL-related content, and I'm beginning to think that perhaps the more efficient way to build something like BioStor is not through downloading BHL data and using their API, but by going directly to the Internet Archive and downloading the DjVu and associated files. Maybe it's time to rethink everything about how BioStor is built...