Wednesday, November 11, 2009

Tag trees: displaying the taxonomy of names in BHL

I've added a feature to my Biodiversity Heritage Library viewer that should help make sense of the names found on a page. Until now I've displayed them as a list of "tags", which ignores the relations among the names. Based on some code I'd developed for my e-Biosphere 09 challenge entry I've added a "tag tree" that displays the classification of the names found on a BHL page:


The idea is that a set of names can make much more sense if you know what kind of organism they are referring to. For example, I don't know what Onetes is, but if I look at BHL page 2298380 I can see that it's an insect:


The names in gray don't occur on the page, but do occur in the tree that links those names (the latter are highlighed in black). The tag tree can be useful for separating out host and parasite, e.g. BHL page 2298491 is about a flea and it's mammalian hosts:


The tag tree can also flag names that might be mistaken, such as those found on page 2298330:


This page has names of some grasshoppers from Madagascar, as well as the name of a butterfly (Tsaratanana), which seems a little odd. Looking at the text, we discover that "Tsaratanana" is Mont. Tsaratanana a mountain in Madagascar. It would be fun to develop tools to annotate such cases so that somebody looking for the butterfly won't be presented with this page.

How it works

The inspiration for this tag tree came from several sources. David Remsen has often used an example of finding a fly name in the middle of a book on birds as being of interest, and the NCBI have a subtree view of taxa in a PubMed article. My own tag tree is constructed by finding for each name the ancestor-descendant path in a local, modified copy of the Catalogue of Life database, then assembling those paths into a tree. Because not all the names on a BHL page are in the Catalogue of Life, there may be names that aren't classified. These are simply listed below the tag tree (see image above).

Monday, November 09, 2009

iTaxon screencast

Sadly I won't be at TDWG 2009, at least not in person. However, there is a session on wikis, which may contain this brief screencast of my iTaxon experiments. The screencast was made in haste, but tries to convey some of the ideas behind these experiments, especially the idea that by linking data together we can generate more interesting and rich views of objects such as scientific publications. The screencast starts with the The amphibian tree of life page.

Thursday, November 05, 2009

BHL Viewer now with go faster stripes

One of the more glaring limitations of my BHL viewer described in the previous post is that it can take a while to load all the page thumbnails (there can be hundreds). Given that one of the original motivations for this project was a faster viewer, this kinda sucks. What I'd like to do is load the thumbnails only when I need them, rather than all at once at the start -- in other words I'd like to implement lazy loading.

I'm using the Prototype Javascript library, and to my delight Bram Van Damme has written lazierLoad, inspired by Lazy Load for JQuery. lazierLoad works by attaching a listener to each image that listens for scroll events -- when the browser window scrolls each image receives a notification event and works out whether it needs to load the image. In theory, all you do is add the lazierLoad Javascript to your page, and only images that are currently visible will be fetched from the server. I say "in theory" because I needed to tweak the script a little because the thumbnails are inside a DIV element that has it's own scrollbar (thanks to the CSS style overflow:auto). Hence I needed to add the listener to this DIV, and compute coordinates for the image taking the DIV into account. Like most things, easy once you know how (translation, after numerous failed attempts, and the occasional "doh!" it seems to work).

You can see lazy loading in action if you view a BHL item, such as Item 26140. Note that this implementation of lazy loading doesn't work in Safari, much to my chagrin (it's my default browser). It works fine in Firefox

Tuesday, November 03, 2009

Biodiversity Heritage Library viewer experiments

In between the chaos that is term-time I've been playing with ways to view Biodiversity Heritage Library content. The viewer is crude, and likely to go off-line at any moment while I fuss with it, the you can view an example here. This link takes you to a display of BHL item 19513, which is volume 12 of the Bulletin of the British Museum (Natural History) Entomology, which includes some striking insects, such as the species of Lopheuthymia displayed below:


The viewer attempts to do several things:
  1. Display a BHL document in a way that I can quickly scan pages looking for article boundaries by showing thumbnails
  2. Provide a simple way for me to edit metadata (if you find the start of an article you can click on the is the first page in an article and edit the article details)
  3. Provide a RIS dump of the articles which can then be uploaded into other bibliographic tools

Note that on the left hand side I'm displaying the articles that have already been found in the volume. The editing interface is crude, and I'll need to look at user authentication and versioning to do this seriously, but it seems a quick way to annotate BHL content. Much of this needs only be done once, as once we have article boundaries then searching BHL journal content using, say, OpenURL becomes easy, and we can link bibliographic records in nomenclators to BHL. Improving the metadata will also improve visualisation such as my BHL timeline.

I hope to play a bit more with the view over the next days and weeks. It's pretty simple (Javascript with PHP back end). The key to creating the viewer was a complete dump of BHL's page metadata kindly provided by Mike Lichtenberg. I use this to locate individual page images stored by the Internet Archive, which I then store locally (and generate 100 pixel wide thumbnails).

Potentially there's a lot more one could add to a tool like this. I'm playing with displaying the taxon names found by uBio so that I can flag instances where the page is where the name is first published. One could imagine adding other flags, such as when a taxon is depicted so that we could easily extract images of taxa from BHL content. It would also be nice to be able to add taxon names that uBio's algorithms have missed. Utlimately, one could even display the OCR text and correct/annotate that. Much to do...