Wednesday, January 13, 2010

BioStor so far

My BioStor project has reached over 13,000 articles, making it a sizeable repository of open access articles on biodiversity. It's still a tiny fraction of what could be extracted from the Biodiversity Heritage Library (BHL), but perhaps it's worth taking stock of what's there.


One pleasing discovery is that, despite the 1923 cut-off due to U.S. copyright, BHL contains a lot of post-1923 articles. Indeed, a sparkline of the number of articles over time shows that the bulk of the articles I've extracted from BHL are from the second half of the twentieth century. These include journals such as Entomological News, and major herbaria and museum publications, such as Annals of the Missouri Botanical Garden, Breviora, Bulletin of the British Museum (Natural History). Zoology, and Proceedings of the United States National Museum.

This distribution partly reflects my own biased harvesting of BHL, but perhaps it also reflects a transition in publishing (from books and monographs to journals).
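For anyone curious how a sparkline like this can be put together, here is a minimal sketch in Python. It is not the code BioStor uses: the function name, the ten-year bins, and the list of years are all invented for illustration. It simply counts articles per decade and draws one Unicode block character per bin.

```python
from collections import Counter

# Eight block characters, from shortest to tallest bar.
BARS = "▁▂▃▄▅▆▇█"


def sparkline(years, start, end, step=10):
    """Bin publication years into `step`-year buckets and draw one bar per bucket.

    `start`, `end` and `step` are illustrative parameters, not BioStor settings.
    """
    # Count how many years fall into each bucket, ignoring out-of-range years.
    counts = Counter((y - start) // step for y in years if start <= y <= end)
    n_bins = (end - start) // step + 1
    values = [counts.get(i, 0) for i in range(n_bins)]
    # Scale against the tallest bucket; avoid dividing by zero when all are empty.
    peak = max(values) or 1
    return "".join(BARS[(v * (len(BARS) - 1)) // peak] for v in values)


# Made-up publication years, purely to exercise the function.
example_years = [1895, 1910, 1912, 1935, 1951, 1953, 1958,
                 1962, 1964, 1967, 1969, 1971]
print(sparkline(example_years, 1890, 1979))
```

Scaling each bar against the busiest decade keeps the shape readable whatever the absolute counts are, which is really all a sparkline needs to do.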


The interface to BioStor is still pretty rough, but one thing I've had fun adding is the SIMILE Exhibit widget to the author pages. Exhibit enables faceted browsing, so you can filter an author's list of publications by the journal they were published in, the year of publication, and who the coauthors were.


One use I've found is that you can quickly filter out purely nomenclatural papers by selecting an author's publications in the Bulletin of Zoological Nomenclature.
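The idea behind faceted browsing is simple enough to sketch in a few lines of Python. To be clear, this is not how Exhibit works internally (Exhibit is a JavaScript widget), and the records, field names, and function names below are hypothetical. It only shows the two operations a faceted browser needs: counting the values of a field, and narrowing the list to the selected values.

```python
from collections import Counter

# Hypothetical publication records for one author; titles and values are made up.
papers = [
    {"title": "Paper A", "journal": "Bulletin of Zoological Nomenclature",
     "year": 1965, "coauthors": []},
    {"title": "Paper B", "journal": "Breviora",
     "year": 1965, "coauthors": ["Coauthor X"]},
    {"title": "Paper C", "journal": "Bulletin of Zoological Nomenclature",
     "year": 1970, "coauthors": ["Coauthor X", "Coauthor Y"]},
]


def _as_list(value):
    """Treat single values and lists of values uniformly."""
    return value if isinstance(value, list) else [value]


def facet_counts(records, field):
    """Count how many records carry each value of `field` (the facet sidebar)."""
    counts = Counter()
    for record in records:
        counts.update(_as_list(record[field]))
    return counts


def apply_facets(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [
        record for record in records
        if all(wanted in _as_list(record[field])
               for field, wanted in selected.items())
    ]


nomenclatural = apply_facets(papers, journal="Bulletin of Zoological Nomenclature")
print([p["title"] for p in nomenclatural])
print(facet_counts(papers, "coauthors"))
```

Selecting a journal, a year, and a coauthor at once is just a matter of passing several keyword arguments, each of which narrows the list further.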


I'm a fan of sparklines, and I'm starting to add some to various pages. For example, you can now see a graph of how many papers an author has published over time; the example below is for the prolific Charles P Alexander:


This pretty much sums up search in BioStor at the moment:


The rate of growth of BioStor has been pretty steady, partly because I keep managing to find (or am alerted to, or am given) sources of bibliographic metadata. I'm also adding individual papers as I come across them, either using the Reference finder form directly, or by using OpenURL from within EndNote. But there's still a crying need for large bibliographies that can be harvested.
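For readers who haven't met OpenURL, it boils down to describing a reference as key-value pairs in a query string. Here is a rough sketch in Python of what such a request looks like. The resolver address is a placeholder, the reference details are invented, and I'm assuming the familiar OpenURL 0.1-style keys (genre, title, volume, spage, date), so treat this as an illustration rather than a description of what BioStor's resolver accepts.

```python
from urllib.parse import urlencode

# Placeholder resolver address, not a real endpoint.
RESOLVER = "http://example.org/openurl"


def openurl_query(journal, volume, start_page, year, resolver=RESOLVER):
    """Build an OpenURL-style link for a journal article.

    The keys follow the OpenURL 0.1 convention; which keys a particular
    resolver understands is an assumption here.
    """
    params = {
        "genre": "article",
        "title": journal,     # journal title
        "volume": volume,
        "spage": start_page,  # first page of the article
        "date": year,
    }
    return resolver + "?" + urlencode(params)


# Invented reference details, purely for illustration.
print(openurl_query("Example Journal", 12, 345, 1950))
```

The appeal of this approach is that any reference manager that can fill in a URL template can hand a citation to a resolver without needing to know anything else about it.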


Individual references can be exported in a range of formats, and there is an RSS feed, but I need to add bulk export (including the ability to dump the whole database).


The viewer I developed earlier is working reasonably well, but could do with sprucing up. I'm also toying with using other viewers as well as (or instead of) mine. One possibility is iPaper (although this is Flash-based, and I'd like to avoid Flash wherever possible).

Taxonomic names

One thing I've discovered to my horror is just how common it is for the BHL taxon finder to miss obvious taxonomic names. I need to investigate this a bit more, but this is potentially a major issue, as taxonomic indexing is one of the primary ways people will find content in BioStor.


There's still so much to do. For one thing, the visualisations that originally got me seriously thinking about playing with BHL aren't in BioStor yet. And there's a long to-do list.