Monday, December 20, 2010

BioStor one year on: has it been a success?

One year ago I released BioStor, which scratched my itch regarding finding articles in the Biodiversity Heritage Library. This anniversary seems to be a good time to think about where next with this project, but also to ask whether it's been successful. Of course, this rather hinges on what I mean by "success." I've certainly found BioStor to be useful, both the experience of developing it, and actually using it. But it's time to be a little more hard-headed and look at some stats. So I'm going to share the Google Analytics stats for BioStor. Below is the report for Dec 20, 2009 to Dec 19, 2010, as a PDF.


BioStor had 63,824 visits over the year, and 197,076 pageviews. After an initial flurry of visits on its launch the number of visitors dropped off, then slowly grew. Numbers dipped during the middle of the year, then started to climb again.

To judge whether these numbers are a little or a lot, it would be helpful to compare them with data from other biodiversity sites. Unfortunately, nobody seems to be making this information readily available. There is a slide in a BHL presentation that shows BHL having had more than 1 million visits since January 2008, and in March 2010 it was receiving around 3000 visits per day, which is an order of magnitude greater than the traffic BioStor is currently getting. For another comparison, I looked at Scratchpads, which currently comprise 193 sites. In November 2007 Scratchpads had 43,379 pageviews altogether; in November 2010 BioStor had 17,484 pageviews. For the period May-October 2009 Scratchpads had 74,109 visitors; for the equivalent period in 2010 BioStor had 28,110. So BioStor is getting somewhat more than a third (roughly 40%) of the traffic of the entire Scratchpads project.
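For anyone who wants to check the arithmetic behind these comparisons, here is a minimal Python sketch. It uses only the figures quoted in this post; the variable and function names are my own, and the BHL comparison is rough because it sets a single month's daily rate for BHL against BioStor's average over the whole year.

```python
# Figures quoted in the post. The comparison periods are not identical,
# so treat the resulting ratios as rough indicators only.
SCRATCHPADS_PAGEVIEWS = 43_379   # Scratchpads, November 2007
BIOSTOR_PAGEVIEWS = 17_484       # BioStor, November 2010
SCRATCHPADS_VISITORS = 74_109    # Scratchpads, May-October 2009
BIOSTOR_VISITORS = 28_110        # BioStor, May-October 2010
BIOSTOR_VISITS_YEAR = 63_824     # BioStor, Dec 20, 2009 to Dec 19, 2010
BHL_VISITS_PER_DAY = 3000        # BHL, approximate rate in March 2010


def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


pageview_share = percent_of(BIOSTOR_PAGEVIEWS, SCRATCHPADS_PAGEVIEWS)
visitor_share = percent_of(BIOSTOR_VISITORS, SCRATCHPADS_VISITORS)

# Average BioStor visits per day over the 365-day reporting period.
biostor_visits_per_day = BIOSTOR_VISITS_YEAR / 365

# How many times more daily visits BHL had than BioStor's yearly average.
bhl_to_biostor = BHL_VISITS_PER_DAY / biostor_visits_per_day

print(f"BioStor pageviews as % of Scratchpads: {pageview_share:.1f}")
print(f"BioStor visitors as % of Scratchpads:  {visitor_share:.1f}")
print(f"BioStor visits per day (average):      {biostor_visits_per_day:.0f}")
print(f"BHL / BioStor daily visits:            {bhl_to_biostor:.1f}")
```

Both Scratchpads ratios come out at a little under or a little over 40%, and BHL's daily traffic works out at well over ten times BioStor's average.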

Bounce rate

One of the more interesting charts is "Bounce rate", defined by Google as

"Bounce rate is the percentage of single-page visits or visits in which the person left your site from the entrance (landing) page."

The bounce rate for BioStor is pretty constant at around 65%, except for two periods in March and June, when it plummeted to around 20%. This corresponds to when I set up a Wikisource installation for BioStor so that the OCR text from BHL could be corrected. Mark Holder ran a student project that used the BioStor wiki, so I'm assuming that the drop in bounce rate reflects Mark's students spending time on the wiki. BHL OCR text would benefit from cleaning, but I'm not sure Wikisource is the way to do it, as it feels a little clunky. Ideally I'd like to build upon the interactive DjVu experiments to develop a user-friendly way to edit the underlying OCR text.

Is it just my itch?
Every good work of software starts by scratching a developer's personal itch - Eric S. Raymond, The Cathedral and the Bazaar

Looking at traffic by city, Glasgow (where I'm based) is the single largest source of traffic. This is hardly surprising, given that I wrote BioStor to solve a problem I was interested in, and the bulk of its content has been added by me using various scripts. This raises the possibility that BioStor has an active user community of *cough* one. However, looking at traffic by country, the UK is prominent (due to traffic primarily from Glasgow and London), but more visits come from the US. It seems I didn't end up making this site just for me.

Google search
Another measure of success is Google search rankings, which I've used elsewhere to compare the impact of Wikipedia and EOL pages. As a quick experiment I Googled the top ten journals in BioStor and recorded where in the search results BioStor appeared. For all but the Biological Bulletin, BioStor appeared in the top ten (i.e., on the first page of results):

Journal                                                     Google rank of BioStor page
Biological Bulletin                                         12
Bulletin of Zoological Nomenclature                         6
Proceedings of the Entomological Society, Washington        6
Proc. Linn. Soc. New South Wales                            3
Annals of the Missouri Botanical Garden                     3
Tijdschr. Ent.                                              2
Transactions of The Royal Entomological Society of London   6
Ann. Mag. nat. Hist                                         3
Notes from the Leyden Museum                                5
Proceedings of the United States National Museum            4

This suggests that BioStor's content is at least findable.

Where next?
The sense I'm getting from these stats is that BioStor is being used, and it seems to be a reasonably successful, small-scale project. It would be nice to play with the Google Analytics output a bit more, and also explore usage patterns more closely. For example, I invested some effort in adding the ability to create PDFs for BioStor articles, but I've no stats on how many PDFs have been downloaded. Metadata in BioStor is editable, and edits are logged, but I've not explored the extent to which the content is being edited. If a serious effort is going to be made to clean up BHL content using crowdsourcing, I'll need to think of ways to engage users. The wiki experiments were a step in this direction, but I suspect that building a network around this task might prove difficult. Perhaps a better way is to build the network elsewhere, then try to engage it with this task (OCR correction). This was one reason behind my adopting Mendeley's OAuth API to provide a sign-in facility for BioStor (see Mendeley connect). Again, I've no stats on the extent to which this feature of BioStor has been used. Time to give some serious thought to what else I can learn about how BioStor is being used.