iPhylo: January 2010

Roderic D. M. Page

Friday, January 29, 2010

Why I want an iPad

OK, first of all, I want one, I want one real bad.

There's been a general sense of disappointment about the iPad, which I suspect is only natural given the enormous hype leading up to the announcement, as well as the fact that the applications shown were fairly conventional. Personally I don't think book reading is where the action is. For time-sensitive stuff like newspapers, and rich, complex documents such as scientific papers, sure, but physical books strike me as a piece of technology that we're not really going to improve on, rather like knives and forks.

But some grasp that this is magic. What I hope the iPad will do is finally move some visualisation tools into the mainstream (as much as phylogenetics can be thought of as mainstream). The challenge of visualising large phylogenies has yielded some cool tools which, sadly, remained under-developed, such as TreeJuxtaposer, which seems clunky and counter-intuitive when using a mouse, but with a touch screen would just be awesome.

Tools such as Paloverde would also be more intuitive to use, as would the magnifier feature in Dendroscope. Imagine "pinch and zoom" in TreeJuxtaposer or Dendroscope, or for viewing large sequence alignments.

Then there are the existing tabletop tools that I blogged about earlier:

And of course there's Perceptive Pixel's view of a taxonomic classification:

There would be some work involved in porting these tools to the iPad (e.g., porting code from Java to Objective-C in the case of TreeJuxtaposer and Dendroscope), but the person who does this is going to have an impact on this field comparable to that of the Maddison brothers when they released MacClade in 1986.

Monday, January 25, 2010

Displaying taxonomic coverage using a treemap

A quick, and not altogether satisfactory hack, but I've added a simple interactive treemap to BioStor. It's essentially a remake of the Catalogue of Life treemap I created in 2008, but coloured by the number of references I've extracted from BHL. I wanted a quick way to visualise which groups were well represented (darker colours), and which weren't (lighter colours). For example, in the diagram below we can see that Annelida, Arthropoda, Chordata, Cnidaria, and Mollusca have most of the references.

The treemap at BioStor is interactive. You can click on a panel in the treemap to drill down. To go back up, click on the desired level in the tree displayed to the left of the treemap.
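The colouring idea described above (darker panels for better-represented groups) can be sketched in a few lines of Python. This is only an illustration, not the code behind the BioStor treemap: the function names, the grey colour ramp, the log scaling, and the reference counts in the usage note are all assumptions made up for the example.

```python
import math


def shade_for_count(count, max_count):
    """Return a hex grey shade: more references gives a darker colour."""
    if max_count <= 0 or count <= 0:
        return "#ffffff"  # no references: leave the panel white
    # Log scaling (an assumption) so a few huge groups don't wash out the rest.
    t = min(math.log1p(count) / math.log1p(max_count), 1.0)
    # Interpolate the grey level from 255 (light) down to 55 (dark).
    level = round(255 - t * 200)
    return "#{0:02x}{0:02x}{0:02x}".format(level)


def colour_groups(counts):
    """Map each taxon name to a shade based on its reference count."""
    max_count = max(counts.values(), default=0)
    return {name: shade_for_count(n, max_count) for name, n in counts.items()}
```

For instance, `colour_groups({"Arthropoda": 5000, "Porifera": 10, "Empty": 0})` (invented numbers) gives the darkest shade to Arthropoda, a lighter one to Porifera, and white to the empty group.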

Friday, January 15, 2010

Why BHL needs automation and/or crowd sourcing

Jim Croft drew my attention to a cool crowd-sourcing project to convert scans of Australia's newspapers to text. The site has a nice chart showing the project's coverage of the Australian newspapers, which motivated me to show something similar for BioStor. Each journal page now shows a chart of article coverage.

For example, the journal Breviora has most of its articles identified, mainly because the MCZ put a list of articles linked to BHL on their website. I harvested this to populate BioStor.

Other journals are looking a bit more sparse, such as the Bulletin of the British Museum (Natural History) (Entomology):

So, there's a lot to do, and a crying need for some combination of automation (whether driven by external metadata, which is my current approach, or trying to extract article information directly from the OCR text) and crowd sourcing if we're going to be able to make a significant dent in the task of finding articles in BHL.

Wednesday, January 13, 2010

BioStor so far

My BioStor project has reached over 13,000 articles, making it a sizeable repository of open access articles on biodiversity. It's still a tiny fraction of what could be extracted from the Biodiversity Heritage Library (BHL), but perhaps it's worth taking stock of what's there.

Coverage

One pleasing discovery is that, despite the 1923 cut-off due to U.S. copyright, BHL contains a lot of post-1923 articles. Indeed, a sparkline of the number of articles over time shows that the bulk of the articles I've extracted from BHL are from the second half of the twentieth century. These include journals such as Entomological News, and major herbaria and museum publications, such as Annals of the Missouri Botanical Garden, Breviora, Bulletin of the British Museum (Natural History). Zoology, and Proceedings of the United States National Museum.

This distribution partly reflects my own biased harvesting of BHL, but perhaps it also reflects a transition in publishing (from books and monographs to journals).
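A sparkline of articles per year is easy to approximate in plain text. The sketch below is not how BioStor draws its sparklines; it is a minimal illustration using Unicode block characters, and the function name and its parameters are hypothetical.

```python
from collections import Counter

# Eight block characters, from lowest to highest bar.
BARS = "▁▂▃▄▅▆▇█"


def sparkline(years, start=None, end=None):
    """Render the number of articles per year as a text sparkline."""
    counts = Counter(years)
    if not counts:
        return ""
    start = min(counts) if start is None else start
    end = max(counts) if end is None else end
    # One value per year in the range, with zero for years without articles.
    series = [counts.get(year, 0) for year in range(start, end + 1)]
    if not series:
        return ""
    peak = max(series)
    if peak == 0:
        return BARS[0] * len(series)
    # Scale each count to one of the eight bar heights.
    return "".join(BARS[(n * (len(BARS) - 1)) // peak] for n in series)
```

Given a list of publication years such as `[1950, 1950, 1952]`, this returns one character per year from 1950 to 1952, with the tallest bar for 1950 and the lowest for the empty year 1951.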

Exhibit

The interface to BioStor is still pretty rough, but one thing I've had fun adding is the SIMILE Exhibit widget to the author pages. Exhibit enables faceted browsing, so you can filter an author's list of publications by the journal they published in, what year the article was published, and who was a coauthor.

One use I've found is that you can quickly filter out purely nomenclatural papers by selecting an author's publications in the Bulletin of Zoological Nomenclature.

Sparklines

I'm a fan of sparklines, and I'm starting to add some to various pages. For example, you can now see a graph of how many papers an author has published over time; the example below is for the prolific Charles P Alexander:

Search

This pretty much sums up search in BioStor at the moment:

Harvesting

The rate of growth of BioStor has been pretty steady, partly because I keep managing to find (or am alerted to, or am given) sources of bibliographic metadata. I'm also adding individual papers as I come across them, either using the Reference finder form directly, or by using OpenURL from within Endnote. But there's still a crying need for large bibliographies that can be harvested.
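For readers unfamiliar with OpenURL, the request a reference manager sends is essentially a citation encoded as URL parameters. The sketch below builds such a query using the standard key/value names from OpenURL 1.0 (Z39.88-2004). The resolver address and the citation values are placeholders, and whether BioStor's resolver expects exactly this set of keys is an assumption.

```python
from urllib.parse import urlencode


def build_openurl(resolver, journal, volume, start_page, year):
    """Build an OpenURL 1.0 query for a journal article citation."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.jtitle": journal,
        "rft.volume": volume,
        "rft.spage": start_page,
        "rft.date": year,
    }
    # urlencode percent-escapes the values and joins the pairs with "&".
    return resolver + "?" + urlencode(params)


# Hypothetical resolver address and made-up citation details.
url = build_openurl(
    "https://resolver.example.org/openurl",
    "Example Journal", "12", "345", "1952",
)
```

A resolver that understands the query can then look up the article from the journal title, volume, and starting page, which is the same information a bulk bibliography would supply.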

Export

Individual references can be exported in a range of formats, and there is an RSS feed, but I need to add bulk export (including the ability to dump the whole database).

Viewer

The viewer I developed earlier is working reasonably well, but could do with sprucing up. I'm also toying with using other viewers as well as (or instead of) mine. One possibility is iPaper (although this is Flash-based, and I'd like to avoid Flash wherever possible).

Taxonomic names

One thing I've discovered to my horror is just how common it is for the BHL taxon finder to miss obvious taxonomic names. I need to investigate this a bit more, but this is potentially a major issue as taxonomic indexing is one of the primary ways people will find content in BioStor.

Future

There's still so much to do. For one thing, the visualisations that originally got me seriously thinking about playing with BHL aren't in BioStor yet. And there's a long to-do list.

Monday, January 04, 2010

Thoughts on the International Year of Biodiversity 2010

Given that a new decade prompts predictions, as well as New Year's resolutions, and that 2010 is the International Year of Biodiversity, which comes complete with glossy web sites and calls for action, I'm making some predictions of my own, inspired in part by Eric Hellman's Ten Predictions for the Next Ten Years. I won't be nearly as bold as Eric: I'm limiting myself to biodiversity informatics, and the coming year. Here are my predictions:

The Encyclopedia of Life will continue its slow decline into irrelevance. Nobody will care, as we have Wikipedia.

Catalogue of Life (CoL) will issue another release, complete with much fanfare. The LSIDs for the 2009 release (which have never worked) will continue to fail; LSIDs for 2010 either won't be released, or will fail. Nobody will care.

There will be much talk of integrating biodiversity data. Unless GBIF adopts resolvable identifiers for specimens, and a major nomenclator or taxonomic name database (re)uses resolvable identifiers for literature (e.g., DOIs and BHL URLs), nothing of significance in this area will happen. Database providers will continue to confuse "link integration" (i.e., sharing URLs, doi:10.1038/nrg1065) with genuine integration.

For most young scientists GenBank will be the dominant source of information about biodiversity. If it hasn't been sequenced, they won't care about it.

DNA barcoding by itself will become boring, but will be the best tool with which to engage the public about taxonomy (e.g., Barcoding, taxonomy and citizen CSI).

Literature that is not online will cease to be read. Taxonomic groups where the literature is not online will effectively cease to be studied.

The major databases will continue to be riddled with errors. These will be numerous enough to be annoying, but not so numerous as to prevent useful work being done. The databases will make no (serious) effort to fix these (doi:10.1126/science.319.5870.1598).

No major database effort will adopt wikis.

Data providers such as Thomson Reuters (Index of Organism Names) will continue to cling to debilitating notions of "intellectual property." As the coverage of the Biodiversity Heritage Library and the reach of Google's indexing increase, commercial indexing services will become irrelevant.

The chasm between the classifications that underlie efforts such as EOL, and the phylogenetic trees being generated by systematists, will grow. Neither community will care.

What are your predictions?