iPhylo: October 2010

Roderic D. M. Page

Monday, October 25, 2010

Are names really the key to the big new biology?

David ("Paddy") Patterson, Jerry Cooper, Paul Kirk, Rich Pyle, and David Remsen have published an article in TREE entitled "Names are key to the big new biology" (doi:10.1016/j.tree.2010.09.004). The abstract states:

Those who seek answers to big, broad questions about biology, especially questions emphasizing the organism (taxonomy, evolution and ecology), will soon benefit from an emerging names-based infrastructure. It will draw on the almost universal association of organism names with biological information to index and interconnect information distributed across the Internet. The result will be a virtual data commons, expanding as further data are shared, allowing biology to become more of a ‘big science’. Informatics devices will exploit this ‘big new biology’, revitalizing comparative biology with a broad perspective to reveal previously inaccessible trends and discontinuities, so helping us to reveal unfamiliar biological truths. Here, we review the first components of this freely available, participatory and semantic Global Names Architecture.

Do we need names?

Reading this (full disclosure, I was a reviewer) I can't help wondering whether the assumption that names are key really needs to be challenged. Roger Hyam has argued that we should be calling time on biological nomenclature, and I wonder whether for a generation of biologists brought up on DNA barcodes and GPS, taxonomy and names will seem horribly quaint. For a start, sequences and GPS coordinates are computable: we can stick them in computers and do useful things with them. DNA barcodes can be used to infer identity, evolutionary relationships, and dates of divergence. Taken in aggregate we can infer ecological relationships (such as diet, e.g., doi:10.1371/journal.pone.0000831), biogeographic history, gene flow, etc. While barcodes can tell us something about an organism, names don't. Even if we have the taxonomic description we can't do much with it, because extracting information from taxonomic descriptions is hard.

Furthermore, formal taxonomic names don't seem terribly necessary in order to do a lot of science. Patterson et al. note that taxa may have "surrogate" names:

Surrogates include provisional names and specimen, culture or strain numbers which refer to a taxon. 'SAR-11' ('SAR' refers to the Sargasso Sea) was a surrogate name given in 1990 to an important member of the marine plankton. Only a decade later did it become known as Pelagibacter ubique.

The name Pelagibacter ubique was published in 2002 (doi:10.1038/nature00917), although as a Candidatus name (doi:10.1099/00207713-45-1-186), not a name conforming to the International Code of Nomenclature of Bacteria. I doubt the lack of a name that follows this code is hindering the study of this organism, and researchers seem happy to continue to use 'SAR11'.

So, I think that as we go forward we are going to find nomenclature struggling to establish its relevance in the age of digital biology.

If we do need them, how do we manage them?
If we grant Patterson et al. their premise that names matter (and for a lot of the legacy literature they will), then how do we manage them? In many ways the "Names are key to the big new biology" paper is really a pitch for the Global Names Architecture or GNA (and its components GNI, GNITE, and GNUB). So, we're off into alphabet soup again (sigh). The more I think about this the more I want something very simple.

Names
All I want here is a database of name strings and tools to find them in documents. In other words, uBio.

Documents
Broadly defined to include articles, books, DNA sequences, specimens, etc. I want a database of [name,document] pairs (BHL has a huge one), and a database of documents.

Realistically, given the number and type of documents there will be several "document" databases, such as GenBank and GBIF. For citations Mendeley looks very promising. If we had every taxonomic publication in Mendeley, tagged with scientific names, then we'd have the bibliography of life. Taxonomic nomenclators would be essentially out of business, given that their function is to store the first publication of a name. Given a complete bibliography we just create a timeline of usage for a name and note the earliest [name,document] pair:
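
The timeline idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the NameDocument record, its fields, and the sample data are all made up, and a real bibliography would need proper dates and document identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameDocument:
    """One [name, document] pair: a name string found in a dated document."""
    name: str
    document_id: str  # hypothetical identifier, e.g. a DOI or a BHL item id
    year: int


def usage_timeline(pairs, name):
    """Return every pair that mentions `name`, oldest first."""
    return sorted((p for p in pairs if p.name == name), key=lambda p: p.year)


def earliest_usage(pairs, name):
    """Return the earliest pair for `name`, or None if the name never occurs."""
    timeline = usage_timeline(pairs, name)
    return timeline[0] if timeline else None


# Made-up records, for illustration only.
pairs = [
    NameDocument("Aus bus", "doc-3", 1975),
    NameDocument("Aus bus", "doc-1", 1902),
    NameDocument("Aus cus", "doc-2", 1950),
]
```

Calling earliest_usage(pairs, "Aus bus") picks out the 1902 record, which is the job a nomenclator does by hand. The result is only as good as the bibliography: if the original description is missing, the "earliest" pair is simply the earliest one we happen to hold.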

Taxonomy
There are a few wrinkles to deal with. Firstly, names may have synonyms, lexical variants, etc. (the Patterson et al. paper has a nice example of this). Leaving aside lexical variants, what we want is a "view" of the [name,document] pairs that says this subset refers to the same thing (the "taxon concept").

We can obsess over details in individual cases, but at web-scale there are only two that spring to mind. The first is the Catalogue of Life, the second is NCBI. The Catalogue of Life lists sets of names and references that it regards as being the same thing, although it does unspeakable things to many of the references. In the case of NCBI the "concepts" would be the sets of DNA sequences and associated publications linked to the same taxonomy id. Whatever you think of the NCBI taxonomy, it is at least computable, in the sense that you could take a taxon and generate a list of publications "about" that taxon.

So, we have names, [name,document] pairs, and sets of [name,document] pairs. Simples.
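
To show how little machinery the third layer needs, here is a rough sketch in which a "taxon concept" is nothing more than a label shared by several names. The names, document ids, and concept ids are invented for the example; in practice the mapping might come from something like an NCBI taxonomy id.

```python
from collections import defaultdict

# Made-up data: [name, document] pairs, and a mapping from name to concept id.
pairs = [
    ("Aus bus", "doc-1"),
    ("Aus bus", "doc-3"),
    ("Xus bus", "doc-2"),  # treated here as a synonym of "Aus bus"
    ("Aus cus", "doc-4"),
]
concept_of = {"Aus bus": "concept-1", "Xus bus": "concept-1", "Aus cus": "concept-2"}


def documents_by_concept(pairs, concept_of):
    """Group documents by the concept their name belongs to."""
    grouped = defaultdict(set)
    for name, document_id in pairs:
        concept = concept_of.get(name)
        if concept is not None:  # names without a concept are left out
            grouped[concept].add(document_id)
    return dict(grouped)
```

The list of publications "about" a taxon is then just documents_by_concept(pairs, concept_of)["concept-1"].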

Thursday, October 21, 2010

Mendeley, BHL, and the "Bibliography of Life"

One of my hobby horses is the disservice taxonomic databases do their users by not linking to original scientific literature. Typically, taxonomic databases either don't cite primary literature, or regurgitate citations as cryptic text strings, leaving the user to try and find the item being referred to. With the growing number of publishers that are digitising legacy literature and issuing DOIs, together with the Biodiversity Heritage Library's (BHL) enormous archive, there's really no excuse for this.

Taxonomic databases often cite references in abbreviated forms, or refer to individual pages, rather than citable units such as articles (see my Nomenclators + digitised literature = fail post for details). One way to translate these into links to articles would be to have a tool that could find a page within an article, or could match an abbreviated citation to a full one. This task would be relatively straightforward if we had the "bibliography of life," a freely accessible bibliography of every taxonomic paper ever published. Sadly, we don't...yet.

Bibliography of life

Mendeley is rapidly building a very large bibliography (although exactly how large is a matter of some dispute, see Duncan Hull's How many unique papers are there in Mendeley?), and I'm starting to explore using it as a way to store bibliographic details on a large scale. For example, an increasing number of smaller museum or society journals are putting lists of all their published articles on the web. Typically these are HTML pages rather than bibliographic data, but with a bit of scraping we can convert them to something useful, such as RIS format, and import them into Mendeley. I've started to do this, creating Mendeley groups for individual journals, e.g.:

These lists aren't necessarily complete nor error-free, but they contain the metadata for several thousand articles. If individual societies and museums made their list of publications freely available we would make significant progress towards building a bibliography of life. And with the social networking features of Mendeley, we could have groups of collaborators clean up any errors in the metadata.
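
For the scraping step mentioned above, the last stage is turning each scraped article into a RIS record. A minimal sketch follows, assuming the scraper has already produced a dictionary per article; the dictionary keys are hypothetical, while the two-letter tags (TY, AU, TI, JO, VL, SP, EP, PY, ER) are standard RIS tags.

```python
def to_ris(article):
    """Format one article (a dict of scraped fields) as a RIS record."""
    lines = ["TY  - JOUR"]  # every record starts with its type
    for author in article.get("authors", []):
        lines.append(f"AU  - {author}")  # one AU line per author
    # (RIS tag, hypothetical dictionary key) pairs for the single-valued fields.
    fields = (
        ("TI", "title"),
        ("JO", "journal"),
        ("VL", "volume"),
        ("SP", "spage"),
        ("EP", "epage"),
        ("PY", "year"),
    )
    for tag, key in fields:
        value = article.get(key)
        if value:  # skip fields the scraper could not find
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines) + "\n"
```

Writing the concatenated records to a .ris file gives something a reference manager can import. Fields missing from the web page are simply omitted, which is one reason the resulting lists need cleaning up afterwards.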

Of course, this isn't the only way to do this. I suspect I'm rather atypical in building Mendeley groups containing articles from only one journal, as opposed to groups based on specific topics, and of course we could also tackle the problem by creating groups with a taxonomic focus (such as all taxonomic papers on amphibians). Furthermore, if and when more taxonomists join Mendeley and share their personal bibliographies, we will get a lot more relevant articles "for free." This is Mendeley's real strength in my opinion: it provides rich tools for users to do what they most want to do (manage their PDFs and cite them when they write papers), but off the back of that Mendeley can support larger tasks (in the same way that Flickr's ability to store geotagged photos has led to some very interesting visualisations of aggregated data).

BioStor

For some of the journals I've added to Mendeley I just have bibliographic data: the actual content isn't freely available online, and in some cases isn't even digitised. But for some journals the content exists in BHL, it's "just" a matter of finding it. This is where my BioStor project comes in. For example, BHL has scanned most of the journal Spixiana. While BHL recognises individual volumes (see http://www.biodiversitylibrary.org/bibliography/40214) it has no notion of articles. To find these I scraped the tables of contents on the Spixiana web site and ran them through BioStor's OpenURL resolver. If you visit the BioStor page for the journal (http://biostor.org/issn/0341-8391) you will see that most of the articles have been identified in BHL, although there are a few holes that will need to be filled.

These articles are listed in a Mendeley group for Spixiana, with the articles linked to BioStor wherever possible.
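
For readers unfamiliar with OpenURL, running a table of contents through a resolver amounts to building one query per article from its citation details. The sketch below uses the common OpenURL key names (genre, title, volume, spage, date); the resolver address is a placeholder, the citation values are made up, and which keys a particular resolver accepts is an assumption to check against its documentation.

```python
from urllib.parse import urlencode


def openurl_query(resolver, journal, volume, start_page, year):
    """Build an OpenURL-style query for one article."""
    params = {
        "genre": "article",
        "title": journal,     # journal title
        "volume": volume,
        "spage": start_page,  # first page of the article
        "date": year,
    }
    return resolver + "?" + urlencode(params)


# Placeholder resolver and made-up citation details.
url = openurl_query("http://example.org/openurl", "Spixiana", 10, 1, 1987)
```

Each row scraped from a table of contents becomes one such URL, and the resolver either finds the matching pages or leaves a hole to be filled by hand.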

CiteBank and on not reinventing the wheel
If we were to use Mendeley as the primary platform for aggregating taxonomic publications, then I see this as the best way to implement "CiteBank". BHL have created CiteBank as "an open access repository for biodiversity publications" using Drupal. Whatever one thinks of Drupal, bibliographic management is not an area where it shines. I think the taxonomic community should take a good look at Mendeley and ask themselves whether this is the platform around which they could build the bibliography of life.

Friday, October 08, 2010

Towards an interactive DjVu file viewer for the BHL

The bulk of the Biodiversity Heritage Library's content is available as DjVu files, which package together scanned page images and OCR text. Websites such as BHL or my own BioStor display page images, but there's no way to interact with the page content itself. Because it's just a bitmap image there's no obvious way to do simple things such as select and copy some text, click on some text and correct the OCR, or highlight some text as a taxonomic name or bibliographic citation. This is frustrating, and greatly limits what we can do with BHL's content.

In March I wrote a short post, DjVu XML to HTML, showing how to pull out and display the text boxes for a DjVu file. I've put this example, together with links to the XSLT file I use to do the transformation, online at Display text boxes in a DjVu page. Here's an example, where each box (a DIV element) corresponds to a fragment of text extracted by OCR software.

The next step is to make this interactive. Inspired by Google's Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?), I've revisited this problem. One thing the Google PDF viewer does nicely is enable you to select a block of text from a PDF page, in the same way that you can in a native PDF viewer such as Adobe Acrobat or Mac OS X Preview. It's quite a trick, because Google is displaying a bitmap image of the PDF page. So, can we do something similar for DjVu?

What I'd like to do is something like what is shown below, namely drag a "rubber band" on the page and select all the text that falls within that rectangle:

This boils down to knowing for each text box whether it is inside or outside the selection rectangle:

Implementation

We could try and solve this by brute force, that is, query each text box on the page to see whether it overlaps with the selection or not, but we can make use of a data structure called an R-tree to speed things up. I stumbled across Jon-Carlos Rivera's R-Tree Library for Javascript, and so was inspired to try and implement DjVu text selection in a web browser using this technique.

The basic approach is as follows:

Extract text boxes from DjVu XML file and lay these over the scanned page image.

Add each text box to an R-tree index, together with the "id" attribute of the corresponding DIV on the web page, and the OCR text string from that text box.

Track mouse events on the page: when the user clicks with the mouse we create a selection rectangle ("rubber band"), and as the mouse moves we query the R-tree to discover which text boxes have any portion of their extent within the selection rectangle.

Text boxes in the selection have their background colour set to a semi-transparent shade of blue, so that the user can see the extent of the selected text. Boxes outside the selection are hidden.

When the user releases the mouse we get the list of text boxes from the R-tree, and concatenate the text corresponding to each box, and finally display the resulting selection to the user.
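
The core of the steps above is the overlap test between a text box and the selection rectangle. Here is a minimal sketch of that logic in Python rather than the Javascript used in the demo, and using the brute-force scan instead of an R-tree (an R-tree answers the same question, just faster on busy pages). The Box type, its field names, and the sample words are made up for illustration; coordinates are screen-style, with y increasing down the page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One OCR text box, in page pixel coordinates (y grows downwards)."""
    box_id: str  # would match the "id" of the DIV on the web page
    text: str
    left: float
    top: float
    right: float
    bottom: float


def overlaps(box, left, top, right, bottom):
    """True if any portion of `box` lies within the selection rectangle."""
    return (box.left < right and box.right > left
            and box.top < bottom and box.bottom > top)


def select_text(boxes, x0, y0, x1, y1):
    """Return the text of all boxes touched by a drag from (x0, y0) to (x1, y1)."""
    # Normalise the corners so that dragging in any direction works.
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    hits = [b for b in boxes if overlaps(b, left, top, right, bottom)]
    # Crude reading order: top to bottom, then left to right.
    hits.sort(key=lambda b: (b.top, b.left))
    return " ".join(b.text for b in hits)


# Made-up boxes: two words on one line, one word on the next.
boxes = [
    Box("w2", "ubique", 100, 10, 150, 20),
    Box("w1", "Pelagibacter", 10, 10, 90, 20),
    Box("w3", "plankton", 10, 40, 80, 50),
]
```

Sorting by the top edge is only a rough stand-in for reading order, since words on the same line of real OCR output rarely share exactly the same top coordinate.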

Copying text

So far so good, but what can we do with the selected text? One obvious thing would be to copy and paste it (for example, we could select a species distribution and paste it into a text editor). Since all we've done is highlight some DIVs on a web page, how can we get the browser to realise that it has some text it can copy to the clipboard? After browsing Stack Overflow I came across this question, which gives us some clues. It's a bit of a hack, but behind the page image I've hidden a TEXTAREA element, and when the user has selected some text I populate the TEXTAREA with the corresponding text, then set the browser's selection range to that text. As a consequence, the browser's Copy command (⌘C on a Mac) will copy the text to the clipboard.

Demo

You can view the demo here. It only works in Safari and Chrome, as I've not had a chance to address cross-browser compatibility. It also works on the iPad, which seems a natural device to support interactive editing and annotation of BHL text, but you need to click on the button "On iPad click here to select text" before selecting text. This is an ugly hack, so I need to give a bit more thought to how to support the iPad touch screen, while still enabling users to pan and zoom the page image.

Next steps

This is all very crude, but I think it shows what can be done. There are some obvious next steps:

Enable selected text to be edited so that we can correct the underlying OCR text.

Add tools that operate on the selected text, such as checking whether it is a taxonomic name or, if it is a bibliographic citation, attempting to parse it and locate it online (using, for example, David Shorthouse's reference parser).

Select parts of the page image itself, so that we could extract a figure or map.

Add "post it note" style annotations.

Add services that store the edits and annotations, and display annotations made by others.

Lots to do. I foresee a lot of Javascript hacking over the coming weeks.

Tuesday, October 05, 2010

Scripting Life

Not really a blog post, more a note to self. If I ever did get around to writing a book again, I think Scripting Life would be a great title.

PLoS Biodiversity Hub launches

The PLoS Biodiversity Hub has launched today. There's a PLoS blog post explaining the background to the project, as well as a summary on the Hub itself:

The vision behind the creation of PLoS Hubs is to show how open-access literature can be reused and reorganized, filtered, and assessed to enable the exchange of research, opinion, and data between community members.

PLoS Hubs: Biodiversity provides two main functions to connect researchers with relevant content. First, open-access articles on the broad theme of biodiversity are selected and imported into the Hub. In time, the content will also be enhanced so that the articles are connected with data, and we will provide features to make the articles easier for people to use. These two functions - aggregation and adding value - build on the concept of open access, which removes all the barriers to access and reuse of journal article content.

Readers of iPhylo may recall my account of one of the meetings involved in setting up this hub, in which I began to despair about the lack of readiness of biodiversity informatics to provide much of the information needed for projects such as hubs. Despite this (or perhaps, because of it), I've become a member of the steering committee for the Biodiversity Hub. There's clearly a lot of interest in repurposing the content found in scientific articles, and I think we're going to see an increasing number of similar projects from the major players in science publishing, Open Access or otherwise. One of the challenges is going to be moving beyond the obvious things (such as making taxon names clickable) to enable new kinds of ways of reading, navigating, and querying the literature, and exploring ways to track the use that is made of the information in these articles. Biodiversity studies are ideally placed to explore this as the subject is data rich and much of that data, such as specimens and DNA sequences, persist over time and hence get reused (data citation gets very boring if the data is used just once). We also have obvious ways to enrich navigation, such as spatially and taxonomically.

For now the PLoS Biodiversity Hub is very pretty, but it's more a statement of intent than a real demonstration of what can be done. Let's hope our field gets its act together and seizes the opportunity that initiatives like the Hub represent. Publishers are desperate to differentiate themselves from their competitors by providing added value as part of the publication process, and they provide a real use case for all the data that the biodiversity projects have been accumulating over the last couple of decades.

Friday, October 01, 2010

Replicating and forking data in 2010: Catalogue of Life and CouchDB

Time (just) for a Friday folly. A couple of days ago the latest edition of the Catalogue of Life (CoL) arrived in my mailbox in the form of a DVD and booklet:

While in some ways it's wonderful that the Catalogue of Life provides a complete data dump of its contents, this strikes me as a rather old-fashioned way to distribute it. So I began to wonder how this could be done differently, and started to think of CouchDB. In particular, I began to think of being able to upload the data to a service (such as Cloudant) where the data could be stored and replicated at scale. Then I began to think about forking the data. The Catalogue of Life has some good things going for it (some 1.25 million species, and around 2 million names), and is widely used as the backbone of sites such as EOL, GBIF, and iNaturalist.org, but parts of it are broken. Literature citations are often incomplete or mangled, and in places it is horribly out of date.

Rather than wait for the Catalogue of Life to fix this, what if we could share the data, annotate it, correct mistakes, and add links? In particular, what if we link the literature to records in the Biodiversity Heritage Library so that we can finally start to connect names to the primary literature (imagine clicking on a name and being able to see the original species description)? We could have something akin to github, but instead of downloading and forking code, we download and fork data. CouchDB makes replicating data pretty straightforward.

So, I've started to upload some Catalogue of Life records to a CouchDB instance at Cloudant, and write a simple web site to display these records. For example, you can see one record at http://iphylo.org/~rpage/col/?id=e9fda47629c1102b9a4a00304854f820:

The e9fda47629c1102b9a4a00304854f820 in this URL is the UUID of the record in CouchDB, which is also the UUID embedded in the (non-functional) CoL LSIDs. This ensures the records have a unique identifier, but also one that is related to the original record. You can search for names, or browse the immediate hierarchy around a name. I hope to add more records over time as I explore this further — at the moment I've added a few lizards, wasps, and conifers while I explore how to convert the CoL records into a sensible JSON object to upload to CouchDB.
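
As a rough idea of what such a JSON object might look like, the sketch below builds a CouchDB document whose "_id" (the field CouchDB uses as the document identifier) is the record's UUID. Everything else is hypothetical: the other field names, the placeholder UUIDs, and the placeholder taxon are invented for the example and are not the structure actually used for the CoL records.

```python
import json


def to_couchdb_doc(uuid, name, rank, parent_uuid=None):
    """Build a JSON-ready document, reusing the record UUID as the CouchDB _id."""
    doc = {"_id": uuid, "name": name, "rank": rank}
    if parent_uuid is not None:
        doc["parent"] = parent_uuid  # link to the next taxon up the hierarchy
    return doc


# Placeholder UUIDs and a placeholder name, for illustration only.
doc = to_couchdb_doc(
    "00000000000000000000000000000001",
    "Aus bus",
    "species",
    parent_uuid="00000000000000000000000000000002",
)
body = json.dumps(doc, indent=2)
```

Keeping a pointer to the parent in each document is one simple way to support browsing the immediate hierarchy around a name.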

The next step is to think about this as a way to distribute data (want a copy of CoL? Just point your CouchDB at the Cloudant URL and replicate it), and to think about how to build upon the basic records, editing and improving them, then thinking about how to get that information into a future version of the Catalogue.
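
"Point your CouchDB at the URL and replicate it" corresponds to a single HTTP request: a POST to the local server's _replicate endpoint naming a source and a target. The sketch below only builds that request, using Python's standard library; the server addresses and database name are placeholders, and a real server will usually also want credentials.

```python
import json
from urllib.request import Request


def replication_request(local_server, source_url, target_db):
    """Build (but do not send) a CouchDB replication request."""
    body = json.dumps({
        "source": source_url,   # remote database to copy from
        "target": target_db,    # local database to copy into
        "create_target": True,  # create the local database if it is missing
    }).encode("utf-8")
    return Request(
        local_server.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder addresses, for illustration only.
req = replication_request(
    "http://localhost:5984/",
    "https://example.cloudant.com/col",
    "col",
)
```

Passing req to urllib.request.urlopen would start the copy. Swapping source and target is what would push local corrections back the other way, which is where the forking idea gets interesting.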