iPhylo: January 2012

Roderic D. M. Page

Monday, January 30, 2012

BLAST a sequence and get a tree

For this week's sessions of my phyloinformatics course I'm developing some phylogeny tools. The first is a simple AJAX-based BLAST tool. I've always wanted a quick way to see a GenBank sequence in its phylogenetic context, so I've built a simple tool that takes a GenBank accession number or GI number, submits a BLAST job, retrieves the sequences, aligns them using CLUSTALW, builds a quick and dirty neighbour-joining tree using PAUP*, then displays the tree using SVG (if your browser doesn't support this you won't see the tree). One use for this is to quickly get a sense of whether an unnamed ("dark") taxon is related to sequences that have been identified.

Nothing fancy, but it was a chance to display the whole process in the browser without opening new windows or refreshing the page. Here's an example for the GenBank sequence FJ559186:

For the technically-minded, the calls to BLAST and the alignment and tree construction tools all use AJAX, and there's a simple JavaScript timer to count down the seconds that the NCBI BLAST web service estimates the BLAST job will take, before we poll NCBI to see if the job has in fact finished. The code is in GitHub.
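The countdown-then-poll idea can be sketched outside the browser too. What follows is a minimal Python illustration, not the JavaScript that is in GitHub. It assumes the BLAST service reports the request ID (RID), the estimated wait in seconds (RTOE) and later the job status as key=value lines inside a QBlastInfoBegin/QBlastInfoEnd block. The names qblast_info and wait_for_job, and the fetch_status callable that would make the actual HTTP request, are hypothetical.

```python
import re
import time


def qblast_info(text):
    """Collect key=value pairs from QBlastInfoBegin/QBlastInfoEnd blocks."""
    info = {}
    for block in re.findall(r"QBlastInfoBegin(.*?)QBlastInfoEnd", text, re.S):
        for key, value in re.findall(r"^\s*(\w+)\s*=\s*(\S+)\s*$", block, re.M):
            info[key] = value
    return info


def wait_for_job(rid, rtoe, fetch_status, sleep=time.sleep,
                 poll_interval=15, max_polls=40):
    """Count down the estimated wait, then poll until the job stops WAITING.

    fetch_status is a caller-supplied function (an assumption of this sketch)
    that takes the RID and returns the text of the status response.
    """
    # One tick per second, like the on-page timer.
    for _ in range(rtoe):
        sleep(1)
    # The estimate is only an estimate, so keep polling until the job is done.
    for _ in range(max_polls):
        status = qblast_info(fetch_status(rid)).get("Status", "UNKNOWN")
        if status != "WAITING":
            return status
        sleep(poll_interval)
    return "TIMEOUT"
```

Passing sleep in as a parameter keeps the loop easy to test without actually waiting; a real client would also want to be polite about how often it polls NCBI.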

Thursday, January 26, 2012

Extracting museum specimen codes from text

Quick note about a tool I've cobbled together as part of the phyloinformatics course, which addresses a long-standing need I and others have to extract specimen codes from text. I've had this code kicking around for a while (as part of various never-finished data mining projects), but never got around to releasing it, until now. It is very crude (basically a bunch of regular expressions), and there's a lot which could be done to improve it (not least starting with a complete list of museum specimen codes, rather than just those I've come across in, say, Zootaxa and BioStor).

You can try the tool at http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php. Paste in some text and it will try and extract museum codes. The tool tries to handle ranges of specimens (e.g., MHNSM 1808-09), and some of the more common specimen numbering schemes.
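To give a flavour of the regular-expression approach, here is a minimal Python sketch. It is not the code behind the tool. It assumes a specimen code is a short run of capital letters followed by a number, and that a range such as "1808-09" abbreviates the second number by giving only its final digits. The pattern and the name extract_specimens are illustrative only.

```python
import re

# Assumed shape: acronym of 2-6 capitals, a number, and an optional range suffix
# introduced by a hyphen or an en dash.
SPECIMEN = re.compile(r"\b([A-Z]{2,6})\s?(\d{2,})(?:\s?[-\u2013]\s?(\d+))?\b")


def extract_specimens(text):
    """Return specimen codes found in text, expanding short ranges."""
    codes = []
    for acronym, start, end in SPECIMEN.findall(text):
        first = last = int(start)
        if end:
            # "1808-09" means 1808 to 1809: the suffix replaces the final digits.
            if len(end) < len(start):
                end = start[:len(start) - len(end)] + end
            last = int(end)
        # Ignore implausible ranges and keep just the first number.
        if last < first or last - first > 1000:
            last = first
        for n in range(first, last + 1):
            # zfill preserves any leading zeros in the original number.
            codes.append(f"{acronym} {str(n).zfill(len(start))}")
    return codes
```

A pattern this loose will also pick up things that merely look like specimen codes, which is one reason a list of known museum acronyms would help.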

Comments welcome. If you are looking for a source of text, papers in ZooKeys or Zootaxa are a good place to start (especially papers on vertebrates, where specimen numbers are often used). BioStor is also a good source: if you're looking at a paper in BioStor, click on the "Text" link to get the OCR text for an article and paste that into the form. For example, the text for Systematics of the Bufo coccifer complex (Anura: Bufonidae) of Mesoamerica is available at http://biostor.org/reference/97426.text.

The extraction tool can also be called as a web service using POST to get back the results in JSON.
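Calling the service from a script might look something like the Python sketch below. The URL is the one given above, but the form field name ("text") and the shape of the JSON that comes back are assumptions, so check the form itself before relying on this.

```python
import json
import urllib.parse
import urllib.request

SERVICE_URL = "http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php"


def build_request(text, field="text"):
    """Build a form-encoded POST request; the field name is a guess."""
    data = urllib.parse.urlencode({field: text}).encode("utf-8")
    return urllib.request.Request(SERVICE_URL, data=data, method="POST")


def extract_codes(text):
    """POST the text and decode whatever JSON the service returns."""
    with urllib.request.urlopen(build_request(text), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```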

Monday, January 23, 2012

Open course on phyloinformatics

As part of a postgraduate course here at the University of Glasgow I'm teaching five sessions on "phyloinformatics", which I've decided to define broadly enough to encompass most of biodiversity informatics.

Given that this module is being developed on the fly, and will make use of lots of little "toys" I've developed and discussed on this blog, I've decided to put the course notes online, along with the interactive demos and the source code. So, if you want to follow along for the next couple of weeks, here are the links:

Course home page
Course notes and exercises (currently just the introductory session)
Source code on GitHub (including code for my EOL iPad webapp)

Each course page supports comments (see the bottom of the page), so feel free to add comments, or suggestions. The notes are at a crude stage, and will be developed over the duration of the course (2 weeks). I'm also endeavouring to get all the source code for the demonstration apps into GitHub. None of these demos is polished, but they will hopefully provide some ideas for taking them further. There will be iSpecies-like mashups, iPad webapps, classification visualisations, TreeBASE search tools, geophylogenies and other phylogeny viewers.

Thursday, January 19, 2012

EOL iPad web app using jQueryMobile

As part of a course on "phyloinformatics" that I'm about to teach I've been making some visualisations of classifications. Here's one I've put together using jQuery Mobile and the Encyclopedia of Life API. It's pretty limited, but is a simple way to explore EOL using three different classifications. You can view this live at http://iphylo.org/~rpage/phyloinformatics/eoliphone/ (looks best on an iPad or iPhone). Once I've tidied it up I'll put the code online. Meantime here's a quick demo:

Wednesday, January 18, 2012

Yet another reason why we need specimen identifiers, now!

This message appeared on the TAXACOM mailing list:

It is getting more and more necessary for taxonomists to demonstrate
that they are useful and used. This does not only apply to the
individual scientists, but also to institutions with taxonomic
collections, such as museums and herbaria.

In an attempt to live up to that increasing demand for documentation,
the leadership of the Natural History Museum of Denmark has issued an
order to its curatorial staff - The staff members are requested to
document which publications from 2011, written entirely by external
scientists, that in one way or another are based on material in the
collections of the Museum.

Given that most specimens lack resolvable digital identifiers (a theme I've harped on about before, most recently in the context of DNA barcoding), answering this kind of query ends up being a case of searching publications for text strings that contain the acronym of the collection. The sender of the message, Ib Friis, is alarmed at this prospect:

In publications, material from our herbarium at "C" is normally referred
to in text strings of one of the following forms: "(C)", "(C, ", ", C,"
or " C)". But a search in for example Google Scholar or other search
engines result in overflow of thousands and thousands of hits, even
when these text strings are combined with other relevant words such as
"botany", "plants", etc.

In an earlier paper "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" (http://dx.doi.org/10.1093/bib/bbn022) (free preprint available here: hdl:10101/npre.2008.1760.1) I argued that having resolvable identifiers for specimens could enable measures of "citation" to be computed for specimens (and data derived from those specimens). Just as we have citation counts for articles and impact factors for journals, we could have equivalent measures for specimens and collections. These measures may keep administrators happy, but for scientists I think the real benefits will be the ability to trace the provenance of some data, and the fate of data they themselves have collected or published.

For things such as publications it is trivial to track their usage. For example, to find the number of times the article "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" has been cited, I simply enter the DOI into Google Scholar, e.g. http://scholar.google.co.uk/scholar?q=10.1093/bib/bbn022. Imagine being able to do the same for specimens?

For this to happen, museum specimens need digital identifiers. If museums are serious about quantifying the impact of their collections, they should make assigning digital identifiers a priority.

Tuesday, January 17, 2012

Mendeley as CiteBank: some ideas

Here are some quick notes on how BHL could use Mendeley as a "CiteBank".

As a repository of bibliographic data

If the goal is to assemble a "bibliography of life" then there are various ways this could be done.

Taxon-specific bibliographies

Create groups that are taxon-specific (or find existing groups in Mendeley). For example, I've created groups for amphibians (Amphibian Species of the World) and reptiles (TIGR/JCVI Reptile Database), based on the Amphibian Species of the World and TIGR/JCVI Reptile Database, respectively. Taxon-specific groups are probably going to be attractive to users, but the quality of bibliographic metadata can be variable. However, a bibliography for a specific taxonomic group that is populated with links to BHL content would be very useful.

Journal-specific bibliographies

This is where I've spent most of my efforts. I've created around 300 groups for various journals (see list below, or go directly to http://dl.dropbox.com/u/639486/groups.html). In some cases I've managed to populate these with the complete set of articles published in that journal, typically harvested from the journal's own web site. Typically the metadata from journal sites is high quality, although one has to be wary of Orwellian metadata.

I use these groups in two ways. The first is as a source of metadata for extracting articles from BHL using BioStor. If you have article-level metadata, finding articles in BHL becomes easier, and can be automated so that thousands can be added in a few minutes.

The second is for the taxon-literature mapping project, where one strategy is to use approximate string matching to find equivalent citations in Mendeley and the ION database. Ultimately I'd like to link to the Mendeley citations as they tend to be higher quality than those in the original ION database.
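As a rough illustration of matching citations approximately, here is a small Python sketch using the standard library's difflib. It is not the method used in the mapping project: the normalisation step, the 0.8 threshold, and the names normalise and best_match are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalise(citation):
    """Lower-case the string and collapse punctuation and spacing."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in citation)
    return " ".join(cleaned.split())


def best_match(query, candidates, threshold=0.8):
    """Return (candidate, score) for the closest candidate, or (None, score)."""
    q = normalise(query)
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, q, normalise(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score
```

Comparing every query against every candidate gets slow for large sets, so in practice one would first narrow the candidates (for example by journal or year) before scoring them.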

BHL could create Mendeley groups for journals it has scanned, and populate those.

As an article-level index to BHL

Perhaps the most direct way BHL could use Mendeley is as follows:

Create a BHL account.
For each BHL title create a Mendeley group (the name would be the BHL TitleID).
For each item in that title create a folder in the corresponding group (the folder name would be the ItemID).
Within each folder list the articles, book chapters or other component parts. If these aren't available yet, encourage people to add them. Some of these could be pre-populated with content from BioStor.
Harvest the contents of these groups to provide an article-level index to BHL (the lack of which is, for me, the single biggest impediment to using BHL). Previously I've suggested a way to easily add article data to BHL; Mendeley title/item groups and folders might be a way to facilitate this process.
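The group/folder layout in the steps above amounts to a simple nested structure, and harvesting it is just a matter of flattening. Here is a Python sketch of that idea. It makes no calls to Mendeley or BHL, and the class names, fields and identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    title: str
    start_page: int
    end_page: int


@dataclass
class ItemFolder:
    item_id: str                      # BHL ItemID, used as the folder name
    articles: list = field(default_factory=list)


@dataclass
class TitleGroup:
    title_id: str                     # BHL TitleID, used as the group name
    folders: list = field(default_factory=list)


def harvest(groups):
    """Flatten groups and folders into article-level index rows."""
    index = []
    for group in groups:
        for folder in group.folders:
            # List the articles within an item in page order.
            for article in sorted(folder.articles, key=lambda a: a.start_page):
                index.append({
                    "title_id": group.title_id,
                    "item_id": folder.item_id,
                    "article": article.title,
                    "pages": (article.start_page, article.end_page),
                })
    return index
```

Because every row carries both the TitleID and the ItemID, the index can be joined straight back to the scanned volumes.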

PDF storage

Although Mendeley offers PDF storage, this is one feature I'd be less inclined to use. Mendeley's rules for sharing PDFs and making them publicly available are too restrictive (they often don't know whether a PDF can, in fact, be shared). Plus you want tools to visualise, index, and archive PDFs. In effect a big file store with added features. I have some ideas on how this can be implemented (and have a rough working version to support http://iphylo.org/~rpage/itaxon). Alternatively, one could use Internet Archive services.

Summary

As I've often argued, given the success of tools like Mendeley it seems pointless for anyone to try and build yet another online bibliographic database. The trick is to figure out how to leverage what Mendeley provides to support what the taxonomic (and broader biodiversity) community needs.

Tuesday, January 10, 2012

Journals I'd like BHL to scan

I've recently updated my database of links between animal taxonomic names and literature identifiers, which now has over 280,000 names linked to some form of identifier (127,000 of these being DOIs). You can see the current version here:

http://iphylo.org/~rpage/itaxon/

As an experiment I've added a feature to list the number of names for each journal. Based on this list (limited to journals that I've found an ISSN for) here are some journals I'd like to see digitised by the Biodiversity Heritage Library (BHL). Note that by digitised I mean beyond the 1923 cutoff applied to many journals. This will mean negotiating with the journal publishers, but in a number of cases these are scientific societies or institutions, some associated with BHL. Given that major partners in BHL have made post-1923 content available, it would be nice to extend this to other key taxonomic journals.

Revue Suisse de Zoologie

Revue Suisse de Zoologie has published nearly 10,000 taxonomic names but has essentially zero digital presence, which is extraordinary. Another Swiss journal, Entomologica Basiliensia, is also an obvious candidate.

Revue de Zoologie et de Botanique Africaines

Revue de Zoologie et de Botanique Africaines has published over 5,000 names, and given the interest in providing information resources for Africa (e.g., http://www.mendeley.com/groups/1681811/bhl-africa/) this seems an obvious journal to scan completely.

Bulletin of the British Museum (Natural History) journals and books

The Natural History Museum [formerly British Museum (Natural History)] is a member of BHL so I'd expect it to have better coverage of its own publications in BHL. There are gaps in journals such as Bulletin of the British Museum (Natural History) Entomology, which means there is a significant chunk of research published by Museum staff that simply doesn't exist digitally. At one point The Natural History Museum renamed the journals and moved them to Cambridge University Press, resulting in further gaps in digitisation. It's interesting that museums that haven't changed the title of their publications (such as the American Museum of Natural History and the Australian Museum) have better digital coverage than the NHM, which has flirted with various title changes in the last few decades. The Museum also published a series of monographs in the 20th century, many of which aren't in BHL.

Memoirs of the Queensland Museum

The Memoirs of the Queensland Museum is an important journal (> 3,000 names) but has only early issues scanned in BHL and recent issues as PDFs on the Museum web site (vulnerable to link rot when the site gets redesigned, as I've discovered to my cost).

Russian journals

Russian journals contain large numbers of taxonomic descriptions, but their digital presence is patchy. Springer has started to publish translations online (e.g., http://dx.doi.org/10.1134/S0013873810050155 in Entomological Review, which is a translation of an article in Zoologicheskii Zhurnal), but much of the Russian literature seems unavailable in digital form. BHL has spread from its US-UK origins to BHL-Europe, BHL_China, and BHL_Australia; maybe it's time for BHL-Russia?

Summary

There are huge holes in the availability of taxonomic literature (where I equate "availability" with being digitised and online, free or otherwise). But on the other hand I've been pleasantly surprised by just how much taxonomic literature is online. It looks quite feasible to link at least 300,000 animal names to digital publications.

The journals I've highlighted are just a few obvious candidates for scanning. I suspect that as one goes down the list of taxonomic journals the rate of return will decline, to the point where scanning entire journals will be less efficient than scanning targeted articles.