Friday, December 07, 2012

Elsevier articles have interactive phylogenies

Say what you will about Elsevier, they are certainly exploring ways to re-imagine the scientific article. In a comment on an earlier post Fabian Schreiber pointed out that Elsevier have released an app to display phylogenies in articles they publish. The app is based on jsPhyloSVG and is described here. You can see live examples in these articles:

Matos-Maraví, P. F., Peña, C., Willmott, K. R., Freitas, A. V. L., & Wahlberg, N. (2013). Systematics and evolutionary history of butterflies in the “Taygetis clade” (Nymphalidae: Satyrinae: Euptychiina): Towards a better understanding of Neotropical biogeography. Molecular Phylogenetics and Evolution, 66(1), 54–68. doi:10.1016/j.ympev.2012.09.005
Poćwierz-Kotus, A., Burzyński, A., & Wenne, R. (2010). Identification of a Tc1-like transposon integration site in the genome of the flounder (Platichthys flesus): A novel use of an inverse PCR method. Marine Genomics, 3(1), 45–50. doi:10.1016/j.margen.2010.03.001

Thursday, December 06, 2012

NEXUS parser and tree viewer in Javascript

Following on from the SVG experiments I've started to put some of the Javascript code for displaying phylogenies on Github. Not a repository yet, but as gists, little snippets of code. Mike Bostock has created which makes it possible to host gists as working examples, so you can play with the code "live".

The first gist takes a Newick tree, parses it and displays a tree. You can try it at
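To give a feel for what the parsing step involves, here is a minimal sketch of a recursive-descent Newick parser. It is not the code from the gist: the function names and the `{ name, length, children }` node shape are assumptions made for illustration, and quoted labels and comments are not handled.

```javascript
// Minimal recursive-descent Newick parser (illustrative sketch, not the gist's code).
// Each node is { name, length, children }; quoted labels and comments are not handled.
function parseNewick(input) {
  const text = input.trim();
  let pos = 0;

  function skipWhitespace() {
    while (pos < text.length && /\s/.test(text[pos])) pos++;
  }

  // Read an unquoted label or number up to the next structural character.
  function readToken() {
    const start = pos;
    while (pos < text.length && !'(),:;'.includes(text[pos])) pos++;
    return text.slice(start, pos).trim();
  }

  function parseNode() {
    skipWhitespace();
    const node = { name: '', length: null, children: [] };
    if (text[pos] === '(') {
      pos++; // consume '('
      node.children.push(parseNode());
      while (text[pos] === ',') {
        pos++; // consume ','
        node.children.push(parseNode());
      }
      if (text[pos] !== ')') throw new Error('Expected ")" at position ' + pos);
      pos++; // consume ')'
    }
    node.name = readToken();
    if (text[pos] === ':') {
      pos++; // consume ':'
      node.length = parseFloat(readToken());
    }
    return node;
  }

  const root = parseNode();
  if (text[pos] !== ';') throw new Error('Expected ";" at position ' + pos);
  return root;
}

// Count the leaves (nodes with no children) under a node.
function countLeaves(node) {
  return node.children.length === 0
    ? 1
    : node.children.reduce((sum, child) => sum + countLeaves(child), 0);
}

const exampleTree = parseNewick('((a:0.1,b:0.2):0.05,c:0.3);');
console.log(countLeaves(exampleTree));
```

The nested object it returns is the kind of structure a drawing routine can then walk to compute coordinates.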

The second gist takes a basic NEXUS file containing a TREES block and displays a tree (try it at ). You can grab example NEXUS tree files from TreeBASE such as tree Tr57874.
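As a rough sketch of the NEXUS side (again, not the gist's code), the tree statements can be pulled out of a TREES block with a couple of regular expressions. The function name is hypothetical, and the sketch assumes a simple file: TRANSLATE tables, quoted names and nested comments are ignored.

```javascript
// Extract tree statements from the TREES block of a NEXUS file (illustrative sketch).
// Returns [{ name, newick }]; TRANSLATE tables and quoted names are not handled.
function extractNexusTrees(nexus) {
  // Remove NEXUS comments such as [&R] or [a note]; nested comments are not handled.
  const text = nexus.replace(/\[[^\]]*\]/g, '');

  // Grab everything between "BEGIN TREES;" and the closing "END;" (case-insensitive).
  const block = /begin\s+trees\s*;([\s\S]*?)\bend(?:block)?\s*;/i.exec(text);
  if (!block) return [];

  const trees = [];
  // Each statement looks like: TREE name = (newick...);
  const statement = /\btree\s+([^\s=]+)\s*=\s*([^;]*;)/gi;
  let match;
  while ((match = statement.exec(block[1])) !== null) {
    trees.push({ name: match[1], newick: match[2].trim() });
  }
  return trees;
}

const sampleNexus = [
  '#NEXUS',
  'BEGIN TREES;',
  '  TREE tree1 = [&R] ((a,b),c);',
  '  TREE tree2 = ((a,c),b);',
  'END;'
].join('\n');

console.log(extractNexusTrees(sampleNexus));
```

Each `newick` string can then be handed to whatever Newick parser is in use.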

Why am I doing this?
Apart from "because it's fun" there are two reasons. The first is that I want a simple way to display phylogenetic trees in web pages, and doing this entirely in the web browser (Javascript parses the tree and renders it in SVG) saves me having to code this on my server. Being able to do this in the browser opens up the opportunity to embed tree descriptions in HTML, for example, and have the browser render the tree. This means the same web page can have machine-readable data (the tree description) but also generate a nice tree for the reader. As an aside, it also shows that TreeBASE could display perfectly good, interactive trees without resorting to a Java applet.

The other reason is that the web seems to be moving to Javascript as the default language, and JSON as the standard data format. Instead of large chunks of "middleware" (written in a scripting language such as Perl, PHP, or, gack, Java) which is responsible for talking to databases on the server and sending static HTML to the web browser, we now have browsers that can support sophisticated, interactive interfaces built using HTML and Javascript. On the server side we have databases that speak HTTP (essentially removing the need for middleware), store JSON, and use Javascript as their programming language (e.g., CouchDB). In short, it's Javascript, Javascript, everywhere.

Wednesday, December 05, 2012

The Tree of Life

The following poem by David Maddison was published in Systematic Biology (doi:10.1093/sysbio/sys057) under a CC-BY-NC license.

I think that I shall never see
A thing so awesome as the Tree
That links us all in paths of genes
Down into depths of time unseen;

Whose many branches spreading wide
House wondrous creatures of the tide,
Ocean deep and mountain tall,
Darkened cave and waterfall.

Among the branches we may find
Creatures there of every kind,
From microbe small to redwood vast,
From fungus slow to cheetah fast.

As glaciers move, strikes asteroid
A branch may vanish in the void:
At Permian's end and Tertiary's door,
The Tree was shaken to its core.

The leaves that fall are trapped in time
Beneath cold sheets of sand and lime;
But new leaves sprout as mountains rise,
Breathing life anew 'neath future skies.

On one branch the leaves burst forth:
A jointed limb of firework growth.
With inordinate fondness for splitting lines,
Armored beetles formed myriad kinds.

Wandering there among the leaves,
In awe of variants Time conceived,
We ponder the shape of branching fates,
And elusive origins of their traits.

Three billion years the Tree has grown
From replicators' first seed sown
To branches rich with progeny:
The wonder of phylogeny.

Tuesday, December 04, 2012

Viewing phylogenies on the web: Javascript conversion of Newick tree to SVG

Quick test of an idea I'm playing with. By embedding a Newick-format tree description in HTML and adding some Javascript I can go from this:
<div class="newick" data-drawing-type="circlephylogram">((((((((219923430:0.046474,219923429:0.009145):0.037428,219923426:0.038397):0.015434,(219923419:0.022612,219923420:0.015561):0.050529):0.004828,(207366059:0.020922,207366058:0.016958):0.038734):0.003901,219923422:0.072942):0.005414,((219923443:0.038239,219923444:0.025617):0.037592,(219923423:0.056081,219923421:0.055808):0.003788):0.009743):0.001299,(219923469:0.072965,125629132:0.044638):0.012516):0.011647,(((((219923464:0.069894,((((((125628927:0.021470,219923456:0.021406):0.003083,219923455:0.021625):0.029147,219923428:0.042785):0.001234,225685777:0.037478):0.016027,((((56549933:0.003265,219923453:-0.000859):0.015462,150371743:0.009558):0.004969,219923452:0.014401):0.024398,((((((150371732:0.001735,((150371733:0,150371736:0):6.195e-05,150371735:-6.195e-05):7.410e-05):0.000580,150371734:0.001196):0.000767,(150371737:0.001274,(150371738:0,150371740:0):0.000551):0.000498):0.000905,70608555:0.003205):0.004807,150371741:0.010751):8.979e-05,150371739:0.006647):0.022090):0.012809):0.011838,219923427:0.057366):0.009364):0.004238,((219923450:0.022699,125628925:0.012519):0.048088,219923466:0.046514):0.003608):0.007025,((56549930:0.067920,219923440:0.059754):0.002384,((219923438:0.044329,219923439:0.038470):0.014514,(219923442:0.038021,(((207366060:0,207366061:0):0.001859,125628920:0.001806):0.024716,((((125628921:0.005610,207366057:0.003531):0.001354,(207366055:0.003311,207366056:0.002174):0.003225):0.011836,207366062:0.019303):0.003741,((((((207366047:0,207366048:0):0,207366049:0):0.001563,207366050:0.000272):0.002214,(207366051:0.000818,125628919:0.001017):0.000675):0.003916,207366054:0.007924):0.004138,((219923441:0.000975,207366052:-0.000975):0.000494,207366053:-0.000494):0.012373):0.010040):0.003349):0.017594):0.011029):-0.003134):0.011235):0.004149,((((219923435:0.064354,219923424:0.067340):0.002972,219923454:0.045087):0.002092,((219923460:0.027282,219923465:0.025756):0.031269,(219923462:0.017555,219923425:-0.009591):0.047358):0.006198):0.004242,(((219923463:0.031885,(219923459:0.000452,219923458:-0.000452):0.029292):0.005200,225685776:0.024691):0.020131,219923461:0.042563):0.004673):0.009128):0.001452,((56549934:0.088142,56549929:0.066475):0.004212,(219923437:0.048313,219923436:0.044997):0.014553):0.008927):0);</div>
to this (you will need an SVG-capable browser to see anything). The Javascript parses the Newick tree, generates SVG, then replaces the Newick tree in the HTML with the corresponding picture. No need for server-side graphics; the diagram is generated by your web browser based on the Newick tree description.
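The "generates SVG" step might look something like the sketch below, which lays out an already-parsed tree as a simple rectangular phylogram (not the circular layout requested in the `div` above). This is an assumption-laden illustration rather than the code behind this post: the function name and the `{ name, length, children }` node shape are hypothetical, and the function adds `x` and `y` properties to the nodes it is given.

```javascript
// Lay out a parsed tree as a rectangular phylogram and return SVG markup (sketch only).
// Nodes are assumed to look like { name, length, children }, a hypothetical shape.
function treeToSvg(root, width = 400, rowHeight = 20) {
  const leaves = [];
  let maxDepth = 0;

  // First pass: x is the summed branch length from the root, y is the leaf row
  // (internal nodes sit midway between their first and last child).
  function layout(node, depth) {
    node.x = depth;
    maxDepth = Math.max(maxDepth, depth);
    if (node.children.length === 0) {
      node.y = leaves.length;
      leaves.push(node);
    } else {
      node.children.forEach(child => layout(child, depth + (child.length || 0)));
      node.y = (node.children[0].y + node.children[node.children.length - 1].y) / 2;
    }
  }
  layout(root, 0);

  const sx = x => (maxDepth > 0 ? (x / maxDepth) * width : 0); // scale branch lengths
  const sy = y => (y + 0.5) * rowHeight;                       // centre each row

  // Second pass: one elbow-shaped path per branch (vertical, then horizontal).
  const paths = [];
  function draw(node) {
    node.children.forEach(child => {
      paths.push(
        `<path d="M${sx(node.x)} ${sy(node.y)} V${sy(child.y)} H${sx(child.x)}" ` +
        'fill="none" stroke="black"/>'
      );
      draw(child);
    });
  }
  draw(root);

  const height = leaves.length * rowHeight;
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">` +
    paths.join('') + '</svg>';
}

const twoLeafTree = {
  name: '', length: null,
  children: [
    { name: 'a', length: 1, children: [] },
    { name: 'b', length: 2, children: [] }
  ]
};
console.log(treeToSvg(twoLeafTree, 100, 10));
```

In a page, the resulting markup would then replace the contents of each element carrying the Newick string.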
Here's the same tree as a phylogram:

Wednesday, November 28, 2012

ZooBank data model

I'm trying to get my head around the data model used by ZooBank to store taxonomic names. To do this, I've built a graph for the species Belonoperca pylei, described by Baldwin & Smith in:
Baldwin, C. C., & Smith, W. L. (1998). Belonoperca pylei, a new species of seabass (Teleostei: Serranidae: Epinephelinae: Diploprionini) from the cook islands with comments on relationships among diploprionins. Ichthyological Research, 45(4), 325–339. doi:10.1007/BF02725185

After extracting some data from the ZooBank API I created a DOT file connecting the various "taxon name usages" associated with Belonoperca pylei and constructed a graph using GraphViz:
You can grab the DOT file here, and a bigger version of the image is on Flickr. I've labelled taxon names and references with plain text as well as the UUIDs that serve as identifiers in ZooBank. (Update: the original diagram had Belonoperca pylei Baldwin & Smith, 1998 sensu Eschmeyer [9F53EF10-30EE-4445-A071-6112D998B09B] in the wrong place, which I've now fixed.)

This is a fairly simple case of a single species, but it's already starting to look a tad complicated. We have Belonoperca pylei Baldwin & Smith, 1998 linked to its original description (doi:10.1007/BF02725185) and to the genus Belonoperca Fowler & Bean, 1930 (linked to its original publication) as interpreted by ("sensu") Baldwin & Smith, 1998. Belonoperca Fowler & Bean 1930 sensu Baldwin & Smith 1998 is linked to the original use of that genus (i.e., Belonoperca Fowler & Bean, 1930). Then we have the species Belonoperca pylei Baldwin & Smith, 1998 as understood in Eschmeyer's 2004 checklist.

Notice that each usage of a taxon name gets linked back to a previous usage, and names are linked to higher names in a taxonomic hierarchy. When the species Belonoperca pylei was described it was placed in the genus Belonoperca, when Belonoperca was described it was placed in the family Serranidae, and so on.

Tuesday, November 27, 2012

Fuzzy matching taxonomic names using ngrams

Quick note to self about a possible way to use fuzzy matching when searching for taxonomic names. Now that I'm using Cloudant to host CouchDB databases (e.g., see BioStor in the cloud) I'd like to have a way to support fuzzy matching so that if I type in a name and misspell it, there's a reasonable chance I will still find that name. This is the "did you mean?" feature beloved by Google users. There are various ways to tackle this problem, and Tony Rees' TAXAMATCH is perhaps the best known solution.

Cloudant supports Lucene for full text searching, but while this allows some possibility for approximate matching (by appending "~" to the search string) initial experiments suggested it wasn't going to be terribly useful. What does seem to work is to use ngrams. As a crude example, here is a CouchDB view that converts a string (in this case a taxon name) to a series of trigrams (three letter strings) then indexes their concatenation.

{
  "_id": "_design/taxonname",
  "language": "javascript",
  "indexes": {
    "all": {
      "index": "function(doc) { if (doc.docType == 'taxonName') { var n = doc.nameComplete.length; var ngrams = []; for (var i=0; i < n-2;i++) { var ngram = doc.nameComplete.charAt(i) + doc.nameComplete.charAt(i+1) + doc.nameComplete.charAt(i+2); ngrams.push(ngram); } if (n > 2) { ngrams.push('$' + doc.nameComplete.charAt(0) + doc.nameComplete.charAt(1)); ngrams.push(doc.nameComplete.charAt(n-2) + doc.nameComplete.charAt(n-1) + '$'); } ngrams.sort(); index(\"default\", ngrams.join(' '), {\"store\": \"yes\"}); } }"
    }
  }
}

To search this view for a name I then generate trigrams for the query string (e.g., "Pomatomix" becomes "$Po Pom oma mat ato tom omi mix ix$" where "$" signals the start or end of the string) and search on that. For example, append this string to the URL of the CouchDB database to search for "Pomatomix":


Initial results are promising (searching on bigrams generated an alarming number of matches that seemed rather dubious). I need to do some more work on this, but it might be a simple and quick way to support "did you mean?" for taxonomic names.
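The trigram scheme used for the query string can be sketched in a few lines. The function name is hypothetical, and unlike the index function above this version does not sort the trigrams, so its output follows the order shown in the "Pomatomix" example.

```javascript
// Build the trigram query string for a name, marking the start and end with '$'.
// Mirrors the scheme described in the post; the function name is hypothetical.
function trigramQuery(name) {
  const n = name.length;
  if (n < 3) return ''; // the index function emits no trigrams for names this short

  const ngrams = ['$' + name.slice(0, 2)];   // start marker plus first two letters
  for (let i = 0; i < n - 2; i++) {
    ngrams.push(name.slice(i, i + 3));       // sliding window of three letters
  }
  ngrams.push(name.slice(n - 2) + '$');      // last two letters plus end marker
  return ngrams.join(' ');
}

console.log(trigramQuery('Pomatomix'));
// $Po Pom oma mat ato tom omi mix ix$
```

Because the string contains "$" and spaces, it would need to be passed through something like `encodeURIComponent` before being appended to a URL.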

Thursday, November 22, 2012

BioStor in the cloud

Quick note on an experimental version of BioStor that is (mostly) hosted in the cloud. BioStor currently runs on a Mac Mini and uses MySQL as the database. For a number of reasons (it's running on a Mac Mini and my knowledge of optimising MySQL is limited) BioStor is struggling a bit. It's also gathered a lot of cruft as I've worked on ways to map article citations to the rather messy metadata in BHL.

So, I've started to play with a version that runs in the cloud using my favourite database, CouchDB. The data is hosted by Cloudant, which now provides full text search powered by Lucene. Essentially, I simply take article-level metadata from BioStor in BibJSON format and push that to Cloudant. I then wrote a simple wrapper around querying CouchDB, coupled that with the Documentcloud Viewer to display articles and citeproc-js to format the citations (not exactly fun, but someone is bound to ask for them), and we have a simple, searchable database of literature.

If you want to try the cloud-based version go to (code on Github).


I've been wanting to do this for a while, partly because this is how I will implement my entry in EOL's computational data challenge, but also because CrossRef's Metadata search shows the power of finding references simply by using full text search (I've shamelessly borrowed some of the interface styling from Karl Ward's code). David Shorthouse demonstrates what you can do using CrossRef's tool in his post Conference Tweets in the Age of Information Overconsumption. Given how much time I spend trying to parse taxonomic citations and match them to articles in CrossRef's database, or BioStor, I'm looking forward to making this easier.

There are two major limitations of this cloud version of BioStor (apart from the fact it has only a subset of the articles in BioStor). The first is that the page images are still being served from my Mac Mini, so they can be a bit slow to load. I've put the metadata and the search engine in the cloud, but not the images (we're talking a terabyte or two of bitmaps).

The other limitation is that there's no API. I hope to address this shortly, perhaps mimicking the CrossRef API so if one has code that talks to CrossRef it could just as easily talk to BioStor.

Wednesday, November 21, 2012

Species wait 21 years to be described - show me the data

Benoît Fontaine et al. recently published a study concluding that the average lag time between a species being discovered and subsequently described is 21 years.

Fontaine, B., Perrard, A., & Bouchet, P. (2012). 21 years of shelf life between discovery and description of new species. Current Biology, 22(22), R943–R944. doi:10.1016/j.cub.2012.10.029

The paper concludes:

With a biodiversity crisis that predicts massive extinctions and a shelf life that will continue to reach several decades, taxonomists will increasingly be describing from museum collections species that are already extinct in the wild, just as astronomers observe stars that vanished thousands of years ago.

This is a conclusion that merits more investigation, especially as the title of the paper suggests there is an appalling lack of efficiency (or resources) in the way we describe biodiversity. So, with interest I looked at the Supplemental Information for the data:

I was hoping to see the list of the 600 species chosen at random, the publication containing their original description, and the date of their first collection. Instead, all we have is a description of the methods for data collection and analysis. Where is the data? Without the data I have no way of exploring the conclusions or asking additional questions. For example, what is the distribution of date of specimen collection in each species? One could imagine situations where a number of specimens are recently collected, prompting recognition and description of a new species, and as part of that process rummaging through the collections turns up older, unrecognised members of that species. Indeed, if it takes a certain number of specimens to describe a species (people tend to frown upon descriptions based on single specimens) perhaps what we are seeing is the outcome of a sampling process where specimens of new species are rare, they take a while to accumulate in collections, and the distribution of collection dates will have a long tail.

These are the sort of questions we could ask if we had the data, but the authors don't provide that. The worrying thing is that we are seeing a number of high-visibility papers that potentially have major implications for how we view the field of taxonomy but which don't publish their data. Another recent example is:

Joppa, L. N., Roberts, D. L., & Pimm, S. L. (2011). The population ecology and social behaviour of taxonomists. Trends in Ecology & Evolution, 26(11), 551–553. doi:10.1016/j.tree.2011.07.010

Biodiversity is a big data science; it's time we insisted on that data being made available.

Monday, October 22, 2012

Resolving free-form citations

CrossRef have released CrossRef Metadata Search, a nice tool that can take a free-form citation and return possible matches from CrossRef's database. If you get a match CrossRef can take the DOI and format it for you in a variety of styles using DOI content negotiation.
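For the content negotiation step, a request along the following lines asks the DOI resolver for a formatted citation. This is a sketch under assumptions: the function names are hypothetical, `https://doi.org/` is used as the resolver address, and the `text/x-bibliography` media type with a `style` parameter follows the publicly documented CrossRef/DataCite content negotiation convention.

```javascript
// Build a DOI content-negotiation request for a formatted citation (sketch).
// The Accept header asks the resolver for a bibliography entry in a named CSL style.
function citationRequest(doi, style = 'apa') {
  return {
    url: 'https://doi.org/' + encodeURI(doi),
    headers: { Accept: 'text/x-bibliography; style=' + style }
  };
}

// Hypothetical helper showing how the request might be sent.
// Needs a global fetch (modern browsers, Node 18+); it is defined here but not called.
async function fetchCitation(doi, style) {
  const { url, headers } = citationRequest(doi, style);
  const response = await fetch(url, { headers });
  if (!response.ok) throw new Error('HTTP ' + response.status);
  return response.text();
}

console.log(citationRequest('10.1016/j.cub.2012.10.029'));
```

Swapping the `style` value changes the citation format returned, which is what makes the DOI a convenient handle once a free-form citation has been matched.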

If, like me, you spend a lot of time trying to find DOIs (and other identifiers) for articles by first parsing citations into their component parts, then this is good news. It's also good news for publishers that may balk at one of CrossRef's requirements for joining its club: if you want DOIs for your articles it's not enough to submit metadata for your article, you also need to submit the list of references that article cites, including their DOIs. This requirement enables CrossRef to offer their "cited by" service, but imposes a burden on smaller journals operating on a tight budget (e.g., Zootaxa). With CrossRef Metadata Search you can just send author-supplied citation strings from the manuscript and have a good chance of finding the corresponding DOI, if it exists.

Of course, the service only works if the article has a DOI, so it's not a complete solution to being able to parse bibliographic citations into their component parts. But it's a nice model, and I'm tempted to apply the same approach to my databases, such as BioStor or my ever growing Mendeley library (which is larger than the Mendeley desktop client can easily handle). A quick way to do this would be to use Cloudant which has cloud-based CouchDB coupled with a Lucene-based fulltext search engine. If I've time I may try and put a demo together.

Friday, October 19, 2012

The failure of phylogeny databases

It is well known that phylogeny databases such as TreeBASE capture a small fraction of the published phylogenies. This raises the question of how to increase the number of trees that get archived. One approach is compulsion:

In other words:
  1. Databasing trees is the Right Thing™ to do
  2. Few people are doing the Right Thing™
  3. This is because those people are bad/misguided and must be made to see the light

I want to suggest an alternative explanation:
  1. It is not at all obvious that databasing trees is useful
  2. The databases we have suck
  3. There's no obvious incentive for the people producing trees to database them

Why do we need a database of trees?

That we don't have a decent, widely used database of trees suggests that the argument still has to be made. Way back in the mid-1990s when TreeBASE was first starting I was at Oxford University and Paul Harvey (coauthor of The Comparative Method in Evolutionary Biology) was sceptical of its merits. Given that the comparative method depends on phylogenies, and people like Andy Purvis were in the Harvey lab building supertrees, this may seem odd (it certainly did to me), but Paul shared the view of many systematists. Phylogenies are labile, they change with increased data and taxon sampling, hence individual trees have a short life span.

Data, in contrast, is long-lived. You'd happily reuse GenBank sequences published a decade ago, you probably wouldn't use a decade-old phylogeny. I made this point in an earlier post about the data archive Dryad (Data matters but do data sets?). A problem facing packages of data (such as papers, data sets, and phylogenies) is that the package itself may be of limited interest, beyond reproducing earlier results and benchmarking. In the case of phylogenies, if someone has a tree ((a,b),c) and someone else has a tree ((d,e),f), it's not obvious that we can combine these. But if we have sequences for the same gene from the same six taxa we can build a larger tree, say (((a,d),(b,e)),(c,f)).

I think this is part of the reason why GenBank works. Yes, there is compulsion (it's very hard to publish on sequences if you haven't deposited the data in GenBank), but there are clear benefits of depositing data. As the database grows we can do bigger analyses. If you are trying to identify a species based on its DNA, the chances are that the nearest sequence will have been deposited by somebody else. By depositing data your work also lasts longer than if people just had the paper (your tree is likely to be outdated, that sequence from a rare, hard to obtain species might be used for decades to come).

Note that I'm not saying a database of trees isn't a good idea, but there seems to be an assumption that it is so obvious that it doesn't need justification. Demonstrably this isn't the case. Maybe we should figure out what we'd want to do with such a database, then tackle how we'd make that possible. For example, I'd want to query a phylogeny database geographically (show me trees from this part of the globe), by ecological association (find the trees for any parasites on this clade), by temporal period (what clades originated in the Miocene?), by data (what trees used this sequence which we now know is chimeric?), by topology (have we settled on the sister group to snakes yet?), and so on. I would also argue that much of this is doable, but might not actually require archiving published phylogenies. Personally I think anybody tackling these questions would do well to use PhyLoTA as their starting point.

TreeBASE sucks

Yes, I'm as sick of saying this as you are of reading it. But it doesn't change the fact that just about everything about TreeBASE (the complexity of the underlying data model, the choice of programming language, the use of a Java applet to display trees, the Byzantine search interface, and the voluminous XML output) makes TreeBASE a bag of hurt. None of this would matter much if it was an indispensable part of people's research toolkit, but this isn't the case. If you are trying to convince people of the benefits of sharing trees you really want a tool that makes it seem a no-brainer. We aren't there yet.

The "fuck this" point

In a great post on the piracy threshold, Matt Gemmell argues that piracy is largely the fault of content providers because they make being honest too difficult. How many times have you wanted to buy something such as a book or a movie only to discover that the content provider doesn't sell it in your part of the world (e.g., in the iBooks store in the US but not the UK) or doesn't provide it in the media you want (e.g., DVD but not online)? To top it off, every time you go to the movies you are subjected to emotional blackmail or threats of unlimited fines if you were to copy the movie you already paid to watch.

I think databases have the same "fuck this" threshold. If you are asking people to submit data you want to make it as easy as possible. And you want at least some of the benefits to be immediate and obvious. Otherwise you are left with coercing people, and that's being, at best, lazy.

If you want an example of how to do it right, look at Mendeley's model. They want to build a public cloud of academic papers, a laudable goal, the Right Thing™ to do. But they sell the idea not as a public good, not as the Right Thing™, nor by trying to compel people (they can't, they're a private company). Instead they address a major point of pain - where the hell did I put that PDF? - and make it trivial to organise your collection of articles. Then they make it possible to back them up to the cloud, to view them on multiple devices, to share them, and voilà, we get a huge database of publications. The sociology works. So, my question is, what would the equivalent be for phylogenetics?

Wednesday, October 17, 2012


James Rosindell's OneZoom tree viewer is out and the paper describing the viewer has been published in PLoS Biology (disclosure, I was a reviewer):

Rosindell, J., & Harmon, L. J. (2012). OneZoom: A Fractal Explorer for the Tree of Life. PLoS Biology, 10(10), e1001406. doi:10.1371/journal.pbio.1001406.g004
Below is a video where James describes OneZoom.

OneZoom is fun, and is deservedly attracting a lot of attention. But as visually striking as it is, I confess I have reservations about fractal-based viewers. For a start they make it hard to get a sense of the relative size of taxonomic groups. Looking at the mammal tree shown in the video above your eye is drawn to the monotremes, one of the smallest mammalian lineages. That the greatest number of extant mammals are either rodents or bats is not readily apparent. Fractal geometry also removes the timescale, so you can't discover whether radiations in different clades are the same age (unlike, say, if the tree was drawn in a "traditional" fashion with a linear timescale). In some ways I think fractal viewers are rather like the hyperbolic viewers that attracted attention about a decade ago - visually striking but ultimately difficult to interpret. What I'd like to see are studies which evaluate how easily people can navigate different trees and accomplish specific tasks (such as determining closest relationships, relative clade diversity, etc.).

In some ways OneZoom resembles Google Maps with its zoomable interface. But ironically this only serves to illustrate a key difference between OneZoom and Google Maps. Part of the strength of the latter is the consistent conventions for drawing maps (e.g., north is up, south is down) which, when coupled with agreed co-ordinates (latitude and longitude), enables people to mash up geographic data. What I'd like is the equivalent of CartoDB for trees.

Friday, October 12, 2012

Mapping evolutionary biology: @evoldir and #ProjectEvoMap

Robert M. Griffin (@GriffinEvo) has launched ProjectEvoMap. Rob explains:
I have decided this week to try to create a resource where evolutionary
biologists can find info on labs and groups from all around the world. I
have created a collaborative Google map online which evolutionary biology
research groups can pin their labs to with a brief description of their
interests. Others can then browse the map to look for labs in specific
areas – for example, if someone wants to find suitable labs in their
current country for work they can see all the labs in that area, likewise
anyone looking for work in a specific region or who needs access to labs
while on fieldwork can look for nearby groups which may be able to help.

Below is a screen shot of part of the map. If you're working on evolutionary biology now is your chance to literally put your lab on the map.

In parallel I'm experimenting with adding a map to the venerable EvolDir mailing list, for which I run a twitter stream (@evoldir). Using some terribly crude code to extract what looks like an address from EvolDir posts, then calling Google's Geocoding API results in a map of recent posts. You can see the live map at This service complements Rob's by giving a sense of current activity in the community (e.g., conferences, courses, jobs).


Friday, September 28, 2012

Reading the Biodiversity Heritage Library using Readmill

tl;dr Readmill might be a great platform for shared annotation and correction of Biodiversity Heritage Library content.

Thinking about accessing the taxonomic literature I started revisiting previous ideas. One is DeepDyve (see DeepDyve - renting scientific articles). Imagine not having to pay large sums for an article, but being able to rent it. Yes, open access would be great, but ultimately it's all a question of money (who pays and when), the challenge is to find the mix of models that encourage people to digitise the relevant literature. Instead of publishers insisting we pay $US30 for an article, how about renting it for the short time we actually need to read it?

Another model is, a Kickstarter-like company that seeks to raise funds to digitise and make freely available e-Books. has campaigns where people pledge donations, and if sufficient pledges are made the book's rights-holder has the book digitised and released DRM-free.

Looking at I stumbled across Readmill, "a curious community of readers, highlighting and sharing the books they love." Readmill has an iPad app where you can highlight passages of text and add your own annotation. These annotations can be shared, and multiple people can read and comment on the same book. Imagine doing this on BHL content. You could highlight parts of the text where the OCR has failed, and provide a correction. You could highlight taxonomic names that automatic parsers have missed, geographic localities, cited literature, etc. All within a nice, social app.

Even better, Readmill has an API. You can retrieve highlights and comments on those highlights. So, if someone flags a sentence as mangled OCR and provides a correction, that correction could be harvested and fed back to, say, BHL. These corrections could be used to improve searches, as well as the text delivered when generating searchable PDFs, etc.

You can even add highlights via the API, so we could upload an ePub book then add all the taxonomic names found by uBio or NetiNeti, enabling users to see which bits of text are probably names, correcting any mistakes along the way. Instead of giving readers a blank canvas they could already have annotations to start with.

Building an app from scratch to read and annotate BHL content would be a major undertaking. From my cursory initial look I wonder if Readmill might just provide the platform we need to clean up and annotate key parts of the BHL corpus?

Monday, September 24, 2012

Towards a biogeographic search engine

We all have a "past" that we might not advertise widely, and my past includes flirting with panbiogeography. Indeed my PhD thesis hdl:2292/1999 is entitled "Panbiogeography: a cladistic approach." Shortly after graduating I moved on to host-parasite cospeciation and the gene tree/species tree problem ("reconciled trees", see Katz et al. for a recent example of this approach), but part of me misses the glory days of vicariance, dispersal, and panbiogeography.

One thing which strikes me is how little use large-scale historical biogeography makes of GBIF data. One of the things that made Croizat's panbiogeography so interesting was the way he exposed similar distribution patterns in unrelated groups of organisms. He did this by hand, producing map after map, some embellished with all manner of annotations ("gates", "nodes", "massings", etc.). In some ways, Croizat was an early data miner. Now we are awash in distributional data, where are the people revisiting global scale historical patterns? In particular, wouldn't it be cool to have a biogeographic search engine that could pull out taxa with particular distribution patterns that we could then analyse?

For example, while working on a project to map taxonomic names to literature and genomics data, I embedded a widget to display GBIF maps. Every so often I come across taxa that have the classic "Gondwana" distribution pattern. For example, below is a map for stoneflies of the family Notonemouridae from GBIF.

Below is a map for the Notonemouridae using an orthographic projection (see earlier post for details):

Another family of stoneflies, the Gripopterygidae, show a similar pattern:


What I'd like is to be able to query a database like GBIF for patterns such as these Gondwanic distributions, then be able to pull out associated phylogenetic information (e.g., via sequences in GenBank) so that we could determine the antiquity of these patterns, and whether they are consistent with geological models. We could begin to do large-scale testing of biogeographic hypotheses in a (semi-)automated way. At present we generally rely on a few well-studied examples that are either broadly consistent with
Bocxlaer, I. V., Roelants, K., Biju, S. D., Nagaraju, J., & Bossuyt, F. (2006). Late Cretaceous Vicariance in Gondwanan Amphibians. (M. Hofreiter, Ed.)PLoS ONE, 1(1), e74. doi:10.1371/journal.pone.0000074.t002

or contradict
Cook, L. G., & Crisp, M. D. (2005). Not so ancient: the extant crown group of Nothofagus represents a post-Gondwanan radiation. Proceedings of the Royal Society B: Biological Sciences, 272(1580), 2535–2544. doi:10.1098/rspb.2005.3219

the hypothesis that the history of biota of the southern hemisphere has been largely structured by the break-up of Gondwana.

A first step might be to index distributions at, say, family level and above, and provide a series of polygons representing different distribution patterns. We then search for distributions that are largely concordant with those patterns, and query GenBank (or TreeBASE) for sequences (or phylogenies) for those taxa. We then ask the questions "how old are these taxa?" and "what biogeographic histories do they have?"
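As a rough sketch of what "largely concordant" could mean in practice, the Python below scores a set of occurrence points against a pattern made of polygons. Everything specific here is an assumption for illustration: the two boxes in GONDWANA_PATTERN are crude made-up longitude/latitude rectangles, not real biogeographic regions, and the function names are hypothetical. It uses a plain ray-casting point-in-polygon test and ignores map projection and the antimeridian.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)
Polygon = List[Point]

# Hypothetical "pattern": two crude boxes, purely for illustration.
GONDWANA_PATTERN: List[Polygon] = [
    [(-80.0, -60.0), (-60.0, -60.0), (-60.0, -30.0), (-80.0, -30.0)],
    [(110.0, -50.0), (180.0, -50.0), (180.0, -10.0), (110.0, -10.0)],
]


def point_in_polygon(pt: Point, polygon: Polygon) -> bool:
    """Ray casting: count polygon edges crossed by a ray heading east from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def concordance(occurrences: Sequence[Point], pattern: Sequence[Polygon]) -> float:
    """Fraction of occurrence points that fall inside any polygon of the pattern."""
    if not occurrences:
        return 0.0
    hits = sum(
        1 for pt in occurrences
        if any(point_in_polygon(pt, poly) for poly in pattern)
    )
    return hits / len(occurrences)
```

A taxon could then be flagged as a candidate when its score exceeds some threshold (say 0.9, an arbitrary choice), before going on to look for sequences or trees for it.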

Saturday, September 22, 2012

Touching the tree of life

Prompted by a conversation with Vince Smith at the recent Online Taxonomy meeting at the Linnean Society in London I've been revisiting touch-based displays of large trees. There are a couple of really impressive examples of what can be done.

Perceptive Pixel

I've blogged about this before, but came across another video that better captures the excitement of touch-based navigation of a taxonomy. Perceptive Pixel's (recently acquired by Microsoft) Jeff Han demos browsing an animal classification. The underlying visualisation is fairly straightforward, but the speed and ease with which you can interact with it clearly makes it fun to use.


DeepTree comes from Life on Earth lab, and there's a paper coming out by @blockflorian and colleagues (I was reminded of this project by @treevisproject):

For technical details on the layout algorithm see Below is a video of it in use:

Both of these are really nice, but what I really want is to have this on my iPad…

Saturday, September 08, 2012

Decoding Nature's ENCODE iPad app - OMG it's full of ePUB

The release of the ENCODE (ENCyclopedia Of DNA Elements) project has generated much discussion (see Fighting about ENCODE and junk). Perhaps perversely, I'm more interested in the way Nature has packaged the information than the debate about how much of our DNA is "junk."

Nature has a website that demonstrates the use of "threads" to navigate through a set of papers. Instead of having to read every paper you can pick a topic and Nature has collected a set of extracts on that topic (such as a figure and its caption) from the relevant papers and linked them together as a thread. Here is a video outlining the rationale behind threads.

Threads can be viewed on Nature's web site, and also in the iPad app. The iPad app is elegant, and contains full text for articles from Nature, Genome Research, Genome Biology, and BMC Genetics. Despite being from different journals the text and figures from these articles are displayed in the same format in the app. Curious as to how this was done I "disassembled" the iPad app (see Extract and Explore an iOS App in Mac OS X for how to do this). If you've downloaded the app on your iPad and synced the iPad with your Mac, then the apps are in the "iTunes/iTunes Media/Mobile Applications" folder inside your "Music" folder. The app contains a file called, and inside that folder are the articles and threads, all as ePub files. ePub is the format used by a number of book-reading apps, such as Apple's iBooks. Nature has a lot of experience with ePub, using it in their iPhone and iPad journal apps (see my earlier article on these apps, and my web-based clone for more details).
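If you'd rather poke around from a script, the same "disassembly" can be sketched in Python, since an .ipa file is an ordinary zip archive. The function name and the idea of filtering on the ".epub" extension are my assumptions for illustration; I'm not claiming this reproduces the exact layout of the ENCODE app.

```python
import zipfile
from typing import List


def list_epubs(ipa_path: str) -> List[str]:
    """Return the names of all entries ending in .epub inside an .ipa (zip) archive."""
    with zipfile.ZipFile(ipa_path) as ipa:
        return sorted(
            name for name in ipa.namelist()
            if name.lower().endswith(".epub")
        )
```

Each ePub is itself a zip archive, so the same zipfile module can be pointed at an extracted ePub to get at the HTML inside.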

ePub has several advantages in this context over, say, PDFs. Because ePub is essentially HTML, the text and images can be reflowed, and it is possible to style the content consistently (imagine how much clunkier things would have looked if the app had used PDFs of the articles, each in the different journals' house style). Having the text in ePub also makes creating threads easy: you simply extract the relevant chunks and combine them into a new ePub file.

Threads are an interesting approach, particularly as they cut across the traditional boundaries of individual articles to create a kind of "mash up." Of course, in the ENCODE app these are preselected for you, you can't create your own thread. But you could imagine having an app that would enable you to not just collect the papers relevant to a topic (as we do with bibliographic software), but enable you to extract the relevant chunks and create a personalised mash up across papers from multiple journals, each linked back to the original article (much like Ted Nelson envisioned for the Xanadu project). It will be interesting to see whether thread-like approaches get more widely adopted. Whatever happens, Nature are consistently coming up with innovative approaches to displaying and navigating the scientific literature.

Wednesday, September 05, 2012

BHL is duplicating DOIs because it doesn't know about articles

Quick note that as much as I like that the Biodiversity Heritage Library is using DOIs, they are generating them for publications that already have them (or are acquiring them from other sources). For example, here are the two DOIs for the same article (formatted using the DOI Citation Formatter), one from BHL and one from the Smithsonian:

Springer, V. G. (1982). Pacific Plate biogeography, with special reference to shorefishes / Victor G. Springer. Smithsonian Institution. doi:10.5962/bhl.title.37141
Springer, V. G. (1982). Pacific Plate biogeography, with special reference to shorefishes. Smithsonian Contributions to Zoology, (367), 1–182. doi:10.5479/si.00810282.367

The BHL DOI resolves to a page in BHL, the other DOI resolves to a page in the Smithsonian Digital Repository (this article also has the handle hdl:10088/5222).

Now this is a problem, because DOIs are meant to be unique: one article, one DOI. I've encountered duplicates elsewhere, but in these cases one should be an alias of the other. In the example above, the DOIs resolve to different locations. If you are just after the content this isn't a huge problem, but if, say, you were using the DOI to uniquely identify the publication (say, in a database) you have a problem: which DOI to choose? If you and I choose differently then we will make statements about the same article but be unaware of that sameness.

Much of this problem arises because BHL has no concept of articles. Most articles are likely to reside within scanned volumes of a journal, but some articles (e.g., monographs) may be treated as a single title by BHL, and each BHL title now gets a DOI.

I know that handling articles is on BHL's radar, but because it hasn't tackled it yet we are going to have cases where BHL DOIs duplicate existing DOIs. In these cases, BHL may have to make their DOI an alias of the other DOI.
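Until that happens, anyone storing DOIs in a database needs their own way of recognising duplicates. A minimal sketch in Python, assuming a hand-maintained alias table: the table, the function names, and the choice of the Smithsonian DOI as the "canonical" one are all mine, purely for illustration. The only fact relied on is that DOIs are case-insensitive.

```python
from typing import Dict

# Hypothetical alias table: duplicate DOI -> the DOI we choose to treat as canonical.
DOI_ALIASES: Dict[str, str] = {
    "10.5962/bhl.title.37141": "10.5479/si.00810282.367",
}


def canonical_doi(doi: str) -> str:
    """Normalise a DOI (strip a 'doi:' prefix, lower-case it) and resolve any alias."""
    key = doi.strip().lower()
    if key.startswith("doi:"):
        key = key[4:]
    return DOI_ALIASES.get(key, key)


def same_article(doi_a: str, doi_b: str) -> bool:
    """True if two DOIs refer to the same article according to the alias table."""
    return canonical_doi(doi_a) == canonical_doi(doi_b)
```

With this in place, two people who picked different DOIs for the Springer monograph would at least end up talking about the same record.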

Thursday, August 02, 2012

Google Knowledge Graph using data from BBC and Wikipedia

Google's Knowledge Graph can enhance search results by displaying some structured information about a hit in your list of results. It's available in the US (i.e., you need to use, although I have seen it occasionally appear for

Here is what Google displays for Eidolon helvum (the straw-coloured fruit bat). You get a snippet of text from Wikipedia, and also a map from the BBC Nature Wildlife site. Wikipedia is a well-known source of structured data (in that you can mine the infoboxes for information). The BBC site has some embedded RDFa and structured HTML, and you can also get RDF (just append ".rdf" to the URL, i.e., There doesn't seem to be anything in the RDF about the distribution map, so presumably Google are extracting that information from the HTML.

It would be interesting to think about what other biodiversity data providers, such as GBIF and EOL, could do to get their data incorporated into Google's Knowledge Graph, and eventually into these search result snippets.

Tuesday, July 24, 2012

Dear GBIF, please stop changing occurrenceIDs!

If we are ever going to link biodiversity data together we need to have some way of ensuring persistent links between digital records. This isn't going to happen unless people take persistent identifiers seriously.

I've been trying to link specimen codes in publications to GBIF, with some success, so imagine my horror when it started to fall apart. For example, I recently added this paper to BioStor:

A remarkable new asterophryine microhylid frog from the mountains of New Guinea. Memoirs of The Queensland Museum 37: 281-286 (1994)

This paper describes a new frog (Asterophrys leucopus) from New Guinea, and BioStor has extracted the specimen code QM J58650 (where "QM" is the abbreviation for Queensland Museum), which according to the local copy of GBIF data that I have, corresponds to Unfortunately, if you click on that link GBIF denies all knowledge (you get bounced to the search page). After a bit of digging I discover that specimen is now in GBIF as At some point GBIF has updated its data and the old occurrenceID for QM J58650 (363089399) has been deleted. Noooo!

Looking at the old record I have there is an additional identifier:
urn:catalog:QM: Herpetology:J58650

This is a URN, and it's (a) unresolvable and (b) invalid as it contains a space. This is why URNs are useless. There's no expectation they will be resolvable hence there's no incentive to make sure they are correct. It's as much use as writing software code but not bothering to run it (because surely it will work, no?).
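Catching this kind of error takes very little code. Here is a minimal sketch in Python of a sanity check; the pattern is a deliberately simplified approximation of URN syntax (the "urn:" prefix, a namespace identifier, a colon, then a namespace-specific string with no whitespace) and not a full implementation of the URN grammar. The function name is made up.

```python
import re

# Simplified pattern: "urn:", a namespace identifier of up to 32 characters,
# ":", then one or more non-whitespace characters. An approximation only.
URN_PATTERN = re.compile(r"urn:[a-z0-9][a-z0-9-]{0,31}:\S+", re.IGNORECASE)


def looks_like_valid_urn(urn: str) -> bool:
    """Return True if the whole string matches the simplified URN pattern."""
    return URN_PATTERN.fullmatch(urn) is not None
```

Run against the identifier above, the check fails because of the space after "QM:"; with the space removed it passes.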

The GBIF record contains a UUID as an alternative identifier:

If you Google this you discover a record in the Atlas of Living Australia, which also lists the URN from the now deleted GBIF record

I'm guessing that at some point the OZCAM data provided to GBIF was updated and instead of updating data for existing occurrenceIDs the old ones were deleted and new ones created (possibly because OZCAM switched from URNs to UUIDs as alternative identifiers). Whatever the reason, I will now need to get a new copy of GBIF occurrence data and repeat the linking process. Sigh.

If we are ever going to deliver on the promise of linking biodiversity data together we need to take identifiers seriously. Meantime I need to think about mechanisms to handle links that disappear on a whim.
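One possible mechanism, sketched in Python: when a fresh copy of the data arrives, rebuild the mapping from old occurrenceIDs to new ones using a key that ought to survive re-indexing, such as institution code plus catalogue number. The field names are borrowed from Darwin Core, but the record layout (plain dictionaries) and the function names are assumptions for illustration, and the approach only works where those two fields are themselves stable and unique.

```python
from typing import Dict, Iterable, Tuple

Record = Dict[str, str]


def specimen_key(record: Record) -> Tuple[str, str]:
    """Key that should survive a re-index: institution code plus catalogue number."""
    return (
        record["institutionCode"].strip().upper(),
        record["catalogNumber"].strip().upper(),
    )


def relink(old_records: Iterable[Record], new_records: Iterable[Record]) -> Dict[str, str]:
    """Map old occurrenceIDs to new ones for specimens present in both data sets."""
    new_by_key = {specimen_key(r): r["occurrenceID"] for r in new_records}
    mapping: Dict[str, str] = {}
    for record in old_records:
        key = specimen_key(record)
        if key in new_by_key:
            mapping[record["occurrenceID"]] = new_by_key[key]
    return mapping
```

Old IDs that fail to appear in the mapping are the ones that have genuinely vanished and need chasing by hand.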

Monday, July 23, 2012

Microbiome as climate, macrobiome as weather, and a global model of biodiversity

Half-baked idea time. Thinking about projects such as the Earth Microbiome Project and Genomic Observatories, the recent GBIC2012 meeting (I'm still digesting that meeting), and mulling over the book A Vast Machine I keep thinking about the possible parallels between climate science and biodiversity science.

One metaphor from "A Vast Machine" is the difference between "global data" and "making data global". Getting data from around the world ("global data") is one challenge, but then comes making that data global:
building complete, coherent, and consistent global data sets from incomplete, inconsistent, and heterogeneous datasources

The focus of GBIF's data portal is global data, bringing together specimen records and observations from all around the world. This is global data, but one could argue that it's not yet ready to be used for most applications. For example, GBIF doesn't give you the geographic distribution of a given species, merely where it's been recorded from (based on that subset of records that have been digitised). That's a very important start, but if we had for each species an estimated distribution based on museum records, observations, published maps, together with habitat modelling, then we'd be closer to a dataset that we could use to tackle key questions about the distribution of biodiversity.

But if we continue with the theme that microbiology is the dark matter of biology, and if we look at projects like the Earth Microbiome Project, then we could argue that focussing on eukaryotes, particularly macro-eukaryotes such as plants, fungi, and animals, may be a mistake. To use a crude analogy, perhaps we have been focussing on the big phenomena (equivalent to thunder storms, flash floods, tornados, etc.) rather than the underlying drivers (equivalent to climatic processes such as those captured in global climate models). Certainly, any attempt to model the biosphere is going to have to include the microbiome, and indeed perhaps the microbiome would be enough to have a working model of the biosphere?

I'm simply waving my arms around here (no, really?), but it's worth thinking about whether the macroecology that conservation and biodiversity focusses on is actually the important thing to consider if you want to model fundamental biological processes. Might macro-organisms be like the weather, and the microbiome like the climate? As humans we notice the weather, because it is at a scale that affects us directly. But if the weather is a (not entirely predictable) consequence of the climate, what is the equivalent of a global climate model for biodiversity?

Friday, July 20, 2012

Figshare and F1000 integrate data into publication: could TreeBASE do the same?

Quick thoughts on the recent announcement by figshare and F1000 about the new journals being launched on the F1000 Research site. The articles being published have data sets embedded as figshare widgets in the body of the text, instead of being, say, a static table. For example, the article:

Oliver, G. (2012). Considerations for clinical read alignment and mutational profiling using next-generation sequencing. F1000 Research. doi:10.3410/f1000research.1-2.v1
has a widget that looks like this:

You can interact with this widget to view the data. Because the data are in figshare those data are independently citable, e.g. the dataset "Simulated Illumina BRCA1 reads in FASTQ format" has a DOI

Now, wouldn't it be cool if TreeBASE did something similar? Imagine if uploading trees to TreeBASE were easy, and that you didn't have to have published yet, you just wanted to store the trees and make them citable. Imagine if TreeBASE had a nice tree viewer (no, not a Java applet, a nice viewer that uses SVG, for example). Imagine if you could embed that tree viewer as a widget when you published your results. It's a win all round. People have an incentive to upload trees (nice viewer, place to store them, and others can cite the trees because they'd have DOIs). TreeBASE builds its database a lot more quickly (make it dead easy to upload trees), and then as more publishers adopt this style of publishing TreeBASE is well placed to provide nice visualisations of phylogenies that are pre-packaged, interactive, and citable. And let's not stop there, how about a nice alignment viewer? Perhaps this is something the currently rather moribund PLoS Currents Tree of Life could think about supporting?

Tuesday, July 17, 2012

Building a BHL Africa: BHL in a box

Was going to post this as a comment on the BHL blog but they use Blogger's native comment system, which is horrible, and it refused to accept my comment (yes, yes, I'm sure it did that on grounds of taste). I read the recent post Building a BHL Africa and couldn't believe my eyes when I read the following:

the "BHL in a Box" concept was highly desired. This would entail creating interactive CDs of BHL content for distribution in areas where internet access is unreliable or unavailable.
CDs! Really? Surely this is crazy!? You want to use an obsolete technology that requires additional obsolete technology to ship BHL around Africa? Why not ship relevant parts of BHL on iPads? Lots more storage space than CDs, built-in interactivity (obviously need to write an app, but could use HTML + Javascript as a starting point), long battery life, portable, comes with 3G support if needed. I'll be the first to admit that my knowledge of Africa is about zero, but given that mobile devices are common, mobile networks are fairly well developed, and tablets are making inroads (see iPad has become a big factor in African business) surely "BHL mobile" is the way to go to provide "BHL in a box", not CDs.

Why not develop an app that stores BHL content on a device like an iPad, then distribute those? Support updating the content over the network so the user isn't stuck with content they no longer need. In effect, something like Amazon's Kindle app or iBooks would do the trick. You'd need to compress BHL content to keep the size down (the images BHL currently displays on its web site could be made a lot smaller) but this is doable. Indeed, BHL Africa could be an ideal motivation to move BHL to platforms such as phones and tablets, where at the moment users have to struggle with a website that makes no concessions to those devices.

Of course, it doesn't have to be the iPad as such. Imagine if BHL published books and articles on Amazon, then used Kindle to deliver content physically (i.e., ship Kindles), and anyone else could access it directly from Amazon using their Kindle (or Kindle app on iPad).

Friday, July 13, 2012

Sometimes the mess taxonomy creates drives me nuts

Playing with some sequence data I found numerous Plasmodium sequences from the following paper:

Werner, E. B., Taylor, W. R., & Holder, A. A. (1998). A Plasmodium chabaudi protein contains a repetitive region with a predicted spectrin-like structure. Molecular and Biochemical Parasitology, 94(2), 185–196. doi:10.1016/S0166-6851(98)00067-X

These sequences (e.g., U43145) give the host as Thamnomys rutilans. You'd think it would be fairly easy to learn more about this animal, given that it hosts a relative of the cause of malaria in humans, and indeed there are a number of biomedical papers that come up in Google, e.g.:

Landau, I., & Chabaud, A. (1994). Advances in Parasitology (Vol. 33, pp. 49–90). Elsevier BV. doi:10.1016/S0065-308X(08)60411-X
Killick-Kendrick, R. (1968). Malaria parasites of Thamnomys rutilans (Rodentia, Muridae) in Nigeria. Bull World Health Organ. 1968; 38(5): 822–824. PMC2554675

Google also tells me that Thamnomys rutilans is an African rodent (e.g., 6.1.6. Rodent malaria), but NCBI has no sequences for "Thamnomys rutilans", and GBIF has no data on its distribution. If I search Mammal Species of the World I get (literally) "nothing found ...".

So, this is an African rodent, host to Plasmodium, and we know nothing about it? A bit of Googling, a trip to Wikipedia and Google Books reveals that Thamnomys rutilans is a synonym of Grammomys rutilans, but it is now called Grammomys poensis because the original name (Mus rutilans Peters 1876) is a junior homonym of Mus rutilans Olfers, 1818 (simples). You can see the original description of Mus rutilans Peters 1876 in BioStor (this took some tracking down, but that's another story):


The original description of Mus rutilans Olfers, 1818 is given by The description of a new species of South American hocicudo, or long-nose mouse, genus Oxymycterus (Sigmodontinae, Muroidea), with a critical review of the generic content as:

Olfers, I. 1818. Bemerkungen zu Illiger's Ueberblick der Saugethiere nach ihrer Betheilung über die Welttheile rüchsichtlich der Südamerikanischen Arten (Species). In Eschwege, W. L., ed., Journal von Brasilien, Weimar, 15(2): 192-237.

This reference doesn't seem to be online.

The upshot of all this is that information about the host of Plasmodium chabaudi is hidden behind taxonomic name changes, and databases that one might expect to help simply don't. If names are the glue that link biodiversity data together then we need to get a lot better at making basic information about name changes accessible, otherwise we are creating islands of disconnected data.

Thursday, July 12, 2012

Dimly lit taxa - guest post by Bob Mesibov

The following is a first for iPhylo, a guest post by Bob Mesibov.

Rod Page introduced 'dark taxa' here on iPhylo in April 2011. He wrote:

The bulk of newly added taxa in GenBank are what we might term "dark taxa", that is, taxa that aren't identified to a known species. This doesn't necessarily mean that they are species new to science, we may already have encountered these species before, they may be sitting in museum collections, and have descriptions already published. We simply don't know. As the output from DNA barcoding grows, the number of dark taxa will only increase, and macroscopic biology starts to look a lot like microbiology.

Rod suggested that 'quite a lot' of biology can be done without taxonomic names. For the dark taxa in GenBank, that might well mean doing biology without organisms – a surprising thought if you're a whole-organism biologist.

Non-taxonomists may be surprised to learn that a lot of taxonomy is also done, in fact, without taxonomic names. Not only is there a 'dark taxa' gap between putative species identified genetically and Linnaean species described by specialists, there's a 'dimly lit taxa' gap between the diversity taxonomists have already discovered, and the diversity they've named.

Dimly lit taxa range from genera and species given code names by a specialist or a group of collaborators, and listed by those codes in publications and databases, to potential type specimens once seen and long remembered by a specialist who plans to work them up in future, time and workload permitting.

In that phrase 'time and workload permitting' is a large part of the explanation for dimly lit taxa. Over the past month I created 71 species of this kind myself. Each has been code-named, diagnostically imaged, databased and placed in code-labelled bottles on museum shelves. The relevant museums have been given digital copies of the images and data.

The 71 are 'species-in-waiting'. They aren't formally named and described, but specialists like myself can refer to the images and data for identifying new specimens, building morphological and biogeographical hypotheses, and widening awareness of diversity in the group to which the 71 belong.

'Time and workload permitting'. Many of the 71 are poor-quality or fragmented museum specimens from which important morphological data, let alone sequences, cannot be obtained. Fresh specimens are needed, and fieldwork is neither quick nor easy. In my special corner of zoology, as in most such corners in zoology and botany, the widespread and abundant species are all, or nearly all, named. The unnamed rump consists of a huge diversity of geographically restricted and uncommon species. There are more than 71 in that group of mine; those are just the rare species I know about, so far.

'Time and workload permitting'. A non-taxonomist might ask, 'Why don't you just name and describe the 71 briefly, so that the names are at least available, and the gap between what's known and what's named is narrowed?' The answer is simple: inadequate descriptions are the bane of taxonomy. There are hundreds of species in my special group that were named and inadequately described long ago, and which wind up on checklists of names as 'nomen dubium' and 'incertae sedis'. Clearing up the mysteries means locating the types (which hopefully still exist) and studying them. That slow and tedious study would better have been done by the first describer.

Cybertaxonomic tools can help bring dimly lit taxa into full light, but not much. The rate-limiting steps in lighting up taxa are in the minds and lives of human taxonomists coping with the huge and bewilderingly complex diversity of life. It's not the tools used after the observing and thinking is done, it's the observing and thinking.

In their article 'Ramping up biodiversity discovery via online quantum contributions', Maddison et al. argue that the pace of naming and description can be increased if information about what I've called dimly lit taxa is publicly posted, piece by piece, 'publish as you go', on the Internet. In my case, I would upload images and data for my 71 'species-in-waiting' to suitable sites and make them freely available.

Excited by these discoveries, amateurs and professionals would rush to search for fresh specimens. Specialists would drop whatever else they were doing, borrow the existing specimens of the 71 from their repositories and do careful inventories of the morphological features I haven't documented. Aroused from their humdrum phylogenetic analyses of other organisms, molecular phylogeny labs would apply for extra funding to work on my 71 dimly lit taxa. In no time at all, a proud team of amateurs and specialists would be publishing the results of their collaboration, with 71 names and descriptions.

Shortly afterwards, flocks of pigs would slowly circle the 71 type localities, flapping their wings in unison.

Memo to Maddison et al. and other would-be reformers: the rate of taxonomic discovery and documentation is very largely constrained by the supply of taxonomists. You want more names, find more namers.

Wednesday, July 11, 2012

Citations, Social Media & Science

Quick note that Morgan Jackson (@BioInFocus) has written a nice blog post Citations, Social Media & Science inspired by the fact that the following paper:

Kwong, S., Srivathsan, A., & Meier, R. (2012). An update on DNA barcoding: low species coverage and numerous unidentified sequences. Cladistics, no–no. doi:10.1111/j.1096-0031.2012.00408.x

cites my "Dark taxa" in the body of the text but not in the list of literature cited. This prompted some discussion of DOIs and blog posts on Twitter:

Read Morgan's post for more on this topic. While I personally would prefer to see my blog posts properly cited in papers like doi:10.1111/j.1096-0031.2012.00408.x, I suspect the authors did what they could given current conventions (blogs lack DOIs, are treated differently from papers, and many publishers cite URLs in the text, not the list of references cited). If we can provide DOIs (ideally from CrossRef so we become part of the regular citation network), suitable archiving, and — most importantly — content that people consider worthy of citation then perhaps this practice will change.

Friday, July 06, 2012

Post GBIC2012 thoughts

I'm back from Copenhagen and GBIC2012. The meeting spanned three fairly intense days (with the days immediately before and after also working days for some of us), and was run by a group of facilitators led by Natasha Walker, who described us as "an interesting (and delightfully brainy, if sometimes scatty) group of academics, researchers, museum managers and people close to policy...". I've attempted to capture tweets about the meeting using Storify.

There will be a document (perhaps several) based on the meeting, but until then here are a few quick thoughts. Note that the comments below are my own and you shouldn't read into this anything about what directions the GBIC document(s) will actually take.

Microbiology rocks

Highlight of the first day was Robert J. Robbins' talk which urged the audience to consider that life was mostly microbial, that the things most people in the room cared about were actually merely a few twigs on the tree of life, that the tree of life didn't actually exist anyway, and many of the concepts that made sense for multicellular organisms simply didn't apply in the microbial world. Basically it was a homage to Carl Woese (see also Pace et al. 2012 doi:10.1073/pnas.1109716109) and a wake up call to biodiversity informaticians to stop viewing the world through multicellular eyes. (You can find all the keynotes from the first day here).

From Pace, N. R. (1997). A Molecular View of Microbial Diversity and the Biosphere. Science, 276(5313), 734–740. doi:10.1126/science.276.5313.734

Sequences rule

The future of a lot of biodiversity science belongs to sequences, from simple DNA barcoding as a tool for species discovery and identification, metabarcoding as a tool for community analysis, to comparisons of metabolic pathways and beyond. The challenge for classical biodiversity informatics is how to engage with this, and to what extent we should try and map between, say, sequences and classical taxa, or whether it might make more sense (gasp) to abandon the taxonomic legacy and move on. Perhaps a more nuanced response is that the point of connection between sequences and classical biodiversity data is unlikely to be at the level of taxonomic names (which are mostly tags for collections of things that look similar) but at the level of specimens and observations.

Ontologies considered harmful

This is my own particular hobby horse. Often the call would come "we need an ontology", to which I respond read Ontology is Overrated: Categories, Links, and Tags. I have several problems with ontologies. The first is that they are too easy to make and distract from the real problem. From my perspective a big challenge is linking data together, that is going from




Let's leave aside what "A" and "B" are (I suspect it matters less than people think), once we have the link then we can start to do stuff. From my perspective, what ontologies give us is basically this:


So now we know the "type" of the link (e.g., "is a part of", "cites approvingly", etc.). I'm not arguing that this isn't useful to have, but if you don't have the network of links then typing the links becomes an idle exercise.

To give an example, the web itself can be modelled as simply nodes connected by links, ignoring the nature of the links between the web pages. The importance of those links can be inferred later from properties of the network. To a first approximation this is how Google works, it doesn't ask what the links "mean" it simply investigates the connections to determine how important each web page is. In the same way, we build citation networks without bothering to ask the nature of the citation (yes I know there are ontologies for citations, but anyone willing to bet how widely they'll be adopted?).
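To make the point concrete, here is a minimal sketch in Python of inferring importance purely from untyped links, using the textbook PageRank idea (power iteration with a damping factor). It is an illustration under my own assumptions, not a description of what Google actually runs: the function name, the default damping of 0.85, and the fixed number of iterations are all choices made for the example.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Score nodes from link structure alone; link 'types' are never consulted."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share, then receives rank via its in-links.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

Feed it nothing but "A links to C, B links to C, C links to A" and C comes out on top, with no ontology in sight. The same would work on a citation network keyed by DOI.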

My second complaint is that building ontologies is easy, "easy" in the sense that you get a bunch of people together, they squabble for a long time about terminology, and out comes an ontology. Maybe, if you're lucky, someone will adopt it. The cost of making ontologies, and indeed of adopting them is relatively low (although it might not seem like it at the time). The cost of linking data is, I'd argue, higher, because it requires that you trust someone else's identifiers to the extent that you use them for things you care about deeply. Consider the citation network that is emerging from the widespread adoption of DOIs by the publishing industry. Once people trust that the endpoints of the links will survive, then the network starts to grow. But without that trust, that leap of faith, there's no network (unless you have enough resources to build the whole thing internally yourself, which is what happened with the closed citation network owned by Thomson Reuters). It's much easier to silo the data using unique identifiers than it is to link to other data (it's a variant of the "not invented here" syndrome).

Lastly, ontologies can have short lives. They reflect a certain world view that can become out of date, or supplanted if the relationships between things that the ontology cares about can be computed using other data. For example, biological taxonomy is a huge ontology that is rapidly being supplanted by phylogenetic trees computed from sequence (and other) data (compare the classification used by flagship biodiversity projects like GBIF and EOL with the Pace tree of life shown above). Who needs an ontology when you can infer the actual relationships? Likewise, once you have GPS the value of a geographic ontology (say of place names) starts to decline. I can compute if I'm on a mountain simply by knowing where I am.

I'm not saying ontologies are always bad (they're not), nor that they can't be cool (they can be), I'm just suggesting that they aren't the first thing you need. And they certainly aren't a prerequisite for linking stuff together.

Google flu trends

Perhaps the most interesting idea that emerged was the notion of intelligently detecting changes in biodiversity (which is the kind of thing a lot of people want to know) in a way analogous to how Google Flu Trends uses flu-related search terms to predict flu outbreaks:

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2008). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014. doi:10.1038/nature07634

Could we do something like this for biodiversity data? For various reasons this suggestion became known at GBIC2012 as the "Heidorn paradigm".

Thinking globally

One challenge for a meeting like GBIC 2012 is scope. There's so much cool stuff to think about. From my perspective, a useful filter is to ask "what will happen anyway?" In other words, there is a lot of stuff (for example the growth of metabarcoding) that will happen regardless of anything the biodiversity informatics community does. People will make taxon-specific ontologies for organismal traits, digitise collections, assess biodiversity, etc. without necessarily requiring an entity like GBIF. The key question is "what won't happen at a global scale unless GBIF (or some other entity) gets involved?"

A Vast Machine

Lastly, in one session Tom Moritz mentioned a book that he felt we could learn from (A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming). The book recounts the history of climatology and its slow transition to a truly global science. I've started to read it, and it's fascinating to see the interplay between early visions of the future, and the technology (typically driven by military or large-scale commercial interests) that made possible the realisation of those visions. This is one reason why predicting the future is such a futile activity: the things that have the biggest effect come from unexpected sources, and affect things in ways that are hard to anticipate. On a final note, it took about a minute from the time Tom mentioned the book to the time I had a copy from Amazon in the Kindle app on my iPad. Oh that accessing biodiversity data were that simple.

Sunday, July 01, 2012

Using orthographic projections to map organism distributions

For a project I'm currently working on I show organism distributions using data from GBIF, and I display that data on a map that uses the equirectangular projection. I've recently started to create a series of base maps using the GBIF colour scheme, which is simple but effective:

  • #666698 for the sea
  • #003333 for the land
  • #006600 for borders
  • yellow for localities

The distribution map is created by overlaying points on a bitmap background using SVG (see SVG specimen maps from SPARQL results for details). SVG is ideally suited to this because you can take the points, plot them in the x,y plane (where x is longitude and y is latitude) then use SVG transformations to move them to the proper place on the map.
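As a rough sketch of the transformation idea, the Python below writes out an SVG in which the points are given directly as longitude and latitude, and a single `transform` on the enclosing group moves them onto an equirectangular base map. This is not the code behind the maps here: the function names, the `basemap.png` file name and the example coordinates are all made up for illustration.

```python
def to_pixel(lat, lon, width=360, height=180):
    """Where a locality ends up on an equirectangular map of the given size."""
    x = width / 2 + lon * (width / 360.0)
    y = height / 2 - lat * (height / 180.0)  # SVG y grows downwards
    return (x, y)


def distribution_svg(localities, map_href="basemap.png", width=360, height=180):
    """Overlay (lat, lon) localities on a bitmap base map using an SVG transform."""
    sx = width / 360.0   # pixels per degree of longitude
    sy = height / 180.0  # pixels per degree of latitude

    # Points are plotted in the x,y plane with x = longitude and y = latitude.
    # The radius is also in degrees, so it is scaled along with the points.
    circles = "\n".join(
        f'    <circle cx="{lon}" cy="{lat}" r="1" fill="yellow" />'
        for lat, lon in localities
    )

    # translate() puts (0, 0) at the centre of the map; the negative y scale
    # flips the axis so that north is up.
    transform = f"translate({width / 2} {height / 2}) scale({sx} {-sy})"

    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'xmlns:xlink="http://www.w3.org/1999/xlink" '
        f'width="{width}" height="{height}">\n'
        f'  <image xlink:href="{map_href}" x="0" y="0" '
        f'width="{width}" height="{height}" />\n'
        f'  <g transform="{transform}">\n'
        f"{circles}\n"
        f"  </g>\n"
        f"</svg>"
    )


# Illustrative localities only (latitude, longitude), not real GBIF records.
svg = distribution_svg([(-49.35, 70.2), (-54.0, -38.0)])
```

The appeal of doing it this way is that the point coordinates stay as plain longitude and latitude, so changing the size of the base map only means changing the one transform.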

For the base maps themselves I've also started to use SVG, partly because it's possible to edit them with a text editor (for example if you want to change the colours). I then use Inkscape to export the SVG to a PNG to use on the web site.


One thing that has bothered me about the equirectangular projection is that, although it is familiar and easy to work with, it gives a distorted view of the world:

This is particularly evident for organisms that have a circumpolar distribution. For example, Kerguelen's petrel Aphrodroma has a distribution that looks like this using the equirectangular projection:


This long, thin distribution looks rather different if we display it on a polar projection:

Likewise, classic Gondwanic distributions such as that of Gripopterygidae become clearer on a polar projection.


Computing the polar coordinates for a set of localities is straightforward (see for example this page) and using SVG to lay out the points also helps, because it's trivial to rotate them so that they match the orientation of the map. Ultimately it would be nice to have an embedded, rotatable 3D globe (like the Google Earth plugin, or a Javascript+SVG approach like this). But for now I think it's nice to have the option of using different projections available to help display distributions more faithfully.
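For what it's worth, here is a minimal Python sketch of the calculation, assuming a polar orthographic projection of a unit sphere (the standard formulas, not necessarily the ones on the page linked above). The function name and its parameters are invented for the example, and `central_lon` stands in for the rotation that could equally be done with an SVG transform.

```python
import math


def polar_orthographic(lat, lon, pole="south", central_lon=0.0, radius=1.0):
    """Project a locality onto a polar orthographic map centred on a pole.

    Returns (x, y) with y pointing up, or None if the locality is on the
    hemisphere that faces away from the viewer.
    """
    sign = 1.0 if pole == "north" else -1.0

    # Only the hemisphere around the chosen pole is visible.
    if sign * lat < 0:
        return None

    phi = math.radians(lat)
    lam = math.radians(lon - central_lon)

    # Distance from the pole on the map shrinks to zero at the pole itself
    # and reaches the full radius at the equator.
    r = radius * math.cos(phi)

    x = r * math.sin(lam)
    y = -sign * r * math.cos(lam)
    return (x, y)


# Illustrative localities only (latitude, longitude), not real GBIF records.
localities = [(-49.35, 70.2), (-54.0, -38.0), (45.0, 0.0)]
projected = [polar_orthographic(lat, lon, pole="south") for lat, lon in localities]
```

Because SVG's y axis points down, the y values would need to be negated (or the group flipped with a transform) before plotting; localities that come back as `None` are simply skipped.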

The bitmap maps and their SVG sources are available on github.

Friday, June 29, 2012

Planet management, GBIF, and the future of biodiversity informatics


Next week I'm in Copenhagen for GBIC, the Global Biodiversity Informatics Conference. The goal of the conference is to:
...convene expertise in the fields of biodiversity informatics, genomics, earth observation, natural history collections, biodiversity research and policy needed to set such collaboration in motion.

The collaboration referred to is the agreement to mobilise data and informatics capability to meet the Aichi Biodiversity Targets.

I confess I have mixed feelings about the upcoming meeting. There will be something like 100 people attending the conference, with backgrounds ranging from pure science to intergovernmental policy. It promises to be interesting, but whether a clear vision of the future of biodiversity informatics will emerge is another matter.

GBIC is part of the process of "planet management", a phrase that's been around for a while, but which I only came across in Bowker's essay "Biodiversity Datadiversity"1:

Bowker, G. C. (2000). Biodiversity Datadiversity. Social Studies of Science, 30(5), 643–683. doi:10.1177/030631200030005001

Bowker's essay is well worth a read, not least for the choice quotes such as:

Each particular discipline associated with biodiversity has its own incompletely articulated series of objects. These objects each enfold an organizational history and subtend a particular temporality or spatiality. They frequently are incompletely articulated with other objects, temporalities and spatialities — often legacy versions, when drawing on non-proximate disciplines. If one wants to produce a consistent, long-term database of biodiversity-relevant information the world over, all this sounds like an unholy mess. At the very least it suggests that global panopticons are not the way to go in biodiversity data. (p. 675, emphasis added)

I have not, in general, questioned the mania to name which is rife in the circles whose work I have described. There is no absolutely compelling connection between the observation that many of the world’s species are dying and the attempt to catalogue the world before they do. If your house is on fire, you do not necessarily stop to inventory the contents before diving out the window. However, as Jack Goody (1977) and others have observed, list-keeping is at the heart of our body politic. It is also, by extension, at the heart of our scientific strategies. Right or wrong, it is what we do. (p. 676, emphasis added)

Given that I'm a fan of the notion of a "global panopticon", and spend a lot of time fussing with lists of names, I find Bowker's views refreshing. Meantime, roll on GBIC2012.

1. Bowker cites Elichirigoity as a source of the term "planet management":

Fernando Elichirigoity (1999), Planet Management: Limits to Growth, Computer Simulations, and the Emergence of Global Spaces (Evanston, IL: Northwestern University Press). ISBN 0810115875 (Google Books oP3wVnKpGDkC).

From the limited Google preview, and the review by Edwards, this looks like an interesting book:

Edwards, P. (2000). Book Review: Planet Management: Limits to Growth, Computer Simulation, and the Emergence of Global Spaces, by Fernando Elichirigoity. Isis, 91(4), 828. doi:10.1086/385020 (PDF here)