Wednesday, December 17, 2014

The Natural History Museum launches their data portal

XVlUOuC5The Natural History Museum has released their data portal ( As of now it contains 2,439,827 of the Museum's 80 million specimens, so it's still early days. I gather that soon this data will also appear in GBIF, ending the unfortunate situation where data from one of the premier natural history collections in the world was conspicuous by its absence.

I've not had a chance to explore it in much detail, but one thing I'm keen to do is see whether I can link citations of NHM specimens in the literature (e.g., articles in BioStor) with records in the NHM portal. Being able to dip this would enable all sorts of cool things, such as being able to track what researchers have said about particular specimens, as well as develop citation metrics for the collection.


Is DNA barcoding dead?

On a recent trip to the Natural History Museum, London, the subject of DNA barcoding came up, and I got the clear impression that people at the NHM thought classical DNA barcoding was pretty much irrelevant, given recent developments in sequencing technology. For example, why sequence just COI when you can use shotgun sequencing to get the whole mitogenome? I was a little taken aback, although this is a view that's getting some traction, e.g. [1,2]. There is also the more radical view that focussing on phylogenetics is itself less useful than, say, "evolutionary gene networks" based on massive sequencing of multiple markers [3].

At the risk of seeming old-fashioned in liking DNA barcoding, I think there's a bigger issue at stake (see also [4]). DNA barcoding isn't simply a case of using a single, short marker to identify animal species. It's the fact that it's a globalised, standardised approach that makes it so powerful. In the wonderful book "A Vast Machine" [5], Paul Edwards talks about "global data" and "making data global". The idea is that not only do we want data that is global in coverage ("global data"), but we want data that can be integrated ("making data global"). In other words, not only do we want data from everywhere in the world, say, we also need an agreed coordinate system (e.g., latitude and longitude) in order to put each data item in a global context. DNA barcoding makes data global by standardising what a barcode is (a given fragment of COI), and what metadata needs to be associated with a sequence to be a barcode (e.g., latitude and longitude) (see, e.g. Guest post: response to "Putting GenBank Data on the Map"). By insisting on this standardisation, we potentially sacrifice the kinds of cool things that can be done with metagenomics, but the tradeoff is that we can do things like put a million barcodes on a map:

To regard barcoding as dead or outdated we'd need an equivalent effort to make metagenomic sequences of animals global in the same way that DNA barcoding is. Now, it may well be that the economics of sequencing is such that it is just as cheap to shotgun sequence mitogenomes, say, as to extract single markers such as COI. If that's the case, and we can get a standardised suite of markers across all taxa, and we can do this across museum collections (like Hebert et al.'s [6] DNA barcoding "blitz" of 41,650 specimens in a butterfly collection), then I'm all for it. But it's not clear to me that this is the case.

This also leaves aside the issue of standardising other things's much as the metadata. For instance, Dowton et al. [2] state that "recent developments make a barcoding approach that utilizes a single locus outdated" (see Collins and Cruickshank [4] for a response). Dowton et al. make use of data they published earlier [7,8]. Out of curiosity I looked at some of these sequences in GenBank, such as JN964715. This is a COI sequence, in other words, a classical DNA barcode. Unfortunately, it lacks a latitude and longitude. By leaving off latitude and longitude (despite the authors having this information, as it is in the supplemental material for [7]) the authors have missed an opportunity to make their data global.

For me the take home message here is that whether you think DNA barcoding is outdated depends in part what your goal is. Clearly barcoding as a sequencing technology has been superseded by more recent developments. But to dismiss it on those grounds is to miss the bigger picture of what is a stake, namely the chance to have comparable data for millions of samples across the globe.


  1. TAYLOR, H. R., & HARRIS, W. E. (2012, February 22). An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. Molecular Ecology Resources. Wiley-Blackwell. doi:10.1111/j.1755-0998.2012.03119.x
  2. Dowton, M., Meiklejohn, K., Cameron, S. L., & Wallman, J. (2014, March 28). A Preliminary Framework for DNA Barcoding, Incorporating the Multispecies Coalescent. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu028
  3. Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biol Direct. Springer Science + Business Media. doi:10.1186/1745-6150-5-47
  4. Collins, R. A., & Cruickshank, R. H. (2014, August 12). Known Knowns, Known Unknowns, Unknown Unknowns and Unknown Knowns in DNA Barcoding: A Comment on Dowton et al. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu060
  5. Edwards, Paul N. A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. MIT Press ISBN: 9780262013925
  6. Hebert, P. D. N., deWaard, J. R., Zakharov, E. V., Prosser, S. W. J., Sones, J. E., McKeown, J. T. A., Mantle, B., et al. (2013, July 10). A DNA “Barcode Blitz”: Rapid Digitization and Sequencing of a Natural History Collection. (S.-O. Kolokotronis, Ed.)PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0068535
  7. Meiklejohn, K. A., Wallman, J. F., Pape, T., Cameron, S. L., & Dowton, M. (2013, October). Utility of COI, CAD and morphological data for resolving relationships within the genus Sarcophaga (sensu lato) (Diptera: Sarcophagidae): A preliminary study. Molecular Phylogenetics and Evolution. Elsevier BV. doi:10.1016/j.ympev.2013.04.034
  8. Meiklejohn, K. A., Wallman, J. F., Cameron, S. L., & Dowton, M. (2012). Comprehensive evaluation of DNA barcoding for the molecular species identification of forensically important Australian Sarcophagidae (Diptera). Invertebrate Systematics. CSIRO Publishing. doi:10.1071/is12008

Tuesday, December 09, 2014

Guest post: Top 10 species names and what they mean

The following is a guest post by Bob Mesibov. Bob

The i4Life project has very kindly liberated Catalogue of Life (CoL) data from its database, and you can now download the latest CoL as a set of plain text, tab-separated tables here.

One of the first things I did with my download was check the 'taxa.txt' table for species name popularity*. Here they are, the top 10 species names for animals and plants, with their frequencies in the CoL list and their usual meanings:


2732 gracilis = slender
2373 elegans = elegant
2231 bicolor = two-coloured
2066 similis = similar
1995 affinis = near
1937 australis = southern
1740 minor = lesser
1718 orientalis = eastern
1708 simplex = simple
1350 unicolor = one-coloured


1871 gracilis = slender
1545 angustifolia = narrow-leaved
1475 pubescens = hairy
1336 parviflora = few-flowered
1330 elegans = elegant
1324 grandiflora = large-flowered
1277 latifolia = broad-leaved
1155 montana = (of a) mountain
1124 longifolia = long-leaved
1102 acuminata = pointed

Take the numbers cum grano salis. The first thing I did with the CoL tables was check for duplicates, and they're there, unfortunately. It's interesting, though, that gracilis tops the taxonomists' poll for both the animal and plant kingdoms.

*With the GNU/Linux commands

awk -F"\t" '($11 == "Animalia") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head
awk -F"\t" '($11 == "Plantae") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head

Tuesday, December 02, 2014

GBIF Ebbe Nielsen Challenge

The GBIF Ebbe Nielsen Challenge is open! From the official announcement
The GBIF Secretariat has launched the inaugural GBIF Ebbe Nielsen Challenge, hoping to inspire innovative applications of open-access biodiversity data by scientists, informaticians, data modelers, cartographers and other experts.
First prize is €20,000, full details on prizes and entry requirements are on the Challenge web site. To judge the entries GBIF has assembled a panel of judges comprising people both inside and outside GBIF and its advisory committees:

Large  3 Lucas Joppa
 Scientist, Computational Ecology and Environmental Sciences Group / Microsoft Research
Large  2 Mary Klein
 President & CEO / NatureServe
Download Tanya Abrahamse
 CEO / SANBI: South African National Biodiversity Institute
Large  1 Arturo H. Ariño
 Professor of Ecology / University of Navarra
Large Roderic Page (that's me)
 Professor of Taxonomy / University of Glasgow

This is the first time we've run the challenge, so the topic is wide open. Below I've put together some ideas that are simply designed to get you thinking (and are in no way intended to limit the sort of things that could be entered).

400px GOS weighted unifrac fullEvolutionary trees
Increasingly DNA sequences from DNA barcoding and metabarcoding are being used to study biodiversity. How can we integrate that data into GBIF? Can we decorate GBIF maps with evolutionary trees?
GoogleforestChange over timeGlobal Forest Watch is an impressive example of how change in the biosphere can be monitored over time. Can we do something similar with GBIF data? Alternatively, if the level of temporal or spatial resolution in GBIF data isn't high enough, can we combine these sources in some way?
GBIF has started to provide
graphical summaries of its data
, and there is lots to be done in this area. Can we have a Google Analytics-style summary of GBIF data?

This merely scratches the surface of what could be done, and indeed one of the reasons for having the challenge is to start a conversation about what can be done with half a billion data records.

Sunday, November 23, 2014

Automatically extracting possible taxonomic synonyms from the literature

Quick notes on an experimental feature I've added to BioNames. It attempts to identify possible taxonomic synonyms by extracting pairs of names with the same species name that appear together on the same page of text. The text could be full text for an open access article, OCR text from BHL, or the title and abstract for an article. For example, the following paper creates a new combination, Hadwenius tursionis, for a parasite of the bottlenose dolphin. This name is a synonym of Synthesium tursionis.

Fernández, M., Balbuena, J. A., & Raga, J. A. (1994, July). Hadwenius tursionis (Marchi, 1873) n. comb. (Digenea, Campulidae) from the bottlenose dolphin Tursiops truncatus (Montagu, 1821) in the western Mediterranean. Syst Parasitol. Springer Science + Business Media. doi:10.1007/bf00009519

The taxonomic position of Synthesium tursionis (Marchi, 1873) (Digenea, Campulidae) is revised, based on material from 147 worms from four bottlenose dolphins Tursiops truncatus stranded off the Comunidad Valenciana (Spanish western Mediterranean). The species is transferred to Hadwenius, as H. tursionis n. comb., and characterised by a high length/width ratio of the body, spinose cirrus and unarmed metraterm. Synthesium, a monotypic genus, becomes a synonym of Hadwenius. The intraspecific variation of some morphological traits is briefly discussed.

If we extract taxonomic names from the title and abstract we have the pair (Synthesium tursionis, Hadwenius tursionis). If we do this across all the text currently in BioNames then we discover other pairs of names that include Synthesium tursionis, joining these together we can create a graph of co-occurrence of names that are synonyms (see Synthesium tursionis).

Synthesium tursionisHadwenius tursionisDicrocoelium tursionisDistomum tursionisOrthosplanchnus tursionisSynthesium (Orthosplanchnus) tursionis
These graphs are computed automatically, and there is inevitably scope for error. Taxa that are not synonyms may have the same specific name (e.g., parasites and hosts may have the same specific name), and some of the names extracted from the text may be erroneous. At the same time, anecdotally it is a useful way to discover links between names. Even better, this approach means that we have the associated evidence for each pair of names. The interface in BioNames lists the references that contain the pairs of names, so you can evaluate the evidence for synonymy. It would be useful to try and evaluate the automatically detected synonyms by comparisons with existing lists of synonyms (e.g., from GBIF).

Tuesday, October 21, 2014

On identifiers (again)

I'm going to the TDWG Identifier Workshop this weekend, so I thought I'd jot down a few notes. The biodiversity informatics community has been at this for a while, and we still haven't got identifiers sorted out.

From my perspective as both a data aggregator (e.g., BioNames) and a data provider (e.g., BioStor) there are four things I think we need to tackle in order to make significant progress.

Discoverability (strings to things)

A basic challenge is to go from strings, such as bibliographic citations, specimen codes, taxonomic names, etc., to digital identifiers for those things. Most of our data is not born digital, and so we spend a lot of time mapping strings to identifiers. For example, publishers do this a lot when they take the list of literature cited at the end of a manuscript and add DOIs. Hence, one of the first things CrossRef did was provide a discovery service for publishers. This has now morphed into a very slick search tool Without discoverabilty, nobody is going to find the identifiers in the first place.


Given an identifier it has to be resolvable (for both people and machines), and I'd argue that at least in the early days of getting that identifier accepted, there needs to be a single point of resolution. Some people are arguing that we should separate identifiers from their resolution, partly based on arguments that "hey, we can always Google the identifier". This argument strikes me as wrong-headed for a several of reasons.

Firstly, Google is not a resolution service. There's no API, so it's not scalable. Secondly, if you Google an identifier (e.g., 10.7717/peerj.190) you get a bunch of hits, which one is the definitive source of information on the thing with that identifier? It's not at all obvious, and indeed this is one of the reasons publishers adopted DOIs in the first place. If you Google a paper you can get all sorts of hits and all sorts of versions (preprint, manuscripts, PDFs on multiple servers, etc.). In contrast the DOI gives you a way to access the definitive version.

Another way of thinking about this is in terms of trust. At some point down the road we might have tools that can assess the trust worthiness of a source, and we will need these if we develop decent tools to annotate data (see More on annotating biodiversity data: beyond sticky notes and wikis). But until then the simplest way to engender trust is to have a single point of resolution (like for DOIs). Think about how people now trust DOIs. They've become a mark of respectability for journals (no DOIs, you're not a serious journal), and new ideas such as citing diagrams and data gained further credence once sites like figshare started using DOIs.

Another reason resolvability matters is that I think it's a litmus test of how serious we are. One reason LSIDs failed is that we made them too hard to resolve, and as a consequence people simply minted "fake" LSIDs, dumb strings that didn't resolve. Nobody complained (because, let's face it, nobody was using them), so LSIDs became devalued to the point of uselessness. Anybody can mint a string and call it an identifier, if it costs nothing that's a good estimate of its actual value.


Resolvability leads to persistence. Sometimes we hear the cliche that "persistence is a social matter, not a technological one". This is a vacuous platitude. The kind of technology adopted can have a big impact on the sociology.

The easiest form of identifier is a simple HTTP URL. But let's think about what happens when we use them. If I spend a lot of time mapping my data to somebody else's URLs (e.g., links to papers or specimens) I am taking a big risk in assuming that the provider of those URLs will keep those "live". At the same time, in linking to those URLs, I constrain the provider - if they decide that their URL scheme isn't particularly good and want to change it (or their institution decides to move to new servers or a new domain), they will break resources like mine that link to them. So a decision they made about their URL structure - perhaps late one Friday afternoon in one of those meetings where everybody just wants to go to the pub - will come back to haunt them.

One way to tackle this is indirection, which is the idea behind DOIs and PURLs, for example. Instead of directly linking to a provider URL, we link to an intermediate identifier. This means that I have some confidence that all my hard work won't be undone (I have seen whole journals disappear because somebody redesigned an institutional web site), and the provider can mess with different technologies for serving their content, secure in the knowledge that external parties won't be affected (because they link to the intermediate identifier). Programmers will recognise this as encapsulation.

Some have argued that we can achieve persistence by simply insisting on it. For example, we fire off a memo to the IT folks saying "don't break these links!". Really? We have that degree of power over our institutional IT policies? This also misses the great opportunity that centralised indirection provides us with. In the case of DOIs for publications, CrossRef sits in the middle, managing the DOIs (in the sense that if a DOI breaks you have a single place to go and complain). Because they also aggregate all the bibliographic metadata, they are automatically able to support discoverability (they can easily map bibliographic metadata to DOIs). So by solving persistence we also solve discoverability.

Network effects

Lastly, if we are serious about this we need to think about how to engineer the widespread adoption of the identifier. In other words, I think we need network effects. When you join a social networking site, one of the first things they do is ask permission to see your "contacts" (who you already know). If any of those people are already on the network, you can instantly see that ("hey, Jane is here, and so is Bob"). Likewise, the network can target those you know who aren't on the network and prompt them to join.

If we are going to promote the use of identifiers, then it's no use thinking about simply adding identifiers to things, we need to think about ways to grow the network, ideally by adding networks at a time (like a person's list of contacts), not single records. CrossRef does this with articles: when publishers submit an article to CrossRef, they are encouraged to submit not just that article and it's DOI, but the list of all references in the list of literature cited, identified where possible by DOIs. This means CrossRef is building a citation graph, so it can quickly demonstrate value to its members (through cited-by linking).

So, we need to think of ways of demonstrating value, and growing the network of identifiers more rapidling than one identifier at a time. Otherwise, it is hard to see how it would gain critical mass. In the context of, say, specimens, I think an obvious way to do this is have services that tell a natural history collection how many times its specimens have been cited in the primary literature, or have been used as vouchers for DNA seqences. We can then generate metrics of use (as well as start to trace the provenance of our data).


I've no idea what will come out of the TDWG Workshop, but my own view is that unless we tackle these issues, and have a clear sense of how they interrelate, then we won't make much progress. These things are intertwined, and locally optimal solutions ("hey, it's easy, I'll just slap a URL on everything") aren't enough ("OK, how exactly do I find your URL? What happens when it breaks?"). If we want to link stuff together as part of the infrastructure of biodiversity informatics, then we need to think strategically. The goal is not to solve the identifier problem, the goal is to build the biodiversity knowledge graph.

Thursday, October 02, 2014

BioStor and JournalMap: a geographic interface to articles from the Biodiversity Heritage Library

The recent jump from ~11000 to ~17000 articles in JournalMap is mostly due to JournalMap ingesting content from my BioStor database. BioStor extracts articles from the Biodiversity Heritage Library (BHL), and in turn these get fed back into BHL as "parts" (you can see these in the "Table of Contents" tab when viewing a scanned volume in BHL).

In addition to extracting articles, BioStor pulls out latitude and longitude pairs mentioned in the OCR text and creates little Google Maps for articles that have geotagged content. Working with Jason Karl (@jwkarl), JournalMap now talks to BioStor and grabs all its geotagged articles so that you can browse them in JournalMap. As a consequence, journals such as Proceedings of The Biological Society of Washington now appear on their map (this journal is third most geotagged journal in JournalMap).

As an example of what you can do in JournalMap, here's a screenshot showing localities in Tanzania, and an article from BioStor being displayed:

JournalMap is an elegant interface to the biodiversity literature, and adding BioStor as a source is a nice example of how the Biodiversity Heritage Library's content is becoming more widely used. BioStor would only be possible if BHL made its content and metadata available for easy downloading. This is a lesson I wish other projects would learn. Instead of focussing on building flash-looking portals, make sure (a) you have lots of content, and (b) make it easy for developers to get that content so they can do cool things with it. BHL does well in this regard — other projects, such as BHL-Europe, not so much.

Tuesday, September 23, 2014

Exploring the chameleon dataset: broken GBIF links and lack of georeferencing

Following on from the discussion of the African chameleon data, I've started to explore Angelique Hjarding's data in more detail. The data is available from figshare (doi:10.6084/m9.figshare.1141858), so I've grabbed a copy and put it in github. Several things are immediately apparent.

  1. There is a lot of ungeoreferenced data. With a little work this could be geotagged and hence placed on a map.
  2. There are some errors with the georeferenced data (chameleons in Soutb America or off the coast, a locality in Tanzania that is now in Ethiopia, etc.).
  3. Rather alarmingly, most of the URLs to GBIF records that Angelique gives in the dataset no longer resolve.

The last point is worrying, and reflects the fact that at present you can't trust GBIF occurrence URLs to be stable over time. Most of the specimens in Angelique's data are probably still in GBIF, but the GBIF occurrenceID (and hence URL) will have changed. This pretty much kills any notion of reproducibility, and it will require some fussing to be able to find the new URLs for these records.

That the GBIF occurrenceIDs are no longer valid also makes it very difficult to make use of any data cleaning I or anyone else attempts with this data. If I georeference some of the specimens, I can't simply tell GBIF that I've got improved data. Nor is it obvious how I would give this information to the original providers using, say VertNet's github repositories. All in all a mess, and a sad reflection on our inability to have persistent identifiers for occurrences.

To help explore the data I've created some GeoJSON files to get a sense of the distribution of the data. Here are the point localities, a few have clearly got issues.

I also drew some polygons around points for the same taxon, to get a sense of their distributions.

Taxa represent by less than three distinct localities are presented by place marker, the rest by polygons.

I'll keep playing with this data as time allows, and try to get a sense of how hard it would be to go from what GBIF provides to what is actually going to be useful.