Tuesday, February 21, 2012

Linking GBIF and Genbank

As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy. Specimen codes are not unique, are written in all sorts of ways, there are multiple codes for the same specimen (GenBank sequences may be associated with museum catalogue entries, or which field or collector numbers).

So why undertake what is fast looking like a hopeless task? There are several reasons:
  1. GBIF occurrences have a unique URL which we could potentially use as a unique, resolvable identifier for the corresponding specimen.
  2. Linking GenBank to GBIF would make it possible for GBIF to list sequences associated with a specimen, as well as the associated publication, which means we could demonstrate the "impact" of a specimen. In the simplest terms this could be the number of sequences and publications that use data from the specimen, more sophisticated approaches could use PageRank-like measures, see hdl:10101/npre.2008.1760.1.
  3. Having a unique identifier that is shared across different databases makes it easier to combine data from different sources. For example, if a sequence in GenBank lacks geographic coordinates but the voucher specimen in GBIF is georeferenced, we can use that information to locate the sequence in geographic space (and hence build geophylogenies or add spatial indexes to databases such as TreeBASE). Conversely, if the GenBank sequence is georeferenced but the GBIF record isn't we can update the GBIF record and possibly expand the range of the corresponding taxon (this was part of the motivation behind hdl:10101/npre.2009.3173.1.

As an example, below is the GBIF 1° density map for the frog Pristimantis ridens from GBIF, with the phylogeny from Wang et al.Phylogeography of the Pygmy Rain Frog (Pristimantis ridens) across the lowland wet forests of isthmian Central Americahttp://dx.doi.org/10.1016/j.ympev.2008.02.021 layered over it. I created the KML tree from the corresponding tree in TreeBASE using the tool I described earlier. You can grab the KML for the tree here.


As we'd expect, there is a lot of overlap in the two sources of data. If we investigate further, there are records that are in fact based on the same specimen. For example, if we download the GBIF KML file with individual placemarks we see that in the northern part of the range their are 15 GBIF occurrences that map onto the same point as one of the terminal taxa in the tree.


One of these 15 GBIF records (http://data.gbif.org/occurrences/244335848) is for specimen USNM 514547, which is the voucher specimen for EU443175. This gives us a link between the record in GBIF and the record in GenBank. It also gives us a URI we can use for the specimen http://data.gbif.org/occurrences/244335848 instead of the unresolvable and potentially ambiguous USNM 514547.

If we view the geophylogeny from a different vantage point we see numerous localities that don't have occurrences in GBIF.


Close inspection reveals that some of the specimens listed in the Wang et al. paper are actually in GBIF, but lack geographic coordinates. For example the OTU "Pristimantis ridens Nusagandi AJC 0211" has the voucher specimen FMNH 257697. This specimen is in GBIF as http://data.gbif.org/occurrences/57919777/, but without coordinates, so it doesn't appear on the GBIF map. However, both the Wang et al. paper and the GenBank record for the sequence from this specimen EU443164 give the latitude and longitude. In this example, GBIF gives us a unique identifier for the specimen, and GenBank provides data on location that GBIF lacks.

Part of GBIFs success is due to the relative ease of integrating data by taxonomic names (despite the problems caused by synonyms, homonyms, misspellings, etc.) or using spatial coordinates (which immediately enables integration with environmental data. But if we want to integrate at deeper levels then specimen records are the glue that connects GBIF (and its contributing data sources) to sequence databases, phylogenies, and the taxonomic literature (via lists of material exampled). This will not be easy, certainly for legacy data that cites ambiguous specimen codes, but I would argue that the potential rewards are great.