Thursday, December 22, 2016

DNA barcoding taxonomy now in GBIF

220px The Face of a Lupine BlueFollowing on from adding DNA barcodes to GBIF I've now uploaded a taxonomic classification of DNA barcode BINs (Barcode Index Numbers). Each BIN is a cluster of similar DNA barcodes that is essentially equivalent to a species. For more details see:

Ratnasingham, S., & Hebert, P. D. N. (2013, July 8). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.), PLoS ONE. Public Library of Science (PLoS). https://doi.org/10.1371/journal.pone.0066213
The data I've uploaded was obtained by screen scraping the BOLD web site for each BIN in the DNA barcode dataset (BOLD's API doesn't let me get all the information I want). In addition to the taxonomic hierarchy associated with each BIN I've also extracted any publications mentioned on the BIN page, and subsequently tried to link those to the corresponding DOI, if the publication has one. The code for all this is available on GitHub https://github.com/rdmpage/bold-bins, which also serves as the host for the Darwin Core Archive for this dataset. There's a neat trick where you can use a .gitattributes file to tell GitHub not store certain files in the zip file it creates for the repository (see Excluding files from git archive exports using gitattributes by @fmarier).

Having done this, I've a few thoughts.

Please, please use DOIs for articles

BOLD pages for BINs often include one or more papers that published the barcodes included in that BIN. This is great, but often links to these papers are pretty strange:

If you are going to store literature in a database treat links to articles with great care. they are often full of extraneous stuff that depends on how the user reached that article online. DOIs greatly simplify this process. Instead of a URL like http://onlinelibrary.wiley.com/store/10.1111/j.1755-0998.2009.02650.x/asset/j.1755-0998.2009.02650.x.pdf?v=1&t=hellc54c&s=e14bbc4146b66a051ad5cd1f5361ac2e16dc5831&systemMessage=Pay+Per+View+will+be+unavailable+for+upto+3+hours+from+06%3A00+EST+March+23rd+ (I kid you not) you should use the DOI 10.1111/j.1755-0998.2009.02650.x.

Adding DOIs to these articles means GBIF will display them on the corresponding species page, for example Centromerus sylvaticus (Blackwall, 1841) has links to these two papers:

Telfer, A., deWaard, J., Young, M., Quinn, J., Perez, K., Sobel, C., … Hebert, P. (2015, August 30). Biodiversity inventories in high gear: DNA barcoding facilitates a rapid biotic survey of a temperate nature reserve. Biodiversity Data Journal. Pensoft Publishers. https://doi.org/10.3897/bdj.3.e6313
Blagoev, G. A., deWaard, J. R., Ratnasingham, S., deWaard, S. L., Lu, L., Robertson, J., … Hebert, P. D. N. (2015, July 26). Untangling taxonomy: a DNA barcode reference library for Canadian spiders. Molecular Ecology Resources. Wiley-Blackwell. https://doi.org/10.1111/1755-0998.12444
Now GBIF users can easily explore what we know about barcodes from this species by going directly to the primary literature.

Dark taxa

In an earlier post I discussed dark taxa, which are taxa that lack formal scientific names. BOLD is full of these, so many of the taxa I've added to GBIF don't have Linnean names. Instead I've used a combination of higher taxon name and the BIN itself.

Composite taxa

Having said that BINs are essentially the same as species, this need not imply that there's a one-to-one match between BINs and currently recognised species (indeed, this is of the things that makes barcoding so interesting, it's ability to discover hidden variation without taxa currently considered to be a single species). This means that some BINs will have the same name (significant variation within a species), and some BINs will have multiple names (more than one species name assigned to the same BIN). For example, BOLD:AAA2525 is a cluster of DNA barcodes with the following names attached:
  • Icaricia lupini
  • Icaricia acmon
  • Icaricia neurona
  • Plebejus lupini
  • Aricia sp. RV-2009
  • Aricia acmon
  • Plebejus acmon
  • Plebejus elvira
  • Icaricia lupini texanus
  • Icaricia lupini monticola
  • Icaricia lupini chlorina
  • Icaricia lupini lupini
  • Icaricia lupini alpicola
This cluster of names includes subspecies, synonyms (e.g. ). Looking at the phylogeny for this BIN (PDF-only) some of these names are intermingled suggesting that some specimens might be misidentified, apparently Icaricia lupini and I. acmon are very similar:
Coutsis, J. G. (2011). The male genitalia of N American Icaricia lupini and I. acmon; how they differ from each other and how they compare to those of the other two members of the group, I. neurona and I. shasta (Lepidoptera: Lycaenidae, Polyommatiti). Phegea, 39(4), 144-151. Retrieved from http://biostor.org/reference/160269

Summary

This is a first attempt to integrate DNA barcode taxonomy into GBIF, so there are going to be some issues to explore. GBIF currently assumes taxa can be easily mapped to a Linnean hierarchy. While this is ultimately likely to be true for animal COI barcodes, getting there is going to be messy while we have numerous dark taxa and/or BINs which don't match the current identifications of the voucher specimens.

Perhaps it's worth asking whether attempt to fit the results of DNA barcoding into a classical taxonomy is the best way forward. In doing so we loose much of what makes barcodoing so powerful, namely a specimen-level phylogenetic tree. Maybe what we should be really thinking about is ways to explore barcoding data natively. See Notes on next steps for the million DNA barcodes map for some thoughts on how to do that.

Image from Wikimedia Commons The Face of a Lupine Blue by Ingrid Taylar.