iPhylo: December 2016

Roderic D. M. Page

Friday, December 23, 2016

Taxonomic name timelines for BHL

Given a big corpus of literature one of the fun things to do is look at how the use of a term has changed over time. When did people first use a particular word? When did one word start to replace another, etc.? Google's Ngram Viewer is perhaps the best known tool for exploring these questions.

In the context of biodiversity doing something similar for BHL is an obvious thing to do. I've made various clunky attempts in the past (e.g., Biodiversity Heritage Library sparklines) but these all died.

Ryan Schenk (who did a lot of the user interface for my BioNames project) wrote a very stylish tool to display changes in names over time. Called "Synynyms", his tool is now defunct, but you can read about it here and the source code is on GitHub. Ryan would take a name, find synonyms, then graph the changes in use of all those names over time.

Image: Synynyms screenshot for Bison bison Linnaeus 1758

The death of Synynyms has not gone unnoticed:

@rdmpage Is there a service simiar to synynyms to check for nomen usage? synynyms is not available anymore :(
— JCAM (@PhyloJCAM) December 12, 2016

I've had a tool for my own use that searches BHL for a name and displays the results after first trying to aggregate the hits in a sensible way. For example, if there is more than one hit in a scanned volume, and those hits all fall on pages in the same article in BioStor, then I display the BioStor article instead of a list of each hit separately. Inspired by @PhyloJCAM's question I've built a simple tool to explore the use of one or more names over time.
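The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the tool: the `Hit` and `Article` records and their fields are hypothetical, and the sketch simplifies things by collapsing any hits that fall inside a known article's page range, even when there is only one.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Hit:
    volume_id: int  # hypothetical identifier for a scanned volume
    page: int       # hypothetical page position within that volume


@dataclass(frozen=True)
class Article:
    article_id: int  # hypothetical article identifier
    volume_id: int
    first_page: int
    last_page: int


def find_article(hit: Hit, articles: List[Article]) -> Optional[Article]:
    """Return the article whose page range contains this hit, if any."""
    for article in articles:
        in_volume = article.volume_id == hit.volume_id
        if in_volume and article.first_page <= hit.page <= article.last_page:
            return article
    return None


def aggregate_hits(
    hits: List[Hit], articles: List[Article]
) -> List[Tuple[str, Union[Hit, Article]]]:
    """Collapse hits inside the same article into a single entry.

    Hits that fall outside every known article are kept as page hits.
    """
    seen_articles = set()
    results: List[Tuple[str, Union[Hit, Article]]] = []
    for hit in hits:
        article = find_article(hit, articles)
        if article is None:
            results.append(("page", hit))
        elif article.article_id not in seen_articles:
            seen_articles.add(article.article_id)
            results.append(("article", article))
    return results
```

The linear scan in `find_article` is fine for a handful of articles per volume; a real implementation would more likely ask a database for the article covering a given page.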

Located in the "labs" section of BioStor, the BHL timeline takes one or more names and searches for those names in BHL, displaying the results as a chart and a list of hits. I often use it simply to search BHL for a particular name, but you can also use it to compare names, e.g. Aspidoscelis costata and Cnemidophorus costatus:

The timeline tool is pretty crude, and it's slow if there are lots of hits in BHL. So, it's not as slick as Synynyms (Ryan Schenk is a cleverer programmer than I am). Still, it is a useful way to explore BHL and discover articles that you might not have known existed.
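The counting behind a chart like this is simple once the hits are in hand. Here is a minimal sketch, assuming each hit has already been reduced to a (name, year) pair; the function names are made up for illustration and nothing here reflects how the BioStor tool actually stores its results.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def timeline(hits: Iterable[Tuple[str, int]]) -> Dict[str, Counter]:
    """Count hits per year for each name.

    `hits` is an iterable of (name, year) pairs (an assumed input shape).
    """
    counts: Dict[str, Counter] = {}
    for name, year in hits:
        counts.setdefault(name, Counter())[year] += 1
    return counts


def first_use(counts: Dict[str, Counter], name: str) -> Optional[int]:
    """Earliest year in which a name has at least one hit, or None."""
    if name not in counts:
        return None
    return min(counts[name])
```

Comparing two names is then a matter of plotting `counts[name_a]` and `counts[name_b]` on the same year axis.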

@rdmpage @ryanschenk working great with uninomials, one improvement could be to allow for regular expressions, but already awesme as it is! pic.twitter.com/W6TTRXwszM
— JCAM (@PhyloJCAM) December 15, 2016

Thursday, December 22, 2016

DNA barcoding taxonomy now in GBIF

Following on from adding DNA barcodes to GBIF, I've now uploaded a taxonomic classification of DNA barcode BINs (Barcode Index Numbers). Each BIN is a cluster of similar DNA barcodes that is essentially equivalent to a species. For more details see:

Ratnasingham, S., & Hebert, P. D. N. (2013, July 8). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.), PLoS ONE. Public Library of Science (PLoS). https://doi.org/10.1371/journal.pone.0066213

The data I've uploaded was obtained by screen scraping the BOLD web site for each BIN in the DNA barcode dataset (BOLD's API doesn't let me get all the information I want). In addition to the taxonomic hierarchy associated with each BIN I've also extracted any publications mentioned on the BIN page, and subsequently tried to link those to the corresponding DOI, if the publication has one. The code for all this is available on GitHub https://github.com/rdmpage/bold-bins, which also serves as the host for the Darwin Core Archive for this dataset. There's a neat trick where you can use a .gitattributes file to tell GitHub not to store certain files in the zip file it creates for the repository (see Excluding files from git archive exports using gitattributes by @fmarier).

Having done this, I've a few thoughts.

Please, please use DOIs for articles

BOLD pages for BINs often include one or more papers that published the barcodes included in that BIN. This is great, but often links to these papers are pretty strange:

Ever wonder why DOIs are nice? Take a look at this URL for an article :O This is why you want to store DOIs in databases, not links... pic.twitter.com/QzthcL8f7k
— Roderic Page (@rdmpage) December 20, 2016

If you are going to store literature in a database, treat links to articles with great care. They are often full of extraneous stuff that depends on how the user reached that article online. DOIs greatly simplify this process. Instead of a URL like http://onlinelibrary.wiley.com/store/10.1111/j.1755-0998.2009.02650.x/asset/j.1755-0998.2009.02650.x.pdf?v=1&t=hellc54c&s=e14bbc4146b66a051ad5cd1f5361ac2e16dc5831&systemMessage=Pay+Per+View+will+be+unavailable+for+upto+3+hours+from+06%3A00+EST+March+23rd+ (I kid you not) you should use the DOI 10.1111/j.1755-0998.2009.02650.x.
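When a DOI happens to be embedded in the path of a messy URL like the one above, it can sometimes be recovered automatically. The sketch below is a heuristic only, and `doi_from_url` is a name invented for illustration: it assumes the DOI prefix (`10.` followed by digits) sits in one path segment and the suffix in the next, so it will truncate DOIs whose suffix itself contains a slash.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlsplit

# A DOI prefix is "10." followed by a registrant code made of digits.
DOI_PREFIX = re.compile(r"^10\.\d{4,9}$")


def doi_from_url(url: str) -> Optional[str]:
    """Heuristically pull a DOI out of a URL path.

    Assumes the prefix and suffix occupy two adjacent path segments;
    the query string (session tokens, messages, etc.) is ignored.
    """
    segments = [unquote(s) for s in urlsplit(url).path.split("/") if s]
    for i, segment in enumerate(segments[:-1]):
        if DOI_PREFIX.match(segment):
            return f"{segment}/{segments[i + 1]}"
    return None
```

For the Wiley URL above this yields `10.1111/j.1755-0998.2009.02650.x`, and for a plain `https://doi.org/...` link it returns the DOI unchanged. URLs that carry no DOI in the path return `None`, and those still need a lookup against a bibliographic service.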

Adding DOIs to these articles means GBIF will display them on the corresponding species page, for example Centromerus sylvaticus (Blackwall, 1841) has links to these two papers:

Telfer, A., deWaard, J., Young, M., Quinn, J., Perez, K., Sobel, C., … Hebert, P. (2015, August 30). Biodiversity inventories in high gear: DNA barcoding facilitates a rapid biotic survey of a temperate nature reserve. Biodiversity Data Journal. Pensoft Publishers. https://doi.org/10.3897/bdj.3.e6313

Blagoev, G. A., deWaard, J. R., Ratnasingham, S., deWaard, S. L., Lu, L., Robertson, J., … Hebert, P. D. N. (2015, July 26). Untangling taxonomy: a DNA barcode reference library for Canadian spiders. Molecular Ecology Resources. Wiley-Blackwell. https://doi.org/10.1111/1755-0998.12444

Now GBIF users can easily explore what we know about barcodes from this species by going directly to the primary literature.

Dark taxa

In an earlier post I discussed dark taxa, which are taxa that lack formal scientific names. BOLD is full of these, so many of the taxa I've added to GBIF don't have Linnean names. Instead I've used a combination of higher taxon name and the BIN itself.

Composite taxa

Having said that BINs are essentially the same as species, this need not imply that there's a one-to-one match between BINs and currently recognised species (indeed, this is one of the things that makes barcoding so interesting: its ability to discover hidden variation within taxa currently considered to be a single species). This means that some BINs will have the same name (significant variation within a species), and some BINs will have multiple names (more than one species name assigned to the same BIN). For example, BOLD:AAA2525 is a cluster of DNA barcodes with the following names attached:

Icaricia lupini
Icaricia acmon
Icaricia neurona
Plebejus lupini
Aricia sp. RV-2009
Aricia acmon
Plebejus acmon
Plebejus elvira
Icaricia lupini texanus
Icaricia lupini monticola
Icaricia lupini chlorina
Icaricia lupini lupini
Icaricia lupini alpicola

This cluster of names includes subspecies and synonyms. Looking at the phylogeny for this BIN (PDF-only), some of these names are intermingled, suggesting that some specimens might be misidentified; apparently Icaricia lupini and I. acmon are very similar:

Coutsis, J. G. (2011). The male genitalia of N American Icaricia lupini and I. acmon; how they differ from each other and how they compare to those of the other two members of the group, I. neurona and I. shasta (Lepidoptera: Lycaenidae, Polyommatiti). Phegea, 39(4), 144-151. Retrieved from http://biostor.org/reference/160269
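The many-to-many relationship between BINs and names described in this section is easy to detect once the identifications are tabulated. This is a minimal sketch, assuming the data is available as (BIN, name) pairs; the function names are invented for illustration and this is not the code in the bold-bins repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def index_bins(
    records: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build both directions of the BIN <-> name mapping.

    `records` is an iterable of (bin_id, name) pairs (an assumed shape).
    """
    names_by_bin: Dict[str, Set[str]] = defaultdict(set)
    bins_by_name: Dict[str, Set[str]] = defaultdict(set)
    for bin_id, name in records:
        names_by_bin[bin_id].add(name)
        bins_by_name[name].add(bin_id)
    return names_by_bin, bins_by_name


def composite_bins(names_by_bin: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """BINs with more than one name attached."""
    return {b: sorted(n) for b, n in names_by_bin.items() if len(n) > 1}


def split_names(bins_by_name: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Names that are spread across more than one BIN."""
    return {n: sorted(b) for n, b in bins_by_name.items() if len(b) > 1}
```

Feeding in the pairs for BOLD:AAA2525 listed above would flag it as a composite BIN. Deciding whether the extra names are synonyms, subspecies, or misidentifications still needs a taxonomist.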

Summary

This is a first attempt to integrate DNA barcode taxonomy into GBIF, so there are going to be some issues to explore. GBIF currently assumes taxa can be easily mapped to a Linnean hierarchy. While this is ultimately likely to be true for animal COI barcodes, getting there is going to be messy while we have numerous dark taxa and/or BINs which don't match the current identifications of the voucher specimens.

Perhaps it's worth asking whether attempting to fit the results of DNA barcoding into a classical taxonomy is the best way forward. In doing so we lose much of what makes barcoding so powerful, namely a specimen-level phylogenetic tree. Maybe what we should really be thinking about is ways to explore barcoding data natively. See Notes on next steps for the million DNA barcodes map for some thoughts on how to do that.

Image from Wikimedia Commons: The Face of a Lupine Blue by Ingrid Taylar.

Thursday, December 08, 2016

iBOL DNA barcodes in GBIF

I've uploaded all the COI barcodes in the iBOL public data dumps into GBIF. This is an update of an earlier project that uploaded a small subset. Now that dataset doi:10.15468/inygc6 has been expanded to include some 2.7 million barcodes. In the new GBIF portal (work in progress) the map for these barcodes looks like this:

Many of these records have images of the specimens that were sequenced, and the new GBIF "gallery" feature displays these nicely, e.g.:

Having done this, I've a few thoughts.

Why did I do this?

Why did I do this, or, put another way, why didn't iBOL do this already? In an ideal world, iBOL would be an active contributor to GBIF and would be routinely uploading barcodes. Since this isn't happening, I've gone ahead and uploaded the barcodes myself. From my perspective, I want as much data to be as discoverable and as accessible as possible, hence if need be I'll grab data from wherever it lives and add it to GBIF (for an earlier example see The Zika virus, GBIF, and the missing mosquitoes). A downside of this is that, long term, the relationship between data provider and GBIF may be as valuable to GBIF as the data, and simply grabbing and reformatting data doesn't, by itself, form that relationship. But in the absence of a working relationship I still need the data.

Where are the taxonomic names?

Lots of barcodes lack formal scientific names, even though in many cases BOLD has them. The data in the public dumps often lacks this information. A next logical step would be to harvest data from the BOLD API and add taxonomic names as "identifications".

Where are the sequences?

The sequences themselves aren't in GBIF, which on the one hand is not surprising as GBIF isn't a sequence database. However, I think it should be, in the sense that for a lot of biodiversity, sequences are going to be the only way forward. This includes the eukaryote barcodes, bacterial sequences, and metabarcodes. Fundamentally sequences are just strings of letters, and GBIF already handles those (e.g., taxonomic names, geographic places, etc.). Furthermore, the following paper by Bittner et al. makes a strong case that rather than knowing "what is there?" it's more important to know "what are they doing?"

Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biology Direct. Springer Nature. https://doi.org/10.1186/1745-6150-5-47

In other words, a functional approach may matter more than a purely taxonomic approach to diversity. For a big chunk of biology this is going to depend on analysing sequences. Even if we restrict ourselves to just taxonomic diversity, there is scope for expanding our notion of what we display once we have sequences and evolutionary trees, e.g. Notes on next steps for the million DNA barcodes map.