Thursday, December 08, 2016

iBOL DNA barcodes in GBIF

I've uploaded all the COI barcodes in the iBOL public data dumps into GBIF. This is an update of an earlier project that uploaded a small subset. That dataset (doi:10.15468/inygc6) has now been expanded to include some 2.7 million barcodes. In the new GBIF portal (work in progress) the map for these barcodes looks like this:

[Screenshot: map of the barcodes in the new GBIF portal]

Many of these records have images of the specimens that were sequenced, and the new GBIF "gallery" feature displays these nicely, e.g.:

[Screenshot: specimen images in the new GBIF gallery view]

Having done this, I've a few thoughts.

Why did I do this?

Why did I do this, or, put another way, why didn't iBOL do this already? In an ideal world, iBOL would be an active contributor to GBIF and would be routinely uploading barcodes. Since this isn't happening, I've gone ahead and uploaded the barcodes myself. From my perspective, I want as much data as possible to be discoverable and accessible, hence if need be I'll grab data from wherever it lives and add it to GBIF (for an earlier example see The Zika virus, GBIF, and the missing mosquitoes). A downside of this is that, long term, the relationship between data provider and GBIF may be as valuable to GBIF as the data, and simply grabbing and reformatting data doesn't, by itself, form that relationship. But in the absence of a working relationship I still need the data.

Where are the taxonomic names?

Lots of barcodes lack formal scientific names, even though in many cases BOLD has them. The data in the public dumps often lacks this information. A next logical step would be to harvest data from the BOLD API and add taxonomic names as "identifications".
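
As a rough sketch of what that step might look like, the Python below turns tab-delimited specimen data into identification records, one per barcode, using the most specific name available. It deliberately does no network access. The column names (processid, species_name, genus_name, family_name, order_name, identification_provided_by) are assumptions about the BOLD export format, the use of the process ID as the occurrence identifier is also an assumption, and the sample rows are invented for illustration.

```python
import csv
import io

# Assumed column names, most specific rank first; check against the real export.
RANK_COLUMNS = [
    ("species", "species_name"),
    ("genus", "genus_name"),
    ("family", "family_name"),
    ("order", "order_name"),
]


def identifications_from_tsv(tsv_text):
    """Return one Darwin Core-style identification per row that has a name."""
    records = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for rank, column in RANK_COLUMNS:
            name = (row.get(column) or "").strip()
            if not name:
                continue
            records.append({
                # Assumption: the process ID doubles as the occurrence identifier.
                "occurrenceID": row["processid"],
                "scientificName": name,
                "taxonRank": rank,
                "identifiedBy": (row.get("identification_provided_by") or "").strip(),
            })
            break  # keep only the most specific name for this barcode
    return records


if __name__ == "__main__":
    # Invented rows, for illustration only.
    header = ["processid", "species_name", "genus_name", "family_name",
              "order_name", "identification_provided_by"]
    rows = [
        ["EXAMPLE001-16", "Aedes aegypti", "Aedes", "Culicidae", "Diptera", "A. Nother"],
        ["EXAMPLE002-16", "", "", "Culicidae", "Diptera", ""],
        ["EXAMPLE003-16", "", "", "", "", ""],
    ]
    sample = "\n".join("\t".join(r) for r in [header] + rows) + "\n"
    for record in identifications_from_tsv(sample):
        print(record)
```

Barcodes with no name at any rank are skipped rather than given an empty identification, so they would stay in GBIF exactly as they are now.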

Where are the sequences?

The sequences themselves aren't in GBIF, which on the one hand is not surprising as GBIF isn't a sequence database. However, I think it should be, in the sense that for a lot of biodiversity, sequences are going to be the only way forward. This includes the eukaryote barcodes, bacterial sequences, and metabarcodes. Fundamentally sequences are just strings of letters, and GBIF already handles those (e.g., taxonomic names, geographic places, etc.). Furthermore, the following paper by Bittner et al. makes a strong case that rather than knowing "what is there?" it's more important to know "what are they doing?"

Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biology Direct. Springer Nature.

In other words, a functional approach may matter more than a purely taxonomic approach to diversity. For a big chunk of biology this is going to depend on analysing sequences. Even if we restrict ourselves to just taxonomic diversity, there is scope for expanding our notion of what we display once we have sequences and evolutionary trees, e.g. Notes on next steps for the million DNA barcodes map.