Friday, July 20, 2018

Signals from Singapore: NGS barcoding, generous interfaces, the return of faunas, and taxonomic burden

Earlier this year I stopped over in Singapore, home of the spectacular "supertrees" in the Gardens by the Bay. The trip was a holiday, but I spent a good part of one day visiting Rudolf Meier's group at the National University of Singapore. Chatting with Rudolf was great fun: he's opinionated and not afraid to share those opinions with anyone who will listen. Belatedly, I've written up some of the topics we discussed.

Massively scalable and cheap DNA barcoding

Singapore has a rich fauna in a small area, full of undescribed species, so DNA barcoding seems an obvious way to get a handle on its biodiversity. Rudolf has been working towards scalable and cheap barcoding, e.g. "$1 DNA barcodes for reconstructing complex phenomes and finding rare species in specimen-rich samples" https://doi.org/10.1111/cla.12115. His lab can sequence short (~300 bp) barcodes for around $US 0.50 per specimen. Their pipeline generates lots of data, accompanied by high quality photographs of exemplar specimens, which contribute to The Biodiversity of Singapore, a "Digital Reference Collection for Singapore's Biodiversity". This site provides a simple but visually striking way to explore Singapore's biota, and is a nice example of what Mitchell Whitelaw calls "generous interfaces". We could do with more of these for biodiversity data.

[Screenshot: The Biodiversity of Singapore website]

One nice feature of regular COI DNA barcodes is that they are comparable across labs because everyone is sequencing the same stretch of DNA. With short barcodes, different groups may target different regions of the COI gene, resulting in sequences that can't be compared. For example, the 127 bp mini barcodes developed in "A universal DNA mini-barcode for biodiversity analysis" https://doi.org/10.1186/1471-2164-9-214 are completely disjoint from the ~300 bp region sequenced by Meier's group (I'm trying to keep track of some of these short barcodes here: https://gist.github.com/rdmpage/4f2545eeea4756565925fb4307d9af6b).
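To make the comparability problem concrete, here is a minimal sketch in Python. The coordinates are made-up illustrations, not the actual primer positions used by either group; the point is simply that two regions with no overlap give you nothing to align.

```python
def overlap(region_a, region_b):
    """Return the number of base pairs shared by two (start, end) regions."""
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    return max(0, end - start)

# Hypothetical positions along the COI gene (start, end), in base pairs.
mini_barcode = (1, 128)     # a ~127 bp mini barcode near the 5' end
short_barcode = (400, 700)  # a ~300 bp barcode elsewhere in the gene

shared = overlap(mini_barcode, short_barcode)
print(f"Shared bases: {shared}")  # 0 -> nothing to align, sequences can't be compared
```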

The return of regional faunas

In the "old days" of colonial expansion it was common for taxonomists to write volumes entitled "The Fauna of [insert colonised country here]". These were regional works focussing on a particular area, often motivated by cataloguing animals of potential economic or medical importance, as well as of scientific interest. But by limiting their geographic scope, faunas can result in inadequate taxonomic treatments. Descriptions of new species from a particular area may be hard to compare with descriptions of species in the same group that occur elsewhere and were described by other taxonomists. To do the taxonomy of a particular group well you may need to treat that group throughout its geographic range, rather than just those species in your geographic area. Hence faunas lose their scientific appeal, despite the attractiveness of having a detailed summary of the biota of a particular area. DNA sequencing circumvents this problem because sequences are universally comparable characters. You can sequence everything within a geographic region, and those sequences will be directly comparable to sequences found elsewhere. Barcoding makes faunas attractive again, which may help fund taxonomic research, because projects with a restricted national scope become scientifically worthwhile once more.

Taxonomic burden and legacy names

As we discover and catalogue more and more of the planet's biodiversity we want to stick names on that biodiversity, and this can be a significant challenge when there is a legacy of names so poorly described that it is hard to establish how they relate to the material we are working with. Even if you have access to the primary literature through digitisation projects like BHL, if the descriptions are poor, if the types are lost or their identity is confused (see for example "A New Species of Megaselia Rondani (Diptera: Phoridae) from the Bioscan Project in Los Angeles, California, with Clarification of Confused Type Series for Two Other Species" https://doi.org/10.4289/0013-8797.118.1.93 by Emily A. Hartop, who I met on this trip, and colleagues), or the types can't be sequenced, then these names will remain ambiguous, and will potentially clog up efforts to name the unnamed species. One approach favoured by Rudolf is to effectively wipe the slate clean: declare all ambiguous names before a certain date null and void, and start again. This overturns (or rather, resets) the notion of priority (given two names for the same species, the older name is the one to use), and so is likely to be a hard sell, but it is part of the ongoing discussion about the impact of molecular data on naming taxa. Similar discussions are raging at the moment in mycology, e.g. "Ten reasons why a sequence-based nomenclature is not useful for fungi anytime soon" https://doi.org/10.5598/imafungus.2018.09.01.11, yet another reflection of how much taxonomy is driven by technology.

Thursday, July 05, 2018

GBIF at 1 billion - what's next?

GBIF has reached 1 billion occurrence records, which is, of course, something to celebrate.

An achievement on this scale represents a lot of work by many people over many years: years spent developing simple standards for sharing data, agreeing that sharing is a good thing in the first place, building tools to enable sharing, and creating a place to aggregate all that shared data (GBIF).

So, I asked a question: now that we have a billion records, what's next?

My point is not to diminish that achievement. Rather it is to encourage a discussion about what happens when we have large amounts of biodiversity data. Is it the case that as we add data we simply enable more of the same kind of science, only better (e.g., more data for species distribution modelling), or do we reach a point where new things become possible?


To give a concrete example, consider iNaturalist. This started out as a Master's project to collect photos of organisms on Flickr. As you add more images you get better coverage of biodiversity, but you still have essentially a bunch of pictures. But once you have LOTS of pictures, and those are labelled with species names, you reach the point where it is possible to do something much more exciting: automatic species identification. To illustrate, I recently took the photos below:

[Two photos of leaves with reddish tubular galls]

Note the reddish tubular growths on the leaves. I asked iNaturalist to identify these photos and within a few seconds it came back with Eriophyes tiliae, the Red Nail Gall Mite. This feels like magic. It doesn't rely on complicated analysis of the image (as many earlier efforts at automated identification have done); it simply "knows" that images like this are typically of the galls of this mite, because it has seen many such images before. (Another example of the impact of big data is Google Translate, initially based on mining lots of examples of the same text in multiple languages.)
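iNaturalist's actual system is a deep neural network trained on its huge pool of labelled photos, but the underlying idea, labelling a new image by its resemblance to labelled images already seen, can be caricatured in a few lines. In this toy sketch the "embeddings" are random stand-ins for real image features, and the labels are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A "training set" of labelled image embeddings (in reality these would come
# from a model trained on millions of photos; here they are random stand-ins).
labels = ["Eriophyes tiliae gall", "oak leaf", "fern frond"]
embeddings = rng.normal(size=(3, 128))

def identify(query, embeddings, labels):
    """Return the label of the most similar labelled embedding (cosine similarity)."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    similarity = embeddings @ query / norms
    return labels[int(np.argmax(similarity))]

# A new photo whose embedding resembles the first labelled image.
new_photo = embeddings[0] + rng.normal(scale=0.1, size=128)
print(identify(new_photo, embeddings, labels))  # "Eriophyes tiliae gall"
```

The point of the caricature is that the "knowledge" lives entirely in the pile of labelled examples; scale, not clever per-image analysis, does the work.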

The "1 billion" number is not, by itself, meaningful. Rather, my hope is that while we're popping the champagne and celebrating a welcome, if somewhat arbitrary, milestone, someone, somewhere is thinking about whether biodiversity data on this scale enables something new.

Do I have answers? Not really, but here's one fairly small-scale example. One of the big challenges facing GBIF is getting georeferenced data. We spend a lot of time using a variety of tools and databases to convert text descriptions of collection localities into latitude and longitude. Many of these descriptions include phrases such as "5 mi NW of", and so we've developed parsers to attempt to make sense of these. All of these phrases and the corresponding latitude and longitude coordinates have ended up in GBIF. This raises the possibility that, after a point, pretty much any locality phrase will already be in GBIF, so one way to georeference a locality is simply to search GBIF for that locality and use the associated latitude and longitude: GBIF itself becomes the single best tool to georeference specimen data. To explore this idea I've built a simple tool on Glitch https://lyrical-money.glitch.me that takes a locality description and geocodes it using GBIF.

[Screenshot: the GBIF geocoding demo]

You paste in a locality string and it attempts to find that locality on a map based on data in GBIF. This could be automated, so you could imagine georeferencing whole collections as part of the process of uploading the data to GBIF. Yes, the devil is in the details, and we'd need ways to flag errors or doubtful records, but the scale of GBIF starts to open up possibilities like this.
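As a rough sketch of how such a service might work (this isn't the code behind the Glitch demo, and taking the median is just one crude way to pick a representative point), the public GBIF occurrence search API can be queried directly:

```python
import statistics
import requests

def georeference(locality, limit=50):
    """Estimate (latitude, longitude) for a locality string from GBIF occurrences."""
    response = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"q": locality, "hasCoordinate": "true", "limit": limit},
    )
    response.raise_for_status()
    records = response.json()["results"]
    lats = [r["decimalLatitude"] for r in records if "decimalLatitude" in r]
    lngs = [r["decimalLongitude"] for r in records if "decimalLongitude" in r]
    if not lats:
        return None
    # Median is more robust than the mean to a few badly georeferenced records.
    return statistics.median(lats), statistics.median(lngs)

print(georeference("5 mi NW of Las Vegas"))
```

A real version would need the error and doubtful-record flagging mentioned above, but the basic lookup really is this simple.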

So, my question is, "what's next?".