Showing posts with label DNA barcoding. Show all posts

Tuesday, April 06, 2021

It's been a while...

It's been a while since I've blogged here. The last few months have been, um, interesting for so many reasons. Meanwhile in my little corner of the world there's been the constant challenge of rethinking how to teach online-only, whilst also messing about with a bunch of things in a rather unfocused way (and spending way too much time populating Wikidata). So here I'll touch on a few rather random topics that have come up in the last few months, and elsewhere on this blog I'll try and focus on some of the things that I'm working on. In many ways this blog post is really a series of bookmarks for things I'd like to think about a bit more.

Taxonomic precision and the "one tree"

One thing that had been bugging me for a while was my inability to find the source of a quote about taxonomic precision that I remembered from my time as a grad student. I was pretty sure that David Penny and Mike Hendy had said it, but where? Found it at last:

Biologists seem to seek “The One Tree” and appear not to be satisfied by a range of options. However, there is no logical difficulty with having a range of trees. There are 34,459,425 possible trees for 11 taxa (Penny et al. 1982), and to reduce this to the order of 10-50 trees is analogous to an accuracy of measurement of approximately one part in 10^6.

Many measurements in biology are only accurate to one or two significant figures and pale when compared to physical measurements that may be accurate to 10 significant figures. To be able to estimate an accuracy of one tree in 10^6 reflects the increasing sophistication of tree reconstruction methods. (Note that, on this argument, to identify an organism to a species is also analogous to a measurement with an accuracy of approximately one in 10^6.) From "Estimating the reliability of evolutionary trees", p. 414, doi:10.1093/oxfordjournals.molbev.a040407

I think this quote helps put taxonomy and phylogeny in the broader context of quantitative biology. Building trees that accurately place taxa is a computationally challenging task that yields some of the most precise measurements in biology.
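The arithmetic behind the quote is easy to check. Here is a minimal sketch, assuming the count refers to unrooted, fully bifurcating trees, for which the standard formula is (2n - 5)!! (the function name is mine, not from the paper):

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating trees on n_taxa
    labelled tips, i.e. the double factorial (2n - 5)!!."""
    if n_taxa < 3:
        return 1  # zero, one or two tips admit only one (trivial) tree
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):  # 3, 5, ..., 2n - 5
        count *= odd
    return count


total = unrooted_tree_count(11)
print(total)  # 34459425

# Narrowing the candidates to between 10 and 50 trees corresponds to a
# "precision" of roughly one part in a million.
for kept in (10, 50):
    print(f"{kept} trees kept: about one part in {total / kept:,.0f}")
```

For 11 taxa this reproduces the 34,459,425 figure in the quote, and dividing by 10 to 50 surviving trees gives ratios between about 7 x 10^5 and 3 x 10^6, which is the "one part in 10^6" order of magnitude.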

Barcodes for everyone

This is yet another exciting paper from Rudolf Meier's lab (see earlier blog post Signals from Singapore: NGS barcoding, generous interfaces, the return of faunas, and taxonomic burden). The preprint doi:10.1101/2021.03.09.434692 is on bioRxiv. It feels like we are getting ever-closer to the biodiversity tricorder.

Barcodes for Australia

Donald Hobern (@dhobern) has been blogging about insects collected in malaise traps in Aranda, Australian Capital Territory (ACT). The insects are being photographed (see stream on Flickr) and will be barcoded.

No barcodes please we're taxonomists!

A paper with a title like "Minimalist revision and description of 403 new species in 11 subfamilies of Costa Rican braconid parasitoid wasps, including host records for 219 species" (Sharkey et al. doi:10.3897/zookeys.1013.55600) was always likely to cause problems, and sure enough some taxonomists had a meltdown. A lot of the arguments centered around whether DNA sequences counted as words, which seems surreal. DNA sequences are strings of characters, just like natural language. Unlike English, not all languages have word breaks. Consider Chinese, for example, where search engines can't break text up into words for indexing, but instead use n-grams. I mention this simply because n-grams are a useful way to index DNA sequences and to compute sequence similarity without performing a costly sequence alignment. I used this technique in my DNA barcode browser. If we move beyond arguments about whether a picture and a DNA sequence are enough to describe a species (if all species ever discovered were described this way we'd arguably be much better off than we are now), I think there is a core issue here: the relative size of the intersection between taxa that have been described classically (i.e., with words) and those described almost entirely by DNA (e.g., barcodes) will likely drop as more and more barcoding is done, and this has implications for how we do biology (see Dark taxa: GenBank in a post-taxonomic world).

Bioschema

The dream of linked data rumbles on. Schema.org is having a big impact on standardising basic metadata encoded in web sites, so much so that anyone building a web site now needs to be familiar with schema.org if they want their site to do well in search engine rankings. I made extensive use of schema.org to model bibliographic data on Australian animals for my Ozymandias project.

Bioschemas aims to provide a biology-specific extension to schema.org, and is starting to take off. For example, GBIF pages for species now have schema.org embedded as JSON-LD, e.g. the page for Chrysochloris visagiei Broom, 1950 has this JSON-LD embedded in a <script type="application/ld+json"> tag:

{
  "@context": [
    "https://schema.org/",
    {
      "dwc": "http://rs.tdwg.org/dwc/terms/",
      "dwc:vernacularName": { "@container": "@language" }
    }
  ],
  "@type": "Taxon",
  "additionalType": [
    "dwc:Taxon",
    "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept"
  ],
  "identifier": [
    {
      "@type": "PropertyValue",
      "name": "GBIF taxonKey",
      "propertyID": "http://www.wikidata.org/prop/direct/P846",
      "value": 2432181
    },
    {
      "@type": "PropertyValue",
      "name": "dwc:taxonID",
      "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
      "value": 2432181
    }
  ],
  "name": "Chrysochloris visagiei Broom, 1950",
  "scientificName": {
    "@type": "TaxonName",
    "name": "Chrysochloris visagiei",
    "author": "Broom, 1950",
    "taxonRank": "SPECIES",
    "isBasedOn": {
      "@type": "ScholarlyArticle",
      "name": "Ann. Transvaal Mus. vol.21 p.238"
    }
  },
  "taxonRank": [
    "http://rs.gbif.org/vocabulary/gbif/rank/species",
    "species"
  ],
  "dwc:vernacularName": [
    { "@language": "eng", "@value": "Visagie s Golden Mole" },
    { "@language": "eng", "@value": "Visagie's Golden Mole" },
    { "@language": "eng", "@value": "Visagie's Golden Mole" },
    { "@language": "eng", "@value": "Visagie's Golden Mole" },
    { "@language": "", "@value": "Visagie's golden mole" },
    { "@language": "eng", "@value": "Visagie's Golden Mole" },
    { "@language": "deu", "@value": "Visagie-Goldmull" }
  ],
  "parentTaxon": {
    "@type": "Taxon",
    "name": "Chrysochloris Lacépède, 1799",
    "scientificName": {
      "@type": "TaxonName",
      "name": "Chrysochloris",
      "author": "Lacépède, 1799",
      "taxonRank": "GENUS",
      "isBasedOn": {
        "@type": "ScholarlyArticle",
        "name": "Tabl. Mamm. p.7"
      }
    },
    "identifier": [
      {
        "@type": "PropertyValue",
        "name": "GBIF taxonKey",
        "propertyID": "http://www.wikidata.org/prop/direct/P846",
        "value": 2432177
      },
      {
        "@type": "PropertyValue",
        "name": "dwc:taxonID",
        "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
        "value": 2432177
      }
    ],
    "taxonRank": [
      "http://rs.gbif.org/vocabulary/gbif/rank/genus",
      "genus"
    ]
  }
}
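To give a feel for how a consumer might use this markup, here is a rough sketch that pulls JSON-LD out of an HTML page using only the Python standard library. The class and function names are my own, and the embedded page is a trimmed, hand-made stand-in for a GBIF species page rather than a real download:

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_json_ld(html: str) -> list:
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.documents


# Trimmed, illustrative page: only a few of the properties shown above.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Taxon",
 "name": "Chrysochloris visagiei Broom, 1950",
 "identifier": [{"@type": "PropertyValue", "name": "GBIF taxonKey", "value": 2432181}],
 "parentTaxon": {"@type": "Taxon", "name": "Chrysochloris Lacépède, 1799"}}
</script></head><body></body></html>"""

for doc in extract_json_ld(page):
    keys = {i["name"]: i["value"] for i in doc.get("identifier", [])}
    print(doc["@type"], "|", doc["name"], "|", doc["parentTaxon"]["name"], "|", keys)
```

A real harvester would also want to expand the `@context` with a JSON-LD library; this sketch simply treats the markup as plain JSON.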

For more details on the potential of Bioschemas see Franck Michel's TDWG Webinar.

OCR correction

Just a placeholder to remind me to revisit OCR correction and the dream of a workflow to correct text for BHL. I came across hOCR-Proofreader (which has a GitHub repo). Internet Archive now provides hOCR files as one of its default outputs, so we're getting closer to a semi-automated workflow for OCR correction. For example, imagine having all this set up on GitHub so that people can correct text and push those corrections to GitHub. So close...

Roger Hyam keeps being awesome

Roger just keeps doing cool things that I keep learning from. In the last few months he's been working on a nice interface to the World Flora Online (WFO) which, let's face it, is horrifically ugly and does unspeakable things to the data. Roger is developing a nicer interface and is doing some cool things under the hood with identifiers that inspired me to revisit LSIDs (see below).

But the other thing Roger has been doing is using GraphQL to provide a clean API for the designer working with him to use. I have avoided GraphQL because I couldn't see what problem it solved. It's not a graph query language (despite the name), it breaks HTTP caching, and it just seemed to be the SOAP of today. But, if Roger's using it, I figured there must be something good here (and yes, I'm aware that GraphQL has a huge chunk of developer mindshare). As I was playing with yet another knowledge graph project I kept running into the challenge of converting a bunch of SPARQL queries into something that could be easily rendered in a web page, which is when the utility of GraphQL dawned on me. The "graph" in this case is really a structured set of results that correspond to the information you want to render on a web page. This may be the result of quite a complex series of queries (in my case using SPARQL on a triple store) that nobody wants to actually see. The other motivator was seeing DataCite's use of GraphQL to query the "PID Graph". So, I think I get it now, in the sense that I see why it is useful.

LSIDs back from the dead

In part inspired by Roger Hyam's work on WFO I released a Life Science Identifier (LSID) Resolver to make LSIDs resolvable. I'll spare you the gory details, but you can think of LSIDs as DOIs for taxonomic names. They came complete with a decentralised resolution mechanism (based on Internet domain names) and standards for what information they return (RDF as XML), and millions were minted for animal, fungal, and plant names. For various reasons they didn't really take off (they were technically tricky to use and didn't return information in a form people could read, so what were the chances?). Still, they contain a lot of valuable information for those of us interested in having lists of names linked to the primary literature. Over the years I have been collecting them and wanted a way to make them available. I've chosen a super-simple approach based on storing them in compressed form in GitHub and wrapping that repo in a simple web site. Lots of limitations, but I like the idea that LSIDs actually, you know, persist.
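For anyone who hasn't met one, an LSID is a URN of the form `urn:lsid:authority:namespace:object`, optionally followed by a revision, where the authority is usually an Internet domain name. Below is a minimal parsing sketch; the helper names are mine, the example identifier is only illustrative of the pattern, and this is not the code behind the resolver:

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str           # usually a domain name, e.g. "ipni.org"
    namespace: str           # e.g. "names"
    object_id: str           # identifier local to the authority
    revision: Optional[str]  # optional version component


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:<authority>:<namespace>:<object>[:<revision>]."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Illustrative identifier following the pattern used for plant names.
print(parse_lsid("urn:lsid:ipni.org:names:20012728-1"))
```

Splitting on the authority and namespace is also a natural way to decide where a stored record lives, e.g. one compressed file per authority and namespace, although that layout is my guess rather than a description of the repository.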

DOIs for Biodiversity Heritage Library

In between everything else I've been working with BHL to add DOIs to the literature that they have scanned. Some of this literature is old and of limited scientific value (but sure looks pretty - Nicole Kearney is going to take me to task for saying that), but a lot of it is recent, rich, and scientifically valuable. I'm hoping that the coming months will see a lot of this literature emerge from relative obscurity and become a first class citizen of the taxonomic and biodiversity literature.

Summary

I guess something meaningful and deep should go here... nope, I'm done.

Wednesday, October 21, 2020

GBIF Challenge success

Somewhat stunned by the fact that my DNA barcode browser I described earlier was one of the (minor) prizewinners in this year's GBIF Ebbe Nielsen Challenge. For details on the winner and other place getters see ShinyBIOMOD wins 2020 GBIF Ebbe Nielsen Challenge. Obviously I'm biased, but it's nice to see the challenge inspiring creativity in biodiversity informatics. Congratulations to everyone who took part.

My entry is live at https://dna-barcode-browser.herokuapp.com. I had a difficult time keeping it online over the summer due to meow attacks, but managed to sort that out. Turns out the cloud provider I used to host Elasticsearch switched from securing the server by default to making it completely open to anyone, and I'd missed that change.

Given that the project was a success, I'm tempted to revisit it and explore further ways to combine phylogenetic trees in a biodiversity browser.


Wednesday, July 22, 2020

DNA barcode browser

Motivated by the 2020 Ebbe Nielsen Challenge I've put together an interactive DNA barcode browser. The app is live at https://dna-barcode-browser.herokuapp.com.


A naturalist from the 19th century would find little in GBIF that they weren’t familiar with. We have species in a Linnean hierarchy, their distributions plotted on a map. This method of summarising data is appropriate to much of the data in GBIF, but impoverishes the display of sequence data such as barcodes. Given a set of DNA barcodes we can compute a phylogeny for those sequences, and gain evidence for taxonomic groups, intraspecific genetic structure, etc. So I wanted to see if it was possible to make a simple tool to interactively explore barcode data. This means we need fast methods for searching for similar sequences, and building phylogenies. I've been experimenting with ways to do this for the last couple of years, but have only now managed to put something together. For more details, see the repository. There is also a quick introductory video.

Friday, July 20, 2018

Signals from Singapore: NGS barcoding, generous interfaces, the return of faunas, and taxonomic burden

Earlier this year I stopped over in Singapore, home of the spectacular "supertrees" in the Gardens by the Bay. The trip was a holiday, but I spent a good part of one day visiting Rudolf Meier's group at the National University of Singapore. Chatting with Rudolf was great fun: he's opinionated and not afraid to share those opinions with anyone who will listen. Belatedly I've finally written up some of the topics we discussed.

Massively scalable and cheap DNA barcoding

Singapore has a rich fauna in a small area, full of undescribed species, so DNA barcoding seems an obvious way to get a handle on its biodiversity. Rudolf has been working towards scalable and cheap barcoding, e.g. $1 DNA barcodes for reconstructing complex phenomes and finding rare species in specimen‐rich samples https://doi.org/10.1111/cla.12115 . His lab can sequence short (~300 bp) barcode sequences for around $US 0.50 per specimen. Their pipeline generates lots of data, accompanied by high quality photographs of exemplar specimens, which contribute to The Biodiversity of Singapore, a "Digital Reference Collection for Singapore's Biodiversity". This site provides a simple but visually striking way to explore Singapore's biota, and is a nice example of what Mitchell Whitelaw calls "generous interfaces". We could do with more of these for biodiversity data.


One nice feature of regular COI DNA barcodes is that they are comparable across labs because everyone is sequencing the same stretch of DNA. With short barcodes, different groups may target different regions of the COI gene, resulting in sequences that can't be compared. For example, the 127 bp mini-barcodes developed in A universal DNA mini-barcode for biodiversity analysis https://doi.org/10.1186/1471-2164-9-214 are completely disjoint from the ~300 bp sequenced by Meier's group (I'm trying to keep track of some of these short barcodes here: https://gist.github.com/rdmpage/4f2545eeea4756565925fb4307d9af6b).

The return of regional faunas

In the "old days" of colonial expansion it was common for taxonomists to write volumes entitled "The Fauna of [insert colonised country here]". These were regional works focussing on a particular area, often motivated by trying to catalogue animals of potential economic or medical importance, as well as of scientific interest. Because their geographic scope is limited, faunal treatments of taxa can sometimes be inadequate. Descriptions of new species from a particular area may be hard to compare with descriptions of species in the same group that occur elsewhere and are described by other taxonomists. It may be that to do the taxonomy of a particular group well you need to treat that group throughout its geographic range, rather than just those species in your geographic area. Hence faunas lose their scientific appeal, despite the attractiveness of having a detailed summary of the fauna of a particular area. DNA sequencing circumvents this problem by providing a universally comparable character. You can sequence everything within a geographic region, but those sequences will be directly comparable to sequences found elsewhere. Barcoding makes faunas attractive again, which may help fund taxonomic research because projects with a restricted national scope become scientifically worthwhile.

Taxonomic burden and legacy names

As we discover and catalogue more and more of the planet's biodiversity we want to stick names on that biodiversity, and this can be a significant challenge when there is a taxonomic legacy of names that are so poorly described it is hard to establish how they relate to the material we are working with. Even if you have access to the primary literature through digitisation projects like BHL, if the descriptions are poor, if the types are lost or their identity is confused (see for example A New Species of Megaselia Rondani (Diptera: Phoridae) from the Bioscan Project in Los Angeles, California, with Clarification of Confused Type Series for Two Other Species https://doi.org/10.4289/0013-8797.118.1.93 by Emily A. Hartop - who I met on this trip - and colleagues), or can't be sequenced, then these names will remain ambiguous, potentially clogging up efforts to name the unnamed species. One approach favoured by Rudolf is to effectively wipe the slate clean, declare all ambiguous names before a certain date to be null and void, and start again. This overrides (or rather, resets) the notion of priority (given two names for the same species, the older name is the one to use) and so is likely to be a hard sell, but it is part of the ongoing discussion about the impact of molecular data on naming taxa. Similar discussions are raging at the moment in mycology, e.g. Ten reasons why a sequence-based nomenclature is not useful for fungi anytime soon https://doi.org/10.5598/imafungus.2018.09.01.11, yet another reflection of how much taxonomy is driven by technology.

Friday, March 24, 2017

This is what phylodiversity looks like

Following on from earlier posts exploring how to map DNA barcodes and putting barcodes into GBIF it's time to think about taking advantage of what makes barcodes different from typical occurrence data. At present GBIF displays data as dots on a map (as do I in http://iphylo.org/~rpage/bold-map/). But barcodes come with a lot more information than that. I'm interested in exploring how we might measure and visualise biodiversity using just sequences.

Based on a talk by Zachary Tong (Going Organic - Genomic sequencing in Elasticsearch) I've started to play with n-gram searches on DNA barcodes using Elasticsearch, an open source search engine. The idea is that we break the DNA sequence into every possible "word" of length n (also called a k-mer or k-tuple, where k = n).

For example, for n = 5, the sequence GTATCGGTAACGAACTT would look like this:

GTATCGGTAACGAACTT

GTATC
 TATCG
  ATCGG
   TCGGT
    CGGTA
     GGTAA
      GTAAC
       TAACG
        AACGA
         ACGAA
          CGAAC
           GAACT
            AACTT
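The sliding window is a one-liner in most languages. Here is a small sketch in Python (the function name is mine) that generates every overlapping word of length n from the example sequence:

```python
from typing import List


def kmers(sequence: str, k: int = 5) -> List[str]:
    """Return every overlapping substring of length k, left to right."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("GTATCGGTAACGAACTT", 5)
for offset, word in enumerate(words):
    print(" " * offset + word)  # indent each word to show its position
```

A sequence of length L yields L - k + 1 words, so the 17-base example gives 13 five-letter words, each exactly five bases long.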

The sequence GTATCGGTAACGAACTT comes from Hajibabaei and Singer (2009) who discussed "Googling" DNA sequences using search engines (see also Kuksa and Pavlovic, 2009). If we index sequences in this way then we can do BLAST-like searches very quickly using Elasticsearch. This means it's feasible to take a DNA barcode and ask "what sequences look like this?" and return an answer quickly enough for a user not to get bored waiting.
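As a rough idea of what such an index might look like, the sketch below builds the request bodies as plain Python dictionaries. Elasticsearch does provide an `ngram` tokenizer and a `match` query, but the index layout, the field names (`sequence`, `location`) and the analyzer names are assumptions of mine, and option names have shifted between Elasticsearch versions, so treat this as a starting point rather than my actual configuration:

```python
import json


def barcode_index_body(k: int = 5) -> dict:
    """Index settings that tokenise the 'sequence' field into k-letter words."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # Fixed-length n-grams: min_gram == max_gram == k
                    "kmer_tokenizer": {"type": "ngram", "min_gram": k, "max_gram": k}
                },
                "analyzer": {
                    "kmer_analyzer": {
                        "type": "custom",
                        "tokenizer": "kmer_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "sequence": {"type": "text", "analyzer": "kmer_analyzer"},
                "location": {"type": "geo_point"},  # hypothetical field name
            }
        },
    }


def similar_sequence_query(sequence: str, size: int = 20) -> dict:
    """Full-text query: the query string is tokenised with the same analyzer,
    so hits are ranked by how many k-letter words they share with it."""
    return {"size": size, "query": {"match": {"sequence": sequence}}}


print(json.dumps(barcode_index_body(), indent=2))
print(json.dumps(similar_sequence_query("GTATCGGTAACGAACTT"), indent=2))
```

The first body would be sent when creating the index and the second to the search endpoint; relevance scoring then does the job that an alignment score does in BLAST, only much more crudely.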

Another nice feature of Elasticsearch is that it supports geospatial queries, so we can ask for, say, all the barcodes in a particular region. Having got such a list, what we really want is not a list of sequences but a phylogenetic tree. Traditionally this can be a time-consuming operation: we have to take the sequences, align them, then input that alignment into a tree-building algorithm. Or do we?

There's growing interest in "alignment-free" phylogenetics, a phrase I'd heard but not really followed up. Yang and Zhang (2008) described an approach where every sequence is encoded as a vector of all possible k-tuples. For DNA sequences with k = 5 there are 4^5 = 1024 possible combinations of the bases A, C, G, and T, so a sequence is represented as a vector with 1024 elements, each of which is the frequency of the corresponding 5-tuple. The "distance" between two sequences is the mathematical distance between the vectors for the two sequences. Hence we no longer need to align the sequences being compared; we simply chunk them into all "words" of 5 bases in length, and compare the frequencies of the 1024 different possible "words".
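Here is a minimal sketch of that idea. I've used plain Euclidean distance between relative-frequency vectors, which is my assumption about what "mathematical distance" should mean here; check Yang and Zhang (2008) for the exact definition they use. Words containing anything other than A, C, G or T (such as N) are simply skipped:

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import List

ALPHABET = "ACGT"


def ktuple_vector(sequence: str, k: int = 5) -> List[float]:
    """Relative frequency of each of the 4**k possible k-tuples, in a fixed order."""
    sequence = sequence.upper()
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[w] for w in words)  # ignores words with ambiguity codes
    return [counts[w] / total if total else 0.0 for w in words]


def ktuple_distance(a: str, b: str, k: int = 5) -> float:
    """Euclidean distance between the k-tuple frequency vectors of a and b."""
    va, vb = ktuple_vector(a, k), ktuple_vector(b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


seq1 = "GTATCGGTAACGAACTT"
seq2 = "GTATCGGTTACGAACTT"  # seq1 with a single substitution (made up)
print(len(ktuple_vector(seq1)))      # 1024 elements for k = 5
print(ktuple_distance(seq1, seq1))   # identical sequences are at distance 0.0
print(ktuple_distance(seq1, seq2))   # a positive distance
```

Computing this for every pair of sequences gives the distance matrix that a tree-building method needs, with no alignment step anywhere.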

In their study Yang and Zhang (2008) found that:

We compared tuples of different sizes and found that tuple size 5 combines both performance speed and accuracy; tuples of shorter lengths contain less information and include more randomness; tuples of longer lengths contain more information and less randomness, but the vector size expands exponentially and gets too large and computationally inefficient.

So we can use the same word size for both Elasticsearch indexing and for computing the distance matrix. We still need to create a tree, for which we could use something quick like neighbour-joining (NJ). This method is sufficiently quick to be available in Javascript and hence can be computed by a web browser (e.g., biosustain/neighbor-joining).
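To show how little machinery NJ needs, here is a compact sketch of the textbook Saitou and Nei algorithm in Python. It is not the API of the JavaScript library mentioned above; the function name, the dict-of-dicts input and the edge-list output are all my own choices, and the test matrix is the small additive example commonly used to illustrate the method:

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """names: list of tip labels; dist: symmetric dict-of-dicts of distances.
    Returns the unrooted tree as a list of (node, node, branch_length) edges."""
    d = {a: dict(dist[a]) for a in names}  # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        totals = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) d(i, j) - r(i) - r(j)
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - totals[p[0]] - totals[p[1]],
        )
        # Branch lengths from i and j to the new internal node u
        len_i = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        u = f"node{next_id}"
        next_id += 1
        edges.append((u, i, len_i))
        edges.append((u, j, len_j))
        # Distances from u to every remaining node
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # final edge joins the last two nodes
    return edges


labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
matrix = {x: {y: rows[r][c] for c, y in enumerate(labels)} for r, x in enumerate(labels)}
tree = neighbor_joining(labels, matrix)
for edge in tree:
    print(edge)
print("tree length:", sum(length for _, _, length in tree))
```

This naive version recomputes the row totals on every pass, which is fine for a couple of hundred sequences but would need a smarter implementation for thousands.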

Putting this all together, I've built a rough-and-ready demo that takes some DNA barcodes, puts them on a map, then enables you to draw a box on a map and the demo will retrieve the DNA barcodes in that area, compute a distance matrix using 5-tuples, then build a NJ tree, all on the fly in your web browser.

This is all very crude, and I need to explore scalability (at the moment I limit the results to the first 200 DNA sequences found), but it's encouraging. I like the idea that, in principle, we could go to any part of the globe, ask "what's there?" and get back a phylogenetic tree for the DNA barcodes in that area.

This also means that we could start exploring phylogenetic diversity using DNA barcodes, as Faith & Baker (2006) wanted a decade ago:

...PD has been advocated as a way to make the best-possible use of the wealth of new data expected from large-scale DNA “barcoding” programs. This prospect raises interesting bio-informatics issues (discussed below), including how to link multiple sources of evidence for phylogenetic inference, and how to create a web-based linking of PD assessments to the barcode–of-life database (BoLD).

The phylogenetic diversity of an area is essentially the length of the tree of DNA barcodes, so if we build a tree we have a measure of diversity. Note that this contrasts with other approaches, such as Miraldo et al.'s "An Anthropocene map of genetic diversity" which measured genetic diversity within species but not between (!).
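If the tree is available in Newick format, "the length of the tree" is just the sum of the branch lengths. Here is a deliberately naive sketch that pulls the numbers following each colon with a regular expression; the function name and the example tree are made up, and it assumes labels contain no colons (quoted labels would need a real Newick parser):

```python
import re

# A branch length is the number that follows a ':' in Newick.
BRANCH_LENGTH = re.compile(r":\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def tree_length(newick: str) -> float:
    """Sum of all branch lengths in a Newick string (a simple PD measure)."""
    return sum(float(x) for x in BRANCH_LENGTH.findall(newick))


# Hypothetical three-barcode tree for one area.
example = "((A:0.1,B:0.2):0.05,C:0.3);"
print(round(tree_length(example), 6))  # 0.65
```

Note that neighbour-joining can produce small negative branch lengths, which this sum would include as-is; whether to clamp them to zero first is a judgment call.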

Practical issues

There are a bunch of practical issues to work through, such as how scalable it is to compute phylogenies using Javascript on the fly. For example, could we do something like generate a one degree by one degree grid of the Earth, take all the barcodes in each cell and compute a phylogeny for each cell? Could we do this in CouchDB? What about sampling, should we be taking a finite, random sample of sequences so that we try and avoid sampling bias?
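As a sketch of the gridding and sampling idea, the code below assigns records to one-degree cells, draws a capped random sample per cell, and builds the Elasticsearch `geo_bounding_box` query for a cell. The record layout (`lat`, `lon` keys), the `location` field name and the cap of 200 are assumptions of mine for illustration:

```python
import math
import random
from collections import defaultdict


def grid_cell(lat: float, lon: float) -> tuple:
    """South-west corner of the one-degree cell containing the point."""
    return (math.floor(lat), math.floor(lon))


def sample_by_cell(records, max_per_cell=200, seed=0):
    """Group records by cell, keeping at most max_per_cell random records each."""
    cells = defaultdict(list)
    for rec in records:
        cells[grid_cell(rec["lat"], rec["lon"])].append(rec)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return {
        cell: rng.sample(recs, max_per_cell) if len(recs) > max_per_cell else list(recs)
        for cell, recs in cells.items()
    }


def cell_query(cell, size=200) -> dict:
    """Elasticsearch request body for all barcodes inside one grid cell."""
    lat, lon = cell
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        "location": {  # hypothetical geo_point field
                            "top_left": {"lat": lat + 1, "lon": lon},
                            "bottom_right": {"lat": lat, "lon": lon + 1},
                        }
                    }
                }
            }
        },
    }


# Made-up records: 300 points in one cell, 1 point in another.
points = [{"id": i, "lat": -35.27, "lon": 149.08} for i in range(300)]
points.append({"id": 300, "lat": 1.35, "lon": 103.82})
sampled = sample_by_cell(points)
print({cell: len(recs) for cell, recs in sampled.items()})
print(cell_query(grid_cell(-35.27, 149.08)))
```

Simple random sampling within a cell does nothing about taxonomic or collector bias; stratifying by, say, BIN before sampling might be a better way to tackle that.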

There are also data management issues. I'm exploring downloading DNA barcodes, creating a Darwin Core Archive file using the Global Genome Biodiversity Network (GGBN) data standard, then converting the Darwin Core Archive into JSON and sending that to Elasticsearch. The reason for the intermediate step of creating the archive is so that we can edit the data, add missing geospatial information, etc. I envisage having a set of archives, hosted say on GitHub. These archives could also be directly imported into GBIF, ready for the time that GBIF can handle genomic data.
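The conversion step could be sketched as below: read a tab-delimited occurrence table of the kind found inside a Darwin Core Archive and emit newline-delimited JSON in the shape expected by the Elasticsearch bulk API. `occurrenceID`, `decimalLatitude` and `decimalLongitude` are Darwin Core terms; the `sequence` column, the index name, the sample rows, and the assumption that everything has been flattened into a single table are mine:

```python
import csv
import io
import json


def occurrences_to_bulk(tsv_text: str, index_name: str = "barcodes") -> str:
    """Convert a tab-delimited occurrence table into bulk-API NDJSON."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        doc = {"id": row["occurrenceID"], "sequence": row.get("sequence", "")}
        lat, lon = row.get("decimalLatitude"), row.get("decimalLongitude")
        if lat and lon:  # records without coordinates get no location
            doc["location"] = {"lat": float(lat), "lon": float(lon)}
        # Each document is preceded by an action line naming index and id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Two made-up records; the second lacks coordinates.
sample = (
    "occurrenceID\tdecimalLatitude\tdecimalLongitude\tsequence\n"
    "EXAMPLE-001\t-35.27\t149.08\tGTATCGGTAACGAACTT\n"
    "EXAMPLE-002\t\t\tGTATCGGTTACGAACTT\n"
)
print(occurrences_to_bulk(sample), end="")
```

Because the archive sits between the download and the index, fixing a bad coordinate is just an edit to the table followed by a re-run of this step.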

References

  • Faith, D. P., & Baker, A. M. (2006). Phylogenetic diversity (PD) and biodiversity conservation: some bioinformatics challenges. Evol Bioinform Online. 2006; 2: 121–128. PMC2674678
  • Hajibabaei, M., & Singer, G. A. (2009). Googling DNA sequences on the World Wide Web. BMC Bioinformatics. Springer Nature. https://doi.org/10.1186/1471-2105-10-s14-s4
  • Kuksa, P., & Pavlovic, V. (2009). Efficient alignment-free DNA barcode analytics. BMC Bioinformatics. Springer Nature. https://doi.org/10.1186/1471-2105-10-s14-s9
  • Miraldo, A., Li, S., Borregaard, M. K., Florez-Rodriguez, A., Gopalakrishnan, S., Rizvanovic, M., … Nogues-Bravo, D. (2016, September 29). An Anthropocene map of genetic diversity. Science. American Association for the Advancement of Science (AAAS). https://doi.org/10.1126/science.aaf4381
  • Yang, K., & Zhang, L. (2008, January 10). Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Research. Oxford University Press (OUP). https://doi.org/10.1093/nar/gkn075

Thursday, December 22, 2016

DNA barcoding taxonomy now in GBIF

Following on from adding DNA barcodes to GBIF I've now uploaded a taxonomic classification of DNA barcode BINs (Barcode Index Numbers). Each BIN is a cluster of similar DNA barcodes that is essentially equivalent to a species. For more details see:

Ratnasingham, S., & Hebert, P. D. N. (2013, July 8). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.), PLoS ONE. Public Library of Science (PLoS). https://doi.org/10.1371/journal.pone.0066213
The data I've uploaded was obtained by screen scraping the BOLD web site for each BIN in the DNA barcode dataset (BOLD's API doesn't let me get all the information I want). In addition to the taxonomic hierarchy associated with each BIN I've also extracted any publications mentioned on the BIN page, and subsequently tried to link those to the corresponding DOI, if the publication has one. The code for all this is available on GitHub https://github.com/rdmpage/bold-bins, which also serves as the host for the Darwin Core Archive for this dataset. There's a neat trick where you can use a .gitattributes file to tell GitHub not to store certain files in the zip file it creates for the repository (see Excluding files from git archive exports using gitattributes by @fmarier).

Having done this, I've a few thoughts.

Please, please use DOIs for articles

BOLD pages for BINs often include one or more papers that published the barcodes included in that BIN. This is great, but often links to these papers are pretty strange:

If you are going to store literature in a database, treat links to articles with great care. They are often full of extraneous stuff that depends on how the user reached that article online. DOIs greatly simplify this process. Instead of a URL like http://onlinelibrary.wiley.com/store/10.1111/j.1755-0998.2009.02650.x/asset/j.1755-0998.2009.02650.x.pdf?v=1&t=hellc54c&s=e14bbc4146b66a051ad5cd1f5361ac2e16dc5831&systemMessage=Pay+Per+View+will+be+unavailable+for+upto+3+hours+from+06%3A00+EST+March+23rd+ (I kid you not) you should use the DOI 10.1111/j.1755-0998.2009.02650.x.
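When a DOI happens to be embedded in a messy URL it can often be recovered with a regular expression. The sketch below is a heuristic of my own, not the method I used for this dataset: it stops at the first slash after the DOI suffix, which works for URLs like the one above but would truncate the (legal) DOIs whose suffix itself contains a slash:

```python
import re
from typing import Optional

# Prefix "10." + registrant code, a slash, then a suffix that ends at the
# next path separator, query string, fragment or whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s/?#&]+")


def doi_from_url(url: str) -> Optional[str]:
    """Return the first DOI-like string found in url, lower-cased, or None."""
    match = DOI_PATTERN.search(url)
    if match is None:
        return None
    return match.group(0).lower()  # DOIs are case-insensitive


# Shortened form of the publisher URL quoted above.
messy = (
    "http://onlinelibrary.wiley.com/store/10.1111/j.1755-0998.2009.02650.x"
    "/asset/j.1755-0998.2009.02650.x.pdf?v=1"
)
print(doi_from_url(messy))
print(doi_from_url("https://doi.org/10.3897/bdj.3.e6313"))
print(doi_from_url("http://example.org/no-doi-here"))
```

Anything recovered this way should still be checked against a DOI registry before it goes into a database, since a string that merely looks like a DOI may not resolve.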

Adding DOIs to these articles means GBIF will display them on the corresponding species page, for example Centromerus sylvaticus (Blackwall, 1841) has links to these two papers:

Telfer, A., deWaard, J., Young, M., Quinn, J., Perez, K., Sobel, C., … Hebert, P. (2015, August 30). Biodiversity inventories in high gear: DNA barcoding facilitates a rapid biotic survey of a temperate nature reserve. Biodiversity Data Journal. Pensoft Publishers. https://doi.org/10.3897/bdj.3.e6313
Blagoev, G. A., deWaard, J. R., Ratnasingham, S., deWaard, S. L., Lu, L., Robertson, J., … Hebert, P. D. N. (2015, July 26). Untangling taxonomy: a DNA barcode reference library for Canadian spiders. Molecular Ecology Resources. Wiley-Blackwell. https://doi.org/10.1111/1755-0998.12444
Now GBIF users can easily explore what we know about barcodes from this species by going directly to the primary literature.

Dark taxa

In an earlier post I discussed dark taxa, which are taxa that lack formal scientific names. BOLD is full of these, so many of the taxa I've added to GBIF don't have Linnean names. Instead I've used a combination of higher taxon name and the BIN itself.

Composite taxa

Having said that BINs are essentially the same as species, this need not imply that there's a one-to-one match between BINs and currently recognised species (indeed, this is one of the things that makes barcoding so interesting: its ability to discover hidden variation within taxa currently considered to be a single species). This means that some BINs will have the same name (significant variation within a species), and some BINs will have multiple names (more than one species name assigned to the same BIN). For example, BOLD:AAA2525 is a cluster of DNA barcodes with the following names attached:
  • Icaricia lupini
  • Icaricia acmon
  • Icaricia neurona
  • Plebejus lupini
  • Aricia sp. RV-2009
  • Aricia acmon
  • Plebejus acmon
  • Plebejus elvira
  • Icaricia lupini texanus
  • Icaricia lupini monticola
  • Icaricia lupini chlorina
  • Icaricia lupini lupini
  • Icaricia lupini alpicola
This cluster of names includes subspecies, synonyms (e.g. ). Looking at the phylogeny for this BIN (PDF-only), some of these names are intermingled, suggesting that some specimens might be misidentified; apparently Icaricia lupini and I. acmon are very similar:
Coutsis, J. G. (2011). The male genitalia of N American Icaricia lupini and I. acmon; how they differ from each other and how they compare to those of the other two members of the group, I. neurona and I. shasta (Lepidoptera: Lycaenidae, Polyommatiti). Phegea, 39(4), 144-151. Retrieved from http://biostor.org/reference/160269
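The many-to-many relationship between BINs and names is easy to surface once the data are in pairs. In the sketch below the function names are mine, the BOLD:AAA2525 names are taken from the list above, and the second BIN is a made-up identifier included only to show a name shared across BINs:

```python
from collections import defaultdict


def index_bins(pairs):
    """Build both directions of the BIN <-> name relationship."""
    names_by_bin = defaultdict(set)
    bins_by_name = defaultdict(set)
    for bin_id, name in pairs:
        names_by_bin[bin_id].add(name)
        bins_by_name[name].add(bin_id)
    return names_by_bin, bins_by_name


def composites(index):
    """Keys that map to more than one value, with values sorted for display."""
    return {key: sorted(values) for key, values in index.items() if len(values) > 1}


pairs = [
    ("BOLD:AAA2525", "Icaricia lupini"),
    ("BOLD:AAA2525", "Icaricia acmon"),
    ("BOLD:AAA2525", "Plebejus lupini"),
    ("BOLD:ZZZ0000", "Icaricia acmon"),  # hypothetical BIN, for illustration only
]
names_by_bin, bins_by_name = index_bins(pairs)
print("BINs with several names:", composites(names_by_bin))
print("Names spread over several BINs:", composites(bins_by_name))
```

Neither list distinguishes synonyms from misidentifications or genuinely composite clusters; that still needs a taxonomist, or at least a synonymy lookup, to sort out.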

Summary

This is a first attempt to integrate DNA barcode taxonomy into GBIF, so there are going to be some issues to explore. GBIF currently assumes taxa can be easily mapped to a Linnean hierarchy. While this is ultimately likely to be true for animal COI barcodes, getting there is going to be messy while we have numerous dark taxa and/or BINs which don't match the current identifications of the voucher specimens.

Perhaps it's worth asking whether attempting to fit the results of DNA barcoding into a classical taxonomy is the best way forward. In doing so we lose much of what makes barcoding so powerful, namely a specimen-level phylogenetic tree. Maybe what we should really be thinking about is ways to explore barcoding data natively. See Notes on next steps for the million DNA barcodes map for some thoughts on how to do that.

Image from Wikimedia Commons The Face of a Lupine Blue by Ingrid Taylar.

Thursday, December 08, 2016

iBOL DNA barcodes in GBIF

I've uploaded all the COI barcodes in the iBOL public data dumps into GBIF. This is an update of an earlier project that uploaded a small subset. Now that dataset doi:10.15468/inygc6 has been expanded to include some 2.7 million barcodes. In the new GBIF portal (work in progress) the map for these barcodes looks like this:

[Image: screenshot of the barcode map in the new GBIF portal]

Many of these records have images of the specimens that were sequenced, and the new GBIF "gallery" feature displays these nicely, e.g.:

[Image: screenshot of the GBIF gallery showing specimen images]

Having done this, I've a few thoughts.

Why did I do this?

Why did I do this, or, put another way, why didn't iBOL do this already? In an ideal world, iBOL would be an active contributor to GBIF and would be routinely uploading barcodes. Since this isn't happening, I've gone ahead and uploaded the barcodes myself. From my perspective, I want as much data to be as discoverable and as accessible as possible, hence if need be I'll grab data from wherever it lives and add it to GBIF (for an earlier example see The Zika virus, GBIF, and the missing mosquitoes). A downside of this is that, long term, the relationship between data provider and GBIF may be as valuable to GBIF as the data, and simply grabbing and reformatting data doesn't, by itself, form that relationship. But in the absence of a working relationship I still need the data.

Where are the taxonomic names?

Lots of barcodes lack formal scientific names, even though in many cases BOLD has them. The data in the public dumps often lacks this information. A next logical step would be to harvest data from the BOLD API and add taxonomic names as "identifications".

Where are the sequences?

The sequences themselves aren't in GBIF, which on the one hand is not surprising as GBIF isn't a sequence database. However, I think it should be, in the sense that for a lot of biodiversity, sequences are going to be the only way forward. This includes the eukaryote barcodes, bacterial sequences, and metabarcodes. Fundamentally sequences are just strings of letters, and GBIF already handles those (e.g., taxonomic names, geographic places, etc.). Furthermore, the following paper by Bittner et al. makes a strong case that rather than knowing "what is there?" it's more important to know "what are they doing?"

Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biology Direct. Springer Nature. https://doi.org/10.1186/1745-6150-5-47

In other words, a functional approach may matter more than a purely taxonomic approach to diversity. For a big chunk of biology this is going to depend on analysing sequences. Even if we restrict ourselves to just taxonomic diversity, there is scope for expanding our notion of what we display once we have sequences and evolutionary trees, e.g. Notes on next steps for the million DNA barcodes map.

Tuesday, May 10, 2016

Notes on next steps for the million DNA barcodes map

Some notes to self about future directions for the "million DNA barcodes map" http://iphylo.org/~rpage/bold-map/.

Screenshot 2016 05 10 13 52 09

At the moment we have an interactive map that we can pan and zoom, and click on a marker to get a list of one or more barcodes at the location. We can also filter by major taxonomic group. Here are some ideas on what could be next.

Search

At the moment search is simply browsing the map. It would be handy to be able to enter a taxon or a barcode identifier and go to the corresponding markers on the map.

What is this?

If we have a single DNA barcode I immediately want to know "what is this?" A picture may help, and I may look up the scientific name in BioNames, but perhaps the most obvious thing to do is get a phylogeny for that barcode and similar sequences. These could then be displayed on the map using the technique I described in Visualising Geophylogenies in Web Maps Using GeoJSON (see also http://dx.doi.org/10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e).

So, ideally we would:

  1. Display information about that barcode (e.g., taxonomic identification where known).
  2. Display the local phylogeny of barcodes that contains this barcode.
  3. Display that phylogeny on the map.

Hence we need to be able to generate a local phylogeny of barcodes, either on the fly (retrieve similar sequences then build a tree) or using a precomputed global barcode phylogeny from which we pull out the local subtree.
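As a rough illustration of the "retrieve similar sequences" step, here is a minimal sketch that ranks barcodes by uncorrected p-distance. It assumes the sequences are already aligned to the same length, and the function names (pDistance, nearestBarcodes) and the toy data are made up for this example. A real service would more likely use BLAST or a similar index than compare everything pairwise.

```javascript
// Uncorrected p-distance between two aligned sequences of equal length.
// Positions where either sequence has a gap ("-") or an "N" are skipped.
function pDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("sequences must be aligned to the same length");
  }
  let compared = 0;
  let differences = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i].toUpperCase();
    const y = b[i].toUpperCase();
    if (x === "-" || y === "-" || x === "N" || y === "N") continue;
    compared++;
    if (x !== y) differences++;
  }
  return compared === 0 ? NaN : differences / compared;
}

// Return the ids of the k sequences closest to the query (hypothetical helper).
// sequences is an object mapping an id to an aligned sequence string.
function nearestBarcodes(queryId, sequences, k) {
  const query = sequences[queryId];
  return Object.keys(sequences)
    .filter((id) => id !== queryId)
    .map((id) => ({ id: id, distance: pDistance(query, sequences[id]) }))
    .sort((p, q) => p.distance - q.distance)
    .slice(0, k)
    .map((hit) => hit.id);
}

// Toy data: invented identifiers and sequences, for illustration only.
const toyBarcodes = {
  query: "ACGTACGTAC",
  near: "ACGTACGTAT",
  far: "TTGTACCTAA",
};
console.log(nearestBarcodes("query", toyBarcodes, 2));
```

The ids returned could then be handed to whatever tree-building step is used to produce the local phylogeny.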

What is there?

A question that the map doesn't really answer is "what is the diversity of a given area?". Yes there are lots of dots, and you can click on them, but what would be nice is the ability to draw a polygon on the map (like this) and get a summary of the phylogenetic diversity of barcodes within that area.
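To make the polygon idea a little more concrete, here is a minimal sketch of the filtering step using the standard ray-casting point-in-polygon test. It treats longitude and latitude as plain planar coordinates, so it ignores polygons that cross the antimeridian or the poles. The record shape ({id, lat, lon}) and the function names are assumptions for illustration.

```javascript
// Ray-casting test: is the point inside the polygon?
// polygon is an array of [lon, lat] vertices; the ring is closed implicitly.
function pointInPolygon(point, polygon) {
  const px = point[0];
  const py = point[1];
  let inside = false;
  for (let i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
    const xi = polygon[i][0], yi = polygon[i][1];
    const xj = polygon[j][0], yj = polygon[j][1];
    // Does a horizontal ray from the point cross the edge (j -> i)?
    const crosses =
      (yi > py) !== (yj > py) &&
      px < ((xj - xi) * (py - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Keep only the barcodes whose coordinates fall inside the polygon.
function barcodesInPolygon(barcodes, polygon) {
  return barcodes.filter((b) => pointInPolygon([b.lon, b.lat], polygon));
}

// Toy example: an invented square region and two invented records.
const region = [[0, 0], [10, 0], [10, 10], [0, 10]];
const records = [
  { id: "inside", lat: 5, lon: 5 },
  { id: "outside", lat: 5, lon: 15 },
];
console.log(barcodesInPolygon(records, region).map((b) => b.id));
```

For millions of points a spatial index would be needed in front of a test like this, but the test itself stays the same.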

For example, imagine drawing a polygon around Little Barrier Island in New Zealand. Can we effectively retrieve the data published by Drummond et al. (Evaluating a multigene environmental DNA approach for biodiversity assessment DOI:10.1186/s13742-015-0086-1)?

To support "what is there?" queries we need to be able to:

  1. Draw an arbitrary spatial region on the map and retrieve a set of sequences found within that region
  2. Retrieve the phylogeny for that set of sequences

Once again, we either need to be able to build a phylogeny for an arbitrary set of sequences on the fly, or extract a subtree. If a global tree is available, we could compute the length of the subtree, and also compute a visual layout fairly easily (essentially with time proportional to the number of sequences).
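Here is one way the "length of the subtree" could be computed, as a minimal sketch. It assumes the global tree is stored as parent pointers with branch lengths (an invented representation), and it takes "subtree length" to mean the total branch length of the smallest subtree connecting the chosen tips; other definitions, such as ones that include the path back to the root, are possible. This naive version walks from each tip up to the root, so it is simpler but slower than a single traversal.

```javascript
// tree maps a node id to { parent: id or null, length: branch length to parent }.
// Returns the summed branch lengths of the smallest subtree connecting the tips.
function subtreeLength(tree, tips) {
  const selected = Array.from(new Set(tips));
  const count = {}; // node id -> number of selected tips at or below that node
  for (const tip of selected) {
    for (let node = tip; node !== null; node = tree[node].parent) {
      count[node] = (count[node] || 0) + 1;
    }
  }
  let total = 0;
  for (const node of Object.keys(count)) {
    // The branch above a node is part of the connecting subtree only if
    // some, but not all, of the selected tips lie below it.
    if (count[node] < selected.length) total += tree[node].length;
  }
  return total;
}

// Toy tree (invented): ((a:2,b:3)n1:1,(c:1,d:2)n2:4)root;
const toyTree = {
  root: { parent: null, length: 0 },
  n1: { parent: "root", length: 1 },
  n2: { parent: "root", length: 4 },
  a: { parent: "n1", length: 2 },
  b: { parent: "n1", length: 3 },
  c: { parent: "n2", length: 1 },
  d: { parent: "n2", length: 2 },
};
console.log(subtreeLength(toyTree, ["a", "b"])); // 5
console.log(subtreeLength(toyTree, ["a", "c"])); // 8
```

In the toy tree, a and b are sisters so only their own branches count (2 + 3), whereas connecting a and c also needs the two internal branches (2 + 1 + 1 + 4).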

We'd also need to decide on the best way to visualise the phylogeny for the set of sequences. Perhaps something like Krona, or something more traditional.

Screen phymmbl

Summary

There doesn't seem to be any way of getting away from the need for a global phylogeny of COI DNA barcodes if I want to extend the functionality of the map.

Tuesday, September 01, 2015

Dark taxa, drones, and Dan Janzen: 6th International Barcode of Life Conference

A little over a week ago I was at the 6th International Barcode of Life Conference, held at Guelph, Canada. It was my first barcoding conference, and was quite an experience. Here are a few random thoughts.

Attendees

It was striking how diverse the conference crowd was. Apart from a few ageing systematists (including veterans of the cladistics wars), most people were young(ish), and from all over the world. There is clearly something about the simplicity and low barrier to entry of barcoding that has enabled its widespread adoption. This also helps give barcoding a cohesion: no matter what the taxonomic group or the problem you are tackling, you are doing much the same thing as everybody else (but see below). While ageing systematists (like myself) may hold their noses regarding the use of a single, short DNA sequence and a tree-building method some would dismiss as "phenetic", in many ways the conference was a celebration of global-scale phylogeography.

Standards aren't enough

And yet, standards aren't enough. I think what contributes to DNA barcoding's success is that sequences are computable. If you have a barcode, there's already a bunch of barcode sequences you can compare yours to. As others add barcodes, your sequences will be included in subsequent analyses, analyses which may help resolve the identity of what you sequenced.

To put this another way, we have standard image file formats, such as JPEG. This means you can send me a bunch of files, safe in the knowledge that because JPEG is a standard I will be able to open those files. But this doesn't mean that I can do anything useful with them. In fact, it's pretty hard to do anything with images apart from look at them. But if you send me a bunch of DNA sequences for the same region, I can build a tree, BLAST GenBank for similar sequences, etc. Standards aren't enough by themselves: to get the explosive growth that we see in barcodes, the thing you standardise on needs to be easy to work with, and have a computational infrastructure in place.

Next generation sequencing and the hacker culture

Classical DNA barcoding for animals uses a single, short mtDNA marker that people were sequencing a couple of decades ago. Technology has moved on, such that we're seeing papers such as An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. As I've argued earlier (Is DNA barcoding dead?) this misses the point about the power of standardisation on a simple, scalable method.

At the same time, it was striking to see the diversity of sequencing methods being used in conference presentations. Barcoding is a broad church, and it seemed like it was a natural home for people interested in environmental DNA. There was excitement about technologies such as the Oxford Nanopore MinION™, with people eager to share tips and techniques. There's something of a hacker culture around sequencing (see also Biohackers gear up for genome editing), just as there is for computer hardware and software.

Community

The final session of the conference started with some community bonding, complete with Paul Hebert versus Quentin Wheeler wielding light sabres. If, like me, you weren't a barcoder, things started getting a little cult-like. But there's no doubt that Paul's achievement in promoting a simple approach to identifying organisms, and then translating that into a multi-million dollar, international endeavour is quite extraordinary.

After the community bonding, came a wonderful talk by Dan Janzen. The room was transfixed as Dan made the case for conservation, based on his own life experiences, including Area de Conservación Guanacaste where he and Winnie Hallwachs have been involved since the 1970s. I sat next to Dan at a dinner after the conference, and showed him iNaturalist, a great tool for documenting biodiversity with your phone. He was intrigued, and once we found pictures taken near his house in Costa Rica, he was able to identify the individual animals in the photos, such as a bird that has since been eaten by a snake.

Dark taxa

My own contribution to the conference was a riff on the notion of dark taxa, and mostly consisted of me trying to think through how to respond to DNA barcoding. The three responses to barcoding that I came up with are:
  1. By comparison to barcoding, classical taxonomy is digitally almost invisible, with great chunks of the literature still not scanned or accessible. So, one response is to try and get the core data of taxonomy digitised and linked as fast as possible. This is why I built BioStor and BioNames, and why I continually rant about the state of efforts to digitise taxonomy.
  2. This is essentially President Obama's "bucket" approach: maybe the sane thing to do is see barcoding as the future and envisage what we could do in a sequence-only world. This is not to argue that we should ignore the taxonomic literature as such, but rather let's start with sequences first and see what we can do. This is the motivation for my Displaying a million DNA barcodes on Google Maps using CouchDB, and my experiments with Visualising Geophylogenies in Web Maps Using GeoJSON. These barely scratch the surface of what can be done.
  3. The third approach is to explore how we integrate taxonomy and barcoding at global scale, in which case linking at specimen level (rather than, say, using taxonomic names) seems promising, albeit requiring a massive effort to reconcile multiple specimen identifiers.

Summary

Yes, the barcoding conference was that rare thing, a well organised (including well-fed), interesting, indeed eye-opening, conference.

Wednesday, December 17, 2014

Is DNA barcoding dead?

On a recent trip to the Natural History Museum, London, the subject of DNA barcoding came up, and I got the clear impression that people at the NHM thought classical DNA barcoding was pretty much irrelevant, given recent developments in sequencing technology. For example, why sequence just COI when you can use shotgun sequencing to get the whole mitogenome? I was a little taken aback, although this is a view that's getting some traction, e.g. [1,2]. There is also the more radical view that focussing on phylogenetics is itself less useful than, say, "evolutionary gene networks" based on massive sequencing of multiple markers [3].

At the risk of seeming old-fashioned in liking DNA barcoding, I think there's a bigger issue at stake (see also [4]). DNA barcoding isn't simply a case of using a single, short marker to identify animal species. It's the fact that it's a globalised, standardised approach that makes it so powerful. In the wonderful book "A Vast Machine" [5], Paul Edwards talks about "global data" and "making data global". The idea is that not only do we want data that is global in coverage ("global data"), but we want data that can be integrated ("making data global"). In other words, not only do we want data from everywhere in the world, say, we also need an agreed coordinate system (e.g., latitude and longitude) in order to put each data item in a global context. DNA barcoding makes data global by standardising what a barcode is (a given fragment of COI), and what metadata needs to be associated with a sequence to be a barcode (e.g., latitude and longitude) (see, e.g. Guest post: response to "Putting GenBank Data on the Map"). By insisting on this standardisation, we potentially sacrifice the kinds of cool things that can be done with metagenomics, but the tradeoff is that we can do things like put a million barcodes on a map:

Bold
To regard barcoding as dead or outdated we'd need an equivalent effort to make metagenomic sequences of animals global in the same way that DNA barcoding is. Now, it may well be that the economics of sequencing is such that it is just as cheap to shotgun sequence mitogenomes, say, as to extract single markers such as COI. If that's the case, and we can get a standardised suite of markers across all taxa, and we can do this across museum collections (like Hebert et al.'s [6] DNA barcoding "blitz" of 41,650 specimens in a butterfly collection), then I'm all for it. But it's not clear to me that this is the case.

This also leaves aside the issue of standardising other things such as the metadata. For instance, Dowton et al. [2] state that "recent developments make a barcoding approach that utilizes a single locus outdated" (see Collins and Cruickshank [4] for a response). Dowton et al. make use of data they published earlier [7,8]. Out of curiosity I looked at some of these sequences in GenBank, such as JN964715. This is a COI sequence, in other words, a classical DNA barcode. Unfortunately, it lacks a latitude and longitude. By leaving off latitude and longitude (despite the authors having this information, as it is in the supplemental material for [7]) the authors have missed an opportunity to make their data global.

For me the take home message here is that whether you think DNA barcoding is outdated depends in part on what your goal is. Clearly barcoding as a sequencing technology has been superseded by more recent developments. But to dismiss it on those grounds is to miss the bigger picture of what is at stake, namely the chance to have comparable data for millions of samples across the globe.

References

  1. TAYLOR, H. R., & HARRIS, W. E. (2012, February 22). An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. Molecular Ecology Resources. Wiley-Blackwell. doi:10.1111/j.1755-0998.2012.03119.x
  2. Dowton, M., Meiklejohn, K., Cameron, S. L., & Wallman, J. (2014, March 28). A Preliminary Framework for DNA Barcoding, Incorporating the Multispecies Coalescent. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu028
  3. Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biol Direct. Springer Science + Business Media. doi:10.1186/1745-6150-5-47
  4. Collins, R. A., & Cruickshank, R. H. (2014, August 12). Known Knowns, Known Unknowns, Unknown Unknowns and Unknown Knowns in DNA Barcoding: A Comment on Dowton et al. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu060
  5. Edwards, Paul N. A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. MIT Press ISBN: 9780262013925
  6. Hebert, P. D. N., deWaard, J. R., Zakharov, E. V., Prosser, S. W. J., Sones, J. E., McKeown, J. T. A., Mantle, B., et al. (2013, July 10). A DNA “Barcode Blitz”: Rapid Digitization and Sequencing of a Natural History Collection. (S.-O. Kolokotronis, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0068535
  7. Meiklejohn, K. A., Wallman, J. F., Pape, T., Cameron, S. L., & Dowton, M. (2013, October). Utility of COI, CAD and morphological data for resolving relationships within the genus Sarcophaga (sensu lato) (Diptera: Sarcophagidae): A preliminary study. Molecular Phylogenetics and Evolution. Elsevier BV. doi:10.1016/j.ympev.2013.04.034
  8. Meiklejohn, K. A., Wallman, J. F., Cameron, S. L., & Dowton, M. (2012). Comprehensive evaluation of DNA barcoding for the molecular species identification of forensically important Australian Sarcophagidae (Diptera). Invertebrate Systematics. CSIRO Publishing. doi:10.1071/is12008

Friday, August 15, 2014

Some design notes on modelling links between specimens and other kinds of data

If we view biodiversity data as part of the "biodiversity knowledge graph" then specimens are a fairly central feature of that graph. I'm looking at ways to link specimens to sequences, taxa, publications, etc., and doing this across multiple data providers. Here are some rough notes on trying to model this in a simple way.

For simplicity let's suppose that we have this basic model:

Core

A specimen comes from a locality (ideally we have the latitude and longitude of that locality), it is assigned to a taxon, we have data derived from that specimen (e.g., one or more DNA sequences), and we have one or more publications about that specimen (e.g., a paper that publishes a taxon name for which the specimen is a type, or a paper that publishes a sequence for which the specimen is a voucher).

Ncbi

NCBI


In GenBank we have sequences that have accession numbers, and these are linked to taxa (identified by NCBI tax ids). A nice feature of sequence databases is that taxa are explicitly defined by extension, that is, a taxon is the set of sequences assigned to a given taxon. Most (but not all, see Miller et al. doi:10.1186/1756-0500-2-101) sequences are also linked to a publication, which will usually have a PubMed id (PMID), and sometimes a DOI. Many sequences are also georeferenced (see Guest post: response to "Putting GenBank Data on the Map"). Most sequences aren't linked to a voucher specimen, but there is the implicit notion of a source (in RDF-speak, many specimens are "blank nodes"; see Blank nodes for specimens without URI). Some sequences are associated with a specimen that has a museum code, and some are explicitly linked to the specimen by a URL.

Barcodes

DNA barcodes


Barcodes, as represented in BOLD, are similar to sequences in GenBank. We have explicit taxa ("BINs") each of which has a URL, some also having DOIs. Most barcodes are georeferenced. There's some ambiguity about whether the URL for a barcode record identifies the barcode sequence, the specimen, or both. There may be a voucher code for the specimen. Some barcodes are linked to publications, but not (as far as I can see) in the data obtained from the API. Some barcodes are linked to the corresponding record in GenBank (which may or may not be suppressed, see Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)).

Gbif

GBIF


At its core GBIF has occurrence records (many of these are specimen-based, but the majority of data in GBIF is actually observation-based), each of which has a unique id, and which is linked to a taxon, also with a unique id. As with the sequence databases, a taxon is a set of occurrences that have been assigned to that taxon. Many records in GBIF are georeferenced. There are limited cross links to other databases: some occurrences list associated GenBank sequences. Some GBIF occurrences actually are sequences (e.g., the European Molecular Biology Laboratory Australian Mirror and the soon to be indexed Geographically tagged INSDC sequences), and barcodes are also making their way into GBIF (e.g., Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data). Links to publications are limited.

Museum

Museums and herbaria


Some individual natural history collections which are online provide specimen-level web pages and URLs (some even have DOIs, see DOIs for specimens are here, but we're not quite there yet), and some museums list associated GenBank sequences. In the diagram I've not linked the specimens to a taxon, because most specimens are tagged by a name, not an explicit taxon concept (unlike NCBI, BOLD, or GBIF).

Literature

Literature


Literature databases (represented here by BioStor, but could be other sources such as ZooKeys, for example) may contain articles that mention specimen codes. These articles may also mention taxon names, and geographic localities (including coordinates) (see, for example, Linking GBIF and the Biodiversity Heritage Library). Mining text for names, specimens, and localities is fairly easy, but linking these together is harder (i.e., this specimen is of this taxon, and was found at this locality).

Linking together


If we have these separate sources and this trivial model, then we can imagine trying to tie information about the same specimen together across the different databases. Why might we want to do this? Here are three reasons:

  1. Augmentation: Combining information can enhance our understanding of a specimen. Perhaps a specimen in GBIF is a geographic outlier. A publication that mentions the specimen includes it in a new taxon, perhaps discovered by sequencing DNA extracted from that specimen. Linking this information together resolves the problematic distribution.
  2. Provenance: What is the evidence that a particular specimen belongs to a particular taxon, or was collected at a particular locality? If we connect specimens to the literature we can review the evidence for ourselves. If we have sequences we can run BLAST, build a tree, and see if we should rethink our classification of that sequence. Imagine being able to browse GBIF and see the evidence for each dot on the map?
  3. Citation: Mentions in the literature, use as vouchers for DNA barcoding or other forms of sequencing can be thought of as a "citation" of that specimen. Museums hosting that material could use metrics based on this to demonstrate the value of their collection (see also The impact of museum collections: one collection ≈ one Nobel Prize).
Model

Making the links


All this is well and good, but the trick is to actually make the links. Here things get horribly messy very quickly. Museum specimens are cited in inconsistent ways, we don't have widely used unique, resolvable specimen identifiers, and even if we did have these identifiers we don't have a global discovery mechanism for matching voucher codes to identifiers. GBIF would be an obvious part of a "global discovery mechanism" (a bit like CrossRef but for specimens), but GBIF can have multiple records for the same specimen. Sometimes this is because GBIF not only aggregates data from primary sources (such as museums) but also other aggregations which may themselves already include specimens harvested from primary sources. GBIF can also have multiple records because museums keep messing with their databases, try new variants of the Darwin Core triple, etc., resulting in records that look "new" to GBIF. Whole collections can be duplicated in this way.

One way to tackle this multiplicity of specimen records is to think in terms of "clusters" of specimens that are, in some sense, the same thing across multiple databases. For example, clustering a set of duplicated GBIF records together with the sequences derived from those specimens, perhaps including a DNA barcode, and a list of papers that mention that specimen. This is represented by the yellow bar through the diagram, it connects all the different pieces of information about a specimen into a single cluster. More *cough* later.
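As a very rough sketch of what building such clusters might look like, the following groups records that share a specimen code, using a union-find structure. Everything here is an assumption for illustration: the record shape ({id, codes}), the identifiers, and especially the normalisation rule, which simply strips case and punctuation and would over-merge or under-merge on real data.

```javascript
// Crude normalisation (an assumption): ignore case, spaces and punctuation.
function normaliseCode(code) {
  return code.toUpperCase().replace(/[^A-Z0-9]/g, "");
}

// Group records that share at least one normalised specimen code.
// records is an array of { id, codes: [strings] }; returns arrays of ids.
function clusterRecords(records) {
  const parent = new Map();
  const find = (x) => {
    while (parent.get(x) !== x) {
      parent.set(x, parent.get(parent.get(x))); // path halving
      x = parent.get(x);
    }
    return x;
  };
  const union = (a, b) => {
    parent.set(find(a), find(b));
  };

  for (const record of records) parent.set(record.id, record.id);

  const firstSeen = new Map(); // normalised code -> first record id using it
  for (const record of records) {
    for (const code of record.codes) {
      const key = normaliseCode(code);
      if (firstSeen.has(key)) union(record.id, firstSeen.get(key));
      else firstSeen.set(key, record.id);
    }
  }

  const clusters = new Map();
  for (const record of records) {
    const root = find(record.id);
    if (!clusters.has(root)) clusters.set(root, []);
    clusters.get(root).push(record.id);
  }
  return Array.from(clusters.values());
}

// Invented records: two occurrence records and a sequence citing one specimen.
const exampleRecords = [
  { id: "occurrence:1", codes: ["XYZ 12345"] },
  { id: "occurrence:2", codes: ["xyz-12345"] },
  { id: "sequence:1", codes: ["XYZ12345"] },
  { id: "occurrence:3", codes: ["XYZ 99999"] },
];
console.log(clusterRecords(exampleRecords));
```

Union-find is used so that links found in any order (record to record, sequence to record, paper to record) end up in the same cluster without having to rebuild anything.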

Monday, March 10, 2014

Displaying a million DNA barcodes on Google Maps using CouchDB

Following on from the previous post on putting GBIF data onto Google Maps, I'm now going to put DNA barcodes onto Google Maps. You can see the result at http://iphylo.org/~rpage/bold-map/, which displays around 1.2 million barcodes obtained from the International Barcode of Life Project (iBOL) releases. Let me describe how I made it.

Tiles


Typically when people put markers on a Google Map it is done in JavaScript using the Google Maps API, and all the work is done by the browser. This works well if you haven't got too many points, but once you have more than, say, a few hundred, the browser struggles to cope. Hence, for something like GBIF or DNA barcodes where we have millions of records we need a different approach. In the GBIF data example I discussed previously I used tiles supplied by GBIF. Tiles are the basis of the "slippy maps" used by Google and others to create the experience of being able to zoom in and out at any point on the map. At any time, the map you see in the web browser is made up of a small number of images ("tiles") that are typically 256 × 256 pixels in size. At the lowest zoom level (0) the world is represented by a single tile:

WorldTile
If you zoom in to the next level (1), the world now covers 4^1 = 4 tiles; zoom in again and the world now covers 4^2 = 16 tiles, and so on.

TileCoordinates
At each zoom level the tiles cover a smaller part of the world, and have increasing detail, so the user has the experience of zooming in closer and closer to an ever larger and more detailed map. But the browser only ever has to display enough 256 × 256 pixel tiles to fill the browser window.

Not only can we have tiles for the map of the world, we can also have tiles for data that we want to display on that map. For example, GBIF can create tiles for the hundreds of millions of occurrences in its database, so what looks like a giant map of millions of records is actually just a set of 256 x 256 tiles coloured to represent the number of records at each position on the tile.

DNA Barcodes


I wanted to make a map for DNA barcodes, but unfortunately there aren't any tiles I can use to create the map. So, I set about making my own. Here's what I did. Firstly, I downloaded the DNA barcode data from the BOLD site, and put the barcodes into a CouchDB database hosted by Cloudant. I simply parsed the tab-delimited data files and created a JSON document for each barcode, and stored that in CouchDB.

I then created a view in CouchDB that generated data for each tile. Each zoom level has its own tiles (for zoom level n there are 4^n tiles). There's a nice web page on the OpenStreetMap wiki that describes how to compute slippy map tilenames. Here's the code I use to generate the CouchDB view:


function(doc) {
  var tile_size = 256; // tiles are 256 x 256 pixels
  var pixels = 4;      // marker size; positions snap to a 4-pixel grid

  if (doc.lat && doc.lon) {
    var lon = parseFloat(doc.lon);
    var lat_rad = parseFloat(doc.lat) * Math.PI / 180;

    // Emit one key per supported zoom level (0 to 6)
    for (var zoom = 0; zoom < 7; zoom++) {
      var n = Math.pow(2, zoom); // number of tiles along each axis

      // Tile column, and pixel offset within that tile
      var x_pos = (lon + 180) / 360 * n;
      var x = Math.floor(x_pos);
      var relative_x = Math.round(tile_size * (x_pos - x));

      // Tile row (Mercator projection), and pixel offset within that tile
      var y_pos = (1 - Math.log(Math.tan(lat_rad) + 1 / Math.cos(lat_rad)) / Math.PI) / 2 * n;
      var y = Math.floor(y_pos);
      var relative_y = Math.round(tile_size * (y_pos - y));

      // Snap the offsets down to a multiple of the marker size
      relative_x = Math.floor(relative_x / pixels) * pixels;
      relative_y = Math.floor(relative_y / pixels) * pixels;

      // Key is [zoom, x, y, relative_x, relative_y]
      emit([zoom, x, y, relative_x, relative_y], 1);
    }
  }
}

For each zoom level that I support (0 to 6) I convert the latitude and longitude of the DNA barcode sample into the coordinates of the corresponding 256 × 256 tile. I then compute the position of the sample within that tile. This is rounded to the nearest 4 pixels, which is the size of marker I've chosen to display. As an example, the barcode AMSF292-10.COI-5P has location latitude -77.8064, longitude 177.174, which for a zoom level of 6 places it in tile [63,54].

To display the marker I also need to know where in the 256 × 256 tile I should draw the marker. Coordinates in tiles start at the top left corner:

PixelCoordinates
For the example above, the marker would be at x = 124, y = 200. This means that this barcode would have a key of [6, 63, 54, 124, 200] in the CouchDB view computed above. If we ignore the position within the tile, then this barcode belongs on tile [6, 63, 54].
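To check the worked example, the same arithmetic can be written as a standalone function outside CouchDB. This is just a restatement of the view code for a single zoom level; the function name tileKey is made up for this sketch.

```javascript
// Compute the view key [zoom, x, y, relative_x, relative_y] for one point.
function tileKey(lat, lon, zoom) {
  const tileSize = 256; // tile width and height in pixels
  const pixels = 4;     // marker size; offsets snap to a 4-pixel grid
  const n = Math.pow(2, zoom);
  const latRad = (lat * Math.PI) / 180;

  // Fractional tile coordinates (slippy map convention)
  const xPos = ((lon + 180) / 360) * n;
  const yPos =
    ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n;

  const x = Math.floor(xPos);
  const y = Math.floor(yPos);

  // Round to a whole pixel, then snap down to a multiple of the marker size
  const snap = (offset) =>
    Math.floor(Math.round(tileSize * offset) / pixels) * pixels;

  return [zoom, x, y, snap(xPos - x), snap(yPos - y)];
}

// The barcode AMSF292-10.COI-5P from the text
console.log(tileKey(-77.8064, 177.174, 6)); // [ 6, 63, 54, 124, 200 ]
```

At zoom level 0 the same point lands on the single world tile, so the first three elements of the key are [0, 0, 0].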

To display the barcodes I added a layer to the Google Map. Whenever a map tile is drawn by Google Maps, it calls a simple web service I created, and sends to that service the zoom level and tile coordinates for the tile it wants to draw. I then look up that key [zoom, x, y] in CouchDB, and return all the points within that tile as a 256 × 256 SVG image. Google Maps then draws that over its own tile, and as a result the user sees both the Google Map and the location of the barcodes. To keep things manageable I only generate tiles down to zoom level 6. After that, the barcodes simply disappear.
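The tile service could look something like the following sketch. It is not the actual service: it assumes the view has a reduce function (for example the built-in _sum) so that group=true returns one row per distinct key, and the function names and SVG styling are invented. The range query relies on CouchDB's view collation, where an object sorts after any number, so [zoom, x, y, {}] acts as an upper bound for all keys on that tile.

```javascript
// Build the query string for all markers on one tile (assumes a reduce function).
function tileQuery(zoom, x, y) {
  const startkey = JSON.stringify([zoom, x, y]);
  const endkey = JSON.stringify([zoom, x, y, {}]); // {} sorts after numbers
  return (
    "startkey=" + encodeURIComponent(startkey) +
    "&endkey=" + encodeURIComponent(endkey) +
    "&group=true"
  );
}

// Turn view rows ({ key: [zoom, x, y, rx, ry], value: count }) into an SVG tile.
function renderTile(rows) {
  const size = 256;
  const marker = 4;
  const rects = rows.map((row) => {
    const rx = row.key[3];
    const ry = row.key[4];
    return `<rect x="${rx}" y="${ry}" width="${marker}" height="${marker}" />`;
  });
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${size}" height="${size}">` +
    rects.join("") +
    "</svg>"
  );
}

// Example with one invented row for the tile discussed above
console.log(tileQuery(6, 63, 54));
console.log(renderTile([{ key: [6, 63, 54, 124, 200], value: 1 }]));
```

Because the pixel offsets are part of the key, the rows come back already grouped per marker position, and the value could be used to colour markers by the number of barcodes at that spot.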

So, with some fairly trivial coding, I've created a map tile server in CouchDB that displays over a million barcodes in Google Maps.

Barcodebig

Hit testing


Of course, a map by itself is a bit boring. What you want to do is be able to click on a point and get some more information. If you are using the Google Maps API to add markers, then it's pretty easy to handle user clicks and do something with them. But because I'm using tiles I can't use that approach. Instead what I do is capture any clicks on the map, convert that click to tile coordinates, then query CouchDB to see if any barcodes fall within that location. If so, I display them down the right side of the map. It's a bit finicky about where you click on the map, but seems to work. It would be fun to extend this approach to the GBIF map, so that you could click on a point and see what GBIF says is there.
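The click-to-key conversion can be sketched as below. This assumes the click has already been converted to whole-map pixel coordinates at the current zoom level, measured from the top-left corner of the world (so each axis runs from 0 to 256 × 2^zoom); how that conversion is obtained from the map API is not shown, and the function name is invented.

```javascript
// Convert a click, given in whole-map pixel coordinates at this zoom level,
// into the view key [zoom, x, y, relative_x, relative_y].
function clickToKey(worldX, worldY, zoom) {
  const tileSize = 256;
  const pixels = 4; // same marker grid as the view

  const x = Math.floor(worldX / tileSize);
  const y = Math.floor(worldY / tileSize);

  // Offset inside the tile, snapped down to the 4-pixel marker grid
  const snap = (value, tile) =>
    Math.floor((value - tile * tileSize) / pixels) * pixels;

  return [zoom, x, y, snap(worldX, x), snap(worldY, y)];
}

// A click roughly on the example barcode at zoom level 6
console.log(clickToKey(16255, 14024, 6)); // [ 6, 63, 54, 124, 200 ]
```

A lookup for exactly this key only succeeds if the click lands in the same 4 × 4 pixel cell as the marker, which may be part of why clicking feels finicky; also querying the neighbouring cells would be one way to make it more forgiving.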

Summary


This is all a bit crude, but as far as I'm aware this is the only interactive map of all DNA barcodes (at least, the publicly available animal barcodes). There's a lot more that could be done with this, but for now it's functional. Take it for a spin at http://iphylo.org/~rpage/bold-map/, I'd welcome any comments. If you are curious about the technical details, the source code is on GitHub at https://github.com/rdmpage/bold-map.

Thursday, December 12, 2013

Guest post: response to "Putting GenBank Data on the Map"

The following is a guest blog post by David Schindel and colleagues and is a response to the paper by Antonio Marques et al. in Science doi:10.1126/science.341.6152.1341-a.

Marques, Maronna and Collins (1) rightly call on the biodiversity research community to include latitude/longitude data in database and published records of natural history specimens. However, they have overlooked an important signal that the community is moving in the right direction. The Consortium for the Barcode of Life (CBOL) developed a data standard for DNA barcoding (2) that was approved and implemented in 2005 by the International Nucleotide Sequence Database Collaboration (INSDC; GenBank, ENA and DDBJ) and revised in 2009. All data records that meet the requirements of the data standard include the reserved keyword 'BARCODE'. The required elements include: (a) information about the voucher specimen from which the DNA barcode sequence was derived (e.g., species name, unique identifier in a specimen repository, country/ocean of origin); (b) a sequence from an approved gene region with minimum length and quality; and (c) primer sequences and the forward and reverse trace files. Participants in the workshops that developed the data standard decided to include latitude and longitude as strongly recommended elements but not as strict requirements for two reasons. First, many voucher specimens from which BARCODE records are generated may have been collected before GPS devices were available. Second, barcoding projects such as the Barcode of Wildlife Project (4) are concentrating on rare and endangered species. Publishing the GPS coordinates of collecting localities would facilitate illegal collecting and trafficking that could contribute to biodiversity loss.

The BARCODE data standard is promoting precisely the trend toward georeferencing called for by Marques, Maronna and Collins. Table 1 shows that there are currently 346,994 BARCODE records in INSDC (3). Of these BARCODE records, 83% include latitude/longitude data. Despite not being a required element in the data standard, this level of georeferencing is much higher than for all cytochrome c oxidase I gene (COI), the BARCODE region, 16S rRNA, and cytochrome b (cytb), another mitochondrial region that was used for species identification prior to the growth of barcoding. Data are also presented on the numbers and percentages of data records that include information on the voucher specimen from which the nucleotide sequence was obtained. In an increasing number of cases, these voucher specimen identifiers in INSDC are hyperlinked to the online specimen data records in museums, herbaria and other biorepositories. Table 2 provides these same data for the time interval used in the Marques et al. letter (1). These tables indicate the clear effect that the BARCODE data standard is having on the community’s willingness to provide more complete data documentation.

Table 1. Summary of metadata for GPS coordinates and voucher specimens associated with all data records.

Categories of data records   Total number of GenBank records   With Latitude/Longitude   With Voucher or Culture Collection Specimen IDs
BARCODE                      347,349                           286,975 (83%)             347,077 (~100%)
All COI                      751,955                           365,949 (49%)             531,428 (71%)
All 16S                      4,876,284                         461,030 (9%)              138,921 (3%)
All cytb                     239,796                           7,776 (3%)                84,784 (35%)
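
The percentages in Table 1 can be checked directly from the counts. The snippet below is only a quick sanity check of that arithmetic, not part of the letter; the counts are copied from the table, while the dictionary layout and the `percent` helper are my own assumptions for illustration.

```python
# Counts copied from Table 1 of the letter:
# category -> (total records, with lat/long, with voucher or culture collection ID)
TABLE_1 = {
    "BARCODE": (347_349, 286_975, 347_077),
    "All COI": (751_955, 365_949, 531_428),
    "All 16S": (4_876_284, 461_030, 138_921),
    "All cytb": (239_796, 7_776, 84_784),
}


def percent(part: int, total: int) -> int:
    """Return part/total as a percentage rounded to the nearest integer."""
    return round(100 * part / total)


if __name__ == "__main__":
    for category, (total, georeferenced, vouchered) in TABLE_1.items():
        print(
            f"{category}: {percent(georeferenced, total)}% with lat/long, "
            f"{percent(vouchered, total)}% with voucher IDs"
        )
```

Rounded to whole numbers, these reproduce the percentages shown in the table (the BARCODE voucher figure is 99.9%, hence the "~100%").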

Table 2. Summary of metadata for GPS coordinates and voucher specimens associated with data records submitted between 1 July 2011 and 15 June 2013.

Categories of data records   Total number of GenBank records   With Latitude/Longitude   With Voucher or Culture Collection Specimen IDs
BARCODE                      160,615                           132,192 (82%)             160,615 (100%)
All COI                      302,507                           166,967 (55%)             231,462 (77%)
All 16S                      1,535,364                         232,567 (15%)             49,150 (3%)
All cytb                     74,631                            2,920 (4%)                24,386 (33%)


The DNA barcoding community's data standard is demonstrating two positive trends: better documentation of specimens in natural history collections, and new connectivity between databases of species occurrences and DNA sequences. We believe that these trends will become standard practices in the coming years as more researchers, funders, publishers and reviewers acknowledge the value of, and begin to enforce compliance with, the BARCODE data standard and related minimum information standards for marker genes (5).

DAVID E. SCHINDEL1, MICHAEL TRIZNA1, SCOTT E. MILLER1, ROBERT HANNER2, PAUL D. N. HEBERT2, SCOTT FEDERHEN3, ILENE MIZRACHI3
  1. National Museum of Natural History, Smithsonian Institution, Washington, DC 20013–7012, USA.
  2. University of Guelph, Ontario, Canada
  3. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

References

  1. Marques, A. C., Maronna, M. M., & Collins, A. G. (2013). Putting GenBank Data on the Map. Science, 341(6152), 1341–1341. doi:10.1126/science.341.6152.1341-a
  2. Consortium for the Barcode of Life, http://www.barcodeoflife.org/sites/default/files/DWG_data_standards-Final.pdf (2009)
  3. Data in Tables 1 and 2 were drawn from GenBank (http://www.ncbi.nlm.nih.gov/genbank/) [data as of 1 October 2013]
  4. Barcode of Wildlife Project, http://www.barcodeofwildlife.org (2013)
  5. Yilmaz, P., Kottmann, R., Field, D., Knight, R., Cole, J. R., Amaral-Zettler, L., Gilbert, J. A., et al. (2011). Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology, 29(5), 415–420. doi:10.1038/nbt.1823

Thursday, July 11, 2013

Barcode Index Number (BIN) System in DNA barcoding explained

Quick note to highlight the following publication:
Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.)PLoS ONE, 8(7), e66213. doi:10.1371/journal.pone.0066213
This paper outlines the methods used by the BOLD project to cluster sequences into "BINs", and touches on the issue of dark taxa (taxa that are in GenBank but which lack formal scientific names). Might be time to revisit the dark taxa idea, especially now that I've got a better handle on the taxonomic literature (see BioNames) where the names of at least some dark taxa may lurk.

Sunday, April 07, 2013

DNA QR Codes



Came across this paper recently:

Liu, C., Shi, L., Xu, X., Li, H., Xing, H., Liang, D., Jiang, K., et al. (2012). DNA Barcode Goes Two-Dimensions: DNA QR Code Web Server. (R. DeSalle, Ed.)PLoS ONE, 7(5), e35146. doi:10.1371/journal.pone.0035146

Despite QR Codes being uncool, there's something appealing about the idea of compressing a DNA barcode sequence into a small image. Imagine having a specimen label with a QR Code, pointing a smart phone at the label using an app that converts the QR Code to a sequence, sends it to BLAST and returns a phylogeny that includes DNA from that specimen (perhaps using a service like http://iphylo.org/~rpage/phyloinformatics/blast).
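
To get a feel for the "compressing" part, here is a minimal sketch of one way a barcode sequence could be squeezed down before being put into a QR Code. This is not the encoding used by Liu et al.; it is plain two-bits-per-base packing that I'm assuming for illustration, and the function names, the 2-byte length prefix, and the Base64 step are all hypothetical choices. It handles only unambiguous bases (A, C, G, T) and does not draw the QR Code itself.

```python
import base64

# Two bits per base. This particular mapping is an assumption for illustration.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_dna(seq: str) -> bytes:
    """Pack an A/C/G/T sequence into bytes, four bases per byte.

    A 2-byte big-endian length prefix records the number of bases so that
    padding in the final byte can be discarded when unpacking.
    """
    seq = seq.upper()
    if len(seq) > 0xFFFF:
        raise ValueError("sequence too long for a 2-byte length prefix")
    out = bytearray(len(seq).to_bytes(2, "big"))
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            if base not in BASE_TO_BITS:
                raise ValueError(f"unsupported base: {base!r}")
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_dna(data: bytes) -> str:
    """Reverse pack_dna: recover the original sequence from packed bytes."""
    n_bases = int.from_bytes(data[:2], "big")
    bases = []
    for byte in data[2:]:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])


def to_qr_payload(seq: str) -> str:
    """Base64 text that could be handed to any QR Code generator."""
    return base64.b64encode(pack_dna(seq)).decode("ascii")


def from_qr_payload(payload: str) -> str:
    """Decode the text read back from a QR Code into a DNA sequence."""
    return unpack_dna(base64.b64decode(payload))


if __name__ == "__main__":
    example = "ACGTACGTAC"  # made-up sequence, not a real barcode
    payload = to_qr_payload(example)
    print(payload, from_qr_payload(payload) == example)
```

With this scheme a 650-base sequence packs into 165 bytes (163 bytes of bases plus the 2-byte length), about a quarter of its size as plain text. An app on the phone would then only need to reverse the step with `from_qr_payload` before sending the sequence to BLAST.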

Tuesday, April 24, 2012

Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)

Dark taxa have become even darker. NCBI has pulled the plug on large numbers of DNA barcode sequences that lack scientific names. For example, taxon Cyclopoida sp. BOLD:AAG9771 (tax_id 818059) now has a sparse page that has no associated sequences. From an earlier download of EMBL I know that this taxon is associated with at least 5 sequences, such as GU679674. But if you go to that sequence you get this:

[Screenshot: GenBank page for GU679674, marked as obsolete]

So the sequence is hidden. You can retrieve it by clicking on the obsolete version link, but by default it is hidden.

It's an extraordinary state of affairs that a huge slice of fundamental biodiversity data has been effectively "pulled" from view.

Update: Sujeevan Ratnasingham from iBOL has pointed out that the sequence I'd used above (GU679674) was not one of the ones hidden by NCBI; rather, it was suppressed at the request of the investigator (which I'd have realised if I'd paid more attention to the screenshot). HQ918317 is an example of a BOLD record that was suppressed:

[Screenshot: GenBank record HQ918317]