Thursday, June 05, 2025

A metabarcoding mess and the importance of just looking at the data

How to cite: Page, R. (2025). A metabarcoding mess and the importance of just looking at the data. https://doi.org/10.59350/q2v8n-wc488

Here I summarise a few posts on Bluesky where I raised concerns about some metabarcoding datasets that were highlighted by GBIF:

>3.4 million insect records based on DNA metabarcoding of bulk samples from #Sweden and #Madagascar have been mobilized to GBIF thanks to collaborative efforts of research institutions led by the #NaturhistoriskaRiksmuseet

Looking at these datasets it’s clear that something is wrong.

Data

The datasets discussed are for CO1 Amplicon Sequence Variants from Madagascar, which are part of the Insect Biome Atlas project. The data is described in Miraldo et al. https://doi.org/10.1038/s41597-025-05151-0. There are two datasets for Madagascar:

  • CO1 Amplicon Sequence Variants of leaf litter arthropod communities collected at Malaise traps from the Insect Biome Atlas project in Madagascar https://doi.org/10.15468/pad7pc
  • CO1 Amplicon Sequence Variants of bulk arthropod samples (mild lysis) collected with Malaise traps from the Insect Biome Atlas project in Madagascar https://doi.org/10.15468/6u5rum

In case the data changes in the future I’ve made snapshots of the two datasets and uploaded them to Zenodo doi:10.5281/zenodo.15599342. The files I downloaded (https://doi.org/10.15468/dl.kwjyjt and https://doi.org/10.15468/dl.2p3z5q) are the GBIF annotated archives, hence they include the mapping between the taxonomic names and GBIF’s backbone taxonomy.

Problem

In browsing the data on GBIF I noticed some striking distribution patterns: insects normally found in Europe and/or North America were also turning up in Madagascar, based solely on these metabarcoding datasets. For example, Helina impuncta.

[Figure: GBIF map of Helina impuncta]

Metabarcoding data can be a complicated beast, especially if you try to navigate the multiple databases that house metadata on the sampling program and the output of sequencing machines. For example, GBIF occurrence 5162479277 is linked to ENA record ERR12944764, which in turn has multiple identifier links:

| Study Accession | Sample Accession | Experiment Accession | Run Accession | Tax Id |
| --- | --- | --- | --- | --- |
| PRJEB61109 | SAMEA115499645 | ERX12317105 | ERR12944764 | 1234904 |

What’s nice about the GBIF datasets is that they wrap all this up into a single package that we can explore. BLASTing a few sequences in these datasets suggests that the identifications of these sequences were probably correct, so the source of the problematic maps lies elsewhere.

Lots of maps

I wrote a simple PHP script to read the GBIF dataset, aggregate the GBIF taxon ids (i.e., the GBIF taxa that the sequences were mapped to), and draw a map for each taxon (code is on GitHub). These maps use GBIF’s maps API to retrieve a tile (256 x 256 pixels) showing the distribution of each taxon on a global map (i.e., zoom level 0 on a tiled web map). I overlay that on a GBIF base map tile (see Base Map Tiles), and dump the output as HTML.
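The original script is in PHP; what follows is only a minimal Python sketch of the same idea, not the code on GitHub. It assumes the GBIF download contains a tab-delimited occurrence file with a `taxonKey` column, and that the two tile URL templates below match the current GBIF maps API and base map tile services (both should be checked against GBIF's documentation). The function names are hypothetical.

```python
import csv
from collections import Counter
from html import escape
from urllib.parse import urlencode

# Assumed URL templates for zoom level 0 (a single world tile).
# Verify both against the GBIF maps API documentation before relying on them.
BASE_TILE = "https://tile.gbif.org/3857/omt/0/0/0@1x.png?style=gbif-classic"
DENSITY_TILE = "https://api.gbif.org/v2/map/occurrence/density/0/0/0@1x.png"


def count_taxa(tsv_file, key_column="taxonKey"):
    """Count occurrence rows per GBIF taxon key in a tab-delimited file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        key = (row.get(key_column) or "").strip()
        if key:  # skip rows that were not matched to a taxon
            counts[key] += 1
    return counts


def density_tile_url(taxon_key):
    """Build the URL of the world-level occurrence tile for one taxon."""
    query = urlencode({"srs": "EPSG:3857", "taxonKey": taxon_key})
    return f"{DENSITY_TILE}?{query}"


def render_html(counts):
    """Return an HTML page with one small world map per taxon, most frequent first."""
    cells = []
    for key, n in counts.most_common():
        cells.append(
            '<div style="display:inline-block;position:relative;'
            'width:256px;height:256px">'
            # Base map underneath, occurrence layer stacked on top of it.
            f'<img src="{escape(BASE_TILE)}" width="256" style="position:absolute">'
            f'<img src="{escape(density_tile_url(key))}" width="256" '
            'style="position:absolute">'
            f'<span style="position:absolute">{escape(key)} ({n})</span>'
            "</div>"
        )
    return "<html><body>\n" + "\n".join(cells) + "\n</body></html>"
```

To use it, open the occurrence file from the archive in text mode, pass the file object to `count_taxa`, and write the string returned by `render_html` to an `.html` file. Nothing is fetched until a browser loads the page, so the script itself makes no API calls.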

This is crude but gives a quick visual overview of the data. For the litter dataset there are a lot of these Euro-Madagascar distributions:

[Figure: taxon maps for the litter dataset]

For the Malaise trap data the results look much more like what I’d expect: lots of taxa restricted to Madagascar.

[Figure: taxon maps for the Malaise trap dataset]

But there are still examples of the problematic pattern mentioned above.

What happened?

In the paper describing the data there is a paragraph discussing contamination:

>As part of data clean-up, it is usually advised to remove ASVs present in negative controls, or the maximum number of reads for those, from the entire dataset [71]. However, after careful inspection of our negative controls, we noticed that only a few ASVs were persistently showing up in control samples. The majority of ASVs seemed to be arthropod sequences that were present in the bulk samples, and also sporadically present in negative controls in relatively small numbers. This was presumably due to DNA spreading between samples through tiny droplets during sample processing, or to low-level of “index hopping”, leading to incorrect assignment of reads during sequencing, despite the use of double-unique indexes in library preparation [72].

The paper goes on to discuss possible examples of contamination. Looking at the results I suspect there has been a lot more contamination than the authors allow, especially for the litter dataset.
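To make the clean-up advice in the quoted paragraph concrete, here is a minimal Python sketch of one reading of it: for each ASV, take the maximum read count seen in any negative control and subtract it from every real sample. This is an illustration only, not the pipeline used in the paper; the function name and the nested-dictionary layout of the read counts are assumptions made for the example.

```python
def subtract_control_reads(counts, control_samples):
    """Subtract each ASV's maximum negative-control read count from all samples.

    counts: {asv_id: {sample_id: reads}} (hypothetical layout)
    control_samples: set of sample ids that are negative controls
    Returns the same structure without controls and without non-positive counts.
    """
    cleaned = {}
    for asv, per_sample in counts.items():
        # Highest read count this ASV reached in any negative control.
        ceiling = max((per_sample.get(c, 0) for c in control_samples), default=0)
        kept = {}
        for sample, reads in per_sample.items():
            if sample in control_samples:
                continue
            remaining = reads - ceiling
            if remaining > 0:
                kept[sample] = remaining
        if kept:  # drop ASVs left with no reads in any real sample
            cleaned[asv] = kept
    return cleaned
```

The stricter alternative mentioned in the quote, removing any ASV that appears in a negative control at all, would amount to dropping every ASV whose `ceiling` is greater than zero.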

Summary

These results are preliminary, and I’ve contacted the authors of the paper to see if we can find out what happened. But for me the most obvious conclusions are:

  • Metabarcoding has the potential to generate a lot of spurious records that may negatively impact databases such as GBIF.
  • One of the great features of GBIF is that it enables you to simply look at the data. In an age of automated pipelines and big data I think visualisation is increasingly important. It’s often an easy way to discover that something is not as it should be.

References

Miraldo, A., Sundh, J., Iwaszkiewicz-Eggebrecht, E. et al. Data of the Insect Biome Atlas: a metabarcoding survey of the terrestrial arthropods of Sweden and Madagascar. Sci Data 12, 835 (2025). https://doi.org/10.1038/s41597-025-05151-0

Page, R. (2025). Snapshot of Insect Biome Atlas data for Madagascar from GBIF [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15599342
