Tuesday, March 10, 2026

Using AI to understand a DNA barcoding mystery

As I continue to add features to BOLD View I keep coming across interesting cases where something seems not quite right with the data. This usually prompts further investigation, which usually means going down a rabbit hole. As an example, take barcode GMAEA6199-22, which was collected by Donald Hobern in Australia and is identified to order level as Strepsiptera (I thank Donald for this example).

This barcode falls within a cluster which contains sequences labelled as either strepsipteran (in many cases identified down to species, Elenchus varleyi) or hemipteran. Almost all the strepsipteran sequences are assigned to the BIN BOLD:ACH2898. The hemipteran sequences are not assigned to a BIN, even though they are essentially identical to the strepsipteran sequences.

A new feature I’ve added to BOLD View’s display for a barcode is a map of all similar sequences, grouped by BINs. For this example the Strepsiptera BIN has a wide geographic range, including Central America and South Africa, as well as Australia (through GMAEA6199-22, which is a strepsipteran but not included in the BIN).

So, we have a widespread BIN, likely to be Elenchus varleyi, which has identical sequences to those labelled as hemipteran. What is going on?

Now if you know anything about Strepsiptera you’ll know that they have a pretty bonkers biology, being endoparasites of insects, and males and females have quite different life histories. But as an exercise I decided to ask Claude what it made of this situation. Here is the prompt I used.

I have an interesting DNA barcoding case I want you to look at. I am uploading part of a phylogeny for DNA barcodes for a strepsipteran Elenchus varleyi. Note that the barcodes mix sequences from Elenchus and Hemiptera. The barcodes come from samples in Australia, South Africa, and Central America. I am uploading a Markdown file for a paper with the DOI 10.1071/it9890175. If you need to get information on species distributions then https://gbif.org is a good source. Given all this information, can you suggest explanations for why we have barcode sequences from Elenchus and Hemiptera, why only some sequences are assigned to a BIN, and why these barcodes are widely distributed.

I also uploaded the tree picture you see above, and the paper “Descriptions and biological notes of the Australian Elenchidae (Strepsiptera)” doi:10.1071/IT9890175 which described Elenchus varleyi (I got the DOI for the paper from another of my projects, BioNames).

The paper is behind a paywall, and isn't on SciHub, but is on [ResearchGate](https://www.researchgate.net/publication/248899086), so I downloaded the PDF and used Marker to convert it to Markdown, which has become the default format for sending documents to AIs.

Claude thought for a bit and then came back with a summary that basically said:

  • the mixture of strepsipteran and hemipteran sequences is likely due to the “hemipteran” barcodes actually being from strepsipteran parasites rather than the hemipteran hosts themselves.
  • given that one of the hosts of Elenchus varleyi is Sogatella kolophon, which is widespread, this would account for the occurrence of essentially identical DNA sequences across separate continents.

Claude noted that a single hemipteran can host multiple strepsipteran individuals, increasing the chance of mistakenly sequencing the parasite rather than the host. The picture below shows adult neotenic females (♀) of Elenchus varleyi in the abdomen of the hemipteran Sogatella kolophon (Fig. 14 from doi:10.1071/IT9890175).

Now, you and I could figure this out with some background biology and reading the paper, but the point here is that Claude could do it more or less automatically, which means we could potentially scale this to look at all “odd” or “anomalous” barcodes and get suggestions as to what has happened. Hence we have the possibility for large-scale database curation (see also “CurateGPT: A flexible language-model assisted biocuration tool” doi:10.48550/arXiv.2411.00046).

That said, we need some things in place. It helps to have links to information on the species and sequences involved, e.g. what paper(s) described the species, and what paper(s) published the sequences. In this case I used BioNames to find the DOI for the paper describing Elenchus varleyi, which was behind a paywall. Paywalls like this are likely to be an obstacle to this approach.

Out of curiosity I explored an alternative scenario, in part inspired by Plazi. Imagine that we can’t share the full text of the paper, but we do have access to the facts or claims the paper makes. Plazi doesn’t quite do this (it extracts images and the text of taxonomic treatments), but let’s imagine we could extract a series of triples from the paper, such as:

  • (Elenchus varleyi, parasitises, Sogatella kolophon)

Given a set of these we could potentially still infer the cause of the barcoding patterns we were seeing, but without needing access to the paywalled text (we’d need access to the text to extract the triples, but Claude wouldn’t need to see the text). Of course, this is not a new idea (Peter Murray-Rust has been advocating for this for a long time).
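As a concrete sketch, a fact base like this can be a simple list of tuples that supports the kind of inference described above. The predicate names and the claim that the host is cosmopolitan are illustrative here, not taken from any standard vocabulary:

```python
# Minimal sketch: representing extracted claims as triples and querying them,
# without access to the paywalled full text. Predicate names are illustrative.

triples = [
    ("Elenchus varleyi", "parasitises", "Sogatella kolophon"),
    ("Elenchus varleyi female", "is_obligate_endoparasite_of", "adult host"),
    ("Sogatella kolophon", "distribution", "cosmopolitan"),
]

def objects(subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# If the parasite's host is cosmopolitan, identical parasite sequences
# could plausibly turn up on separate continents.
hosts = objects("Elenchus varleyi", "parasitises")
widespread = [h for h in hosts if "cosmopolitan" in objects(h, "distribution")]
```

Even this naive representation is enough for the two queries that matter here: who does the parasite live inside, and how far does that host range.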

But what triples should we extract from a given paper? Out of interest I asked Claude what triples it would need to generate the answer it gave me to the Strepsiptera/Hemiptera example. Here is what it gave me:

Parasite–host relationships:

  • (Elenchus varleyi, parasitises, Sogatella kolophon)
  • (Elenchus varleyi, parasitises, Toya drope)
  • (Elenchus varleyi, host_family, Delphacidae)
  • (Sogatella kolophon, is_majority_host_of, Elenchus varleyi)

Life cycle and tissue intimacy (critical for the contamination argument):

  • (Elenchus varleyi female, is_obligate_endoparasite_of, adult host)
  • (Elenchus varleyi female, life_strategy, neotenic/larviform)
  • (Elenchus varleyi female, spends_entire_life_in, host body)
  • (Elenchus varleyi male, develops_as, endoparasitic larva)
  • (Elenchus varleyi female, extrudes_only_in, adult hosts)

Superparasitism (strengthens contamination argument):

  • (Elenchus varleyi, exhibits, superparasitism)
  • (superparasitism, maximum_observed, six parasitoids per host)
  • (superparasitism, frequency, common in Brisbane)

Dispersal mechanism:

  • (Elenchus varleyi female, dispersed_by, macropterous adult hosts)
  • (Elenchus varleyi larvae, dispersed_by, macropterous adult hosts)
  • (Elenchus varleyi male pupa, extrudes_in, both nymphs and macropterous adults)

Taxonomic and geographic scope:

  • (Elenchus, distribution, cosmopolitan)
  • (Elenchus, only_parasitises, Delphacidae)
  • (Elenchus varleyi, type_locality, Brisbane Queensland)
  • (Elenchus varleyi, distribution, Queensland NSW Canberra Northern Territory)

Supporting observation (supporting wide distribution argument):

  • (Perkins 1907, observed, Elenchus from widely separated countries show very little difference)

Now, I think we could dispense with some of these triples. I asked Claude and it said the following would be enough:

  • (Elenchus varleyi, parasitises, Sogatella kolophon)
  • (Elenchus varleyi female, is_obligate_endoparasite_of, adult host)
  • (Elenchus varleyi, host_family, Delphacidae)
  • (Elenchus varleyi female, dispersed_by, macropterous adult hosts)
  • (Elenchus, distribution, cosmopolitan)

These are fairly simple statements that could likely be extracted from the paper. That raises issues of how we do that, and how we express those triples. I asked Claude for suggested vocabularies; it mentioned Darwin Core and the Relations Ontology, among others. I am congenitally averse to big, verbose ontologies so I’d prefer something lightweight, and maybe Darwin Core would be enough?

So, in summary, I’m encouraged by the way Claude suggested a plausible explanation for the pattern in the barcode tree, and that it might not always need access to full text to do so (although I suspect giving an LLM access to full text is likely to beat giving it a set of triples that might not encompass all the relevant information in the paper). This also gives me a further incentive to work on the problem of providing context for each barcode, especially the scientific papers that published the sequences, and the papers that published the taxonomic names.

But one problem still remains. How do we get all this information back into BOLD so that a user looking at these sequences knows what is going on, knows that "Hemiptera" doesn't mean "Hemiptera" in this case, and that what we are seeing is a case of a widespread insect host being infected by a widespread parasite, which was originally described from Australia? The ability to add annotations and third party analyses will become crucial if people are to get the most out of DNA barcoding databases.

References

  • Caufield, H., Kroll, C., O’Neil, S. T., Reese, J. T., Joachimiak, M. P., Hegde, H., Harris, N. L., Krishnamurthy, M., McLaughlin, J. A., Smedley, D., Haendel, M. A., Robinson, P. N., & Mungall, C. J. (2024). CurateGPT: A flexible language-model assisted biocuration tool (arXiv:2411.00046). arXiv. https://doi.org/10.48550/arXiv.2411.00046
  • Kathirithamby, J. (1989). Descriptions and biological notes of the Australian Elenchidae (Strepsiptera). Invertebrate Taxonomy, 3(2), 175–195. https://doi.org/10.1071/it9890175

Written with StackEdit.

Sunday, February 15, 2026

GBIF Geocoder: using GBIF to find places on a map

I’ve relaunched a “toy” tool that I made a while ago to help geocode localities using GBIF. Geocoding converts a text string, such as “Cambodia: Ratanakiri Province”, into latitude and longitude coordinates. For some reason, the biodiversity community typically refers to this as “georeferencing”, a term that is usually defined as aligning an image, such as a scanned map, with geographic coordinates (see the Wikipedia entry for georeferencing, and Allmaps for some great examples).

You can try GBIF Geocoder at https://rdmpage.github.io/gbif-geocoder/. Code is available on GitHub at https://github.com/rdmpage/gbif-geocoder.

The idea behind the “GBIF Geocoder” is that GBIF has a huge number of geocoded specimens, and hence if you are looking for coordinates for a locality there is a good chance that somebody has already found them. So, all we need to do is search GBIF for specimens with localities that match the place you are trying to geocode. I created a version of this tool in 2018, mentioning it in a blog post GBIF at 1 billion - what’s next?, and wrote it up in a short note in bioRxiv Geocoding genomic databases using GBIF.
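The core of the idea can be sketched in a few lines against GBIF's public occurrence search API. The endpoint and the `locality`/`hasCoordinate` parameters are real API features; the parsing below assumes the standard occurrence-search JSON shape (`results` entries with `decimalLatitude`/`decimalLongitude`), and is of course a simplification of what the tool actually does:

```python
# Sketch: ask GBIF for occurrences whose locality matches a text string,
# then harvest coordinates from records that have them.
from urllib.parse import urlencode

def gbif_search_url(locality, limit=20):
    """Build a GBIF occurrence search URL restricted to georeferenced records."""
    params = {"locality": locality, "hasCoordinate": "true", "limit": limit}
    return "https://api.gbif.org/v1/occurrence/search?" + urlencode(params)

def extract_coordinates(response_json):
    """Pull (lat, lng, locality) triples from an occurrence search response."""
    return [
        (r["decimalLatitude"], r["decimalLongitude"], r.get("locality", ""))
        for r in response_json.get("results", [])
        if "decimalLatitude" in r and "decimalLongitude" in r
    ]
```

Fetching `gbif_search_url("Cambodia: Ratanakiri Province")` and feeding the JSON to `extract_coordinates` gives candidate coordinates that somebody has already worked out for a matching locality string.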

The original version was hosted on Glitch, a wonderful platform that let people create pretty much anything using HTML and Javascript. Glitch is no more, so I’ve finally got around to rebuilding it, inspired by this post on Bluesky by Tapani Hopkins:

Next quest: figure out from this map where "La Maboke" was. Searching for "Maboke, Central African Republic" turned out not to work like I'd hoped for. 😅 Though perhaps I should save the recipe. 🍲🐟


— Tapani Hopkins (@tapani-hopkins.bsky.social) Feb 14, 2026 at 13:00

The original project used node.js, whereas I wanted something simple using just HTML and Javascript so it could be hosted using GitHub Pages (or, indeed, any other static hosting platform). I fired up Claude Code to help me with the port. I continue to be amazed at just how much fun this style of coding is, and at the power of the tools. I make requests and suggestions, and Claude will fire up an instance of Google Chrome to check that the code works. I think a key feature of this style of programming is that it reduces the initial hurdle when you know you need to make changes (and may even have made notes to yourself about what needs to be done) but face the tedium of reworking old code for a new platform, i.e. Googling questions, re-reading GitHub docs, etc. Instead, I get to focus on what I want to do, namely revive an old tool that I think people may find useful.


Wednesday, November 19, 2025

Model Context Protocol (MCP) and triple stores: natural language queries for knowledge graphs

Some quick notes based on experiments with Model Context Protocol (MCP) and [Claude](https://claude.ai).

Model Context Protocol (MCP) is all the rage right now, and I’ve been slow to take a look. Kingsley Idehen recently wrote The Semantic Web Project Didn’t Fail — It Was Waiting for AI (The Yin of its Yang) where he argued that Large Language Models (LLMs) provide (finally) a user-friendly way to query triple stores (i.e., knowledge graphs). Instead of simply presenting users with an empty SPARQL query box, we can now formulate a query in natural language and have AI convert that into SPARQL.

That eases the challenge of learning a new query language, but it gets better. MCP enables us to connect an AI to another service, acting a bit like a broker. You tell the AI what you want to do, the AI talks to the MCP server to figure out how to do what you want, gets the results, then converts them into a natural language (or other format) result that you can use. Hence you can have a conversation with a knowledge graph!

There are examples of MCP servers that speak SPARQL, such as MCP Server SPARQL by Eric Zhu. Since I mostly program in PHP (gasp) version 7 (gasp) I ended up asking ChatGPT to help write a simple MCP server. There then followed a dance between ChatGPT and Claude where ChatGPT would very confidently declare that the code was done, and Claude would get increasingly exasperated that I appeared to be trying to do something that wasn’t working. I eventually had to tell Claude to back the f**k off with its snarky comments and maybe be more helpful in its messages. Eventually I got a simple server up and running.

The code php-mcp-server is very basic, but supports SPARQL queries against an instance of Oxigraph running on my Mac. For example, I can ask:

What publication cites the sequence https://identifiers.org/insdc/MH493846

and Claude will respond:

I’ll query for publications that cite that sequence.
Good! I found a publication. Let me get more details about it:
Perfect! The sequence https://identifiers.org/insdc/MH493846 is cited by:
Publication: “Integrative taxonomy reveals a new Gammarus species (Crustacea, Amphipoda) surviving in a previously unknown southeast European glacial refugium”
Authors: Copilaș-Ciocianu, D., Zimta, A.-A., and Petrusek, A.
DOI: https://doi.org/10.1111/jzs.12248

It will also show me the SPARQL queries it makes to find this information.
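For flavour, here is a hedged sketch of the kind of query and HTTP request that sits behind such a question. The `cito:cites` predicate is an assumption about how my knowledge graph models citation, and the endpoint URL uses Oxigraph's defaults (port 7878, path `/query`); neither is baked into MCP itself:

```python
# Sketch: the SPARQL query and HTTP request an MCP tool might issue for
# "What publication cites the sequence <IRI>?". Predicate and endpoint
# are assumptions about my local setup.
from urllib.parse import urlencode
from urllib.request import Request

def citing_publications_query(sequence_iri):
    """Ask for publications that cite a given sequence IRI."""
    return f"""
PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?publication WHERE {{
  ?publication cito:cites <{sequence_iri}> .
}}
"""

def sparql_request(query, endpoint="http://localhost:7878/query"):
    """Build a POST request asking for SPARQL JSON results."""
    data = urlencode({"query": query}).encode()
    return Request(endpoint, data=data,
                   headers={"Accept": "application/sparql-results+json"})
```

The MCP server's job is essentially to accept natural language from Claude, run something like this against the triple store, and hand the JSON results back for Claude to summarise.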

This feels like a game changer. The MCP server I’ve written is incredibly crude, but I can now start to query a knowledge graph about DNA barcodes and associated literature in plain English, and get back useful results.

What I really want to do is combine this with details on the actual papers (for example, lists of specimens sequenced, whether they are type specimens, where the samples were collected from, etc.) as a way to help curate databases such as BOLD. I recently released BOLD View (see blog post BOLD View: exploring DNA barcodes) to make it easier to explore DNA barcode data, and I’m fascinated by how much scope there is for curation to add taxonomic identifications, geographic locations, etc.

To make this curation easier I’ve started to assemble a knowledge graph linking barcodes, GenBank sequences, and taxonomic names to the associated scientific literature, with the ultimate goal of being able to ask: “given this barcode that lacks a proper scientific name, is there anything in the published literature that can tell me what it actually is?”. The idea of being able to literally ask that question using a combination of an AI and an MCP server is very exciting.


Thursday, August 07, 2025

Make Data Count Kaggle Competition

I’ve written several times here about the Make Data Count project and its major output to date, the Data Citation Corpus, currently at version 4 (see The fourth release of the Data Citation Corpus incorporates data citations from Europe PMC and additions to affiliation metadata).

In June Make Data Count launched a Kaggle Competition with the goal of developing a tool that will process articles (in either PDF or XML format), extract data citations (e.g., DOIs for datasets in repositories such as Dryad, or accession numbers such as 6TAP in the Protein Data Bank), and classify these citations as either “primary” (data published in that paper) or “secondary” (reuse of existing data).

I think the competition is an excellent idea, and the US$100,000 in prize money is a great motivator to get people trying to solve this problem. I’m taking part in the competition, which has meant learning Python very fast. I’ve dabbled a bit before, but this was a whole new thing. ChatGPT has been indispensable, especially in explaining why something I was doing wasn’t going to work, and what an error message really meant. The whole process became horribly addictive. You can submit a solution up to five times a day, and the counter resets at midnight GMT, so there were nights I was up well after midnight coding and using up the following day’s submission quota! Another interesting feature is the lively discussion between people who are rivals for substantial prize money. Participants are sharing code and ideas (often not their best scoring ideas, since everyone wants to win), but still giving hints and support, and sharing findings.

The competition provides a small set of training data (about 500 PDFs and a similar number of XML files). The idea is that you write code to analyse those files and output a list of data citations. You then submit your entry to Kaggle, which runs your code against a “hidden” set of PDFs and XML files and tells you your score. The best score wins prizes. My place in this competition pretty accurately reflects my skills and ability :)

Issues with the competition

Unfortunately the competition itself has been — how shall I put this — poorly run. There has been virtually no engagement from DataCite in their own competition, despite repeated queries from entrants asking them to explain the often inexplicable scoring in the training data, or why some of the PDFs are wrong or incomplete. Some PDFs are preprints, not the actual papers (and may differ in whether they cite data or not). The XML comes in a variety of formats, which we weren’t told about. Some XML was “gold standard” JATS-XML as used by PubMed Central, other files were publisher-specific, or the output of PDF parsers or annotation tools.

I ended up making my own training data (https://doi.org/10.34740/kaggle/dsv/12667298) listing what I think are the actual data citations (about twice as many as are in the “official” training data).

There are some high-scoring entries (see the leaderboard), so it looks like Make Data Count will get some useful tools from this competition. My only concern is that these tools may be optimised to replicate the somewhat erratic and poorly described annotation process that DataCite used to create the training and “hidden” test data, rather than accurately retrieve the actual data citations. Perhaps my concerns will prove unfounded, or maybe the tools can be easily retrained with better data.

But I am somewhat baffled that such an important project, for which Make Data Count has secured funding for serious prize money, has been essentially left unattended by the organisers.

The competition runs until 3 September.


Tuesday, July 08, 2025

How many times are DNA barcoding datasets cited?

This note accompanies a dataset that I uploaded to Zenodo (https://doi.org/10.5281/zenodo.15824274). My goal in creating this dataset is to link data created on the Barcode of Life Data Systems to the DOIs for those datasets, and then to link those data DOIs to DOIs for the papers (if any) that created those datasets, and/or cited them.

For example, the paper “DNA barcodes enable higher taxonomic assignments in the Acari” (Young et al., 2021) cites three barcode datasets: DS-BINFL, DS-5FLR, and DS-10FLR. Each of these datasets has a DOI of the form https://doi.org/10.5883/ plus the DS number. One reason I want to make these links is so that when the dataset is displayed, say, in my BOLD View app, I could also show the papers that created or cited the dataset, providing some context to the data (e.g., why was the data collected?). Another reason is that once we link data to papers we can do some interesting things, such as assign credit (Zeng et al. 2020), or discover which organisations funded the work. I hope to explore these topics in the future.
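The dataset-to-DOI mapping described above is mechanical, which makes linking straightforward once you have the DS numbers. A trivial sketch:

```python
# A BOLD dataset code such as "DS-BINFL" maps to the DOI
# https://doi.org/10.5883/DS-BINFL (the 10.5883 prefix plus the DS number).
def bold_dataset_doi(ds_code):
    return "https://doi.org/10.5883/" + ds_code

# The three datasets cited by Young et al. (2021):
dois = [bold_dataset_doi(c) for c in ["DS-BINFL", "DS-5FLR", "DS-10FLR"]]
```

The hard part, of course, is not constructing the DOIs but finding the papers that mention the DS numbers in the first place.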

Matching datasets to publications was a tedious process; there are more details in the GitHub repository. I started with a Google Scholar search, then did lots of manual filtering and cleaning. Most of the articles have DOIs, and only these articles are included in the Zenodo dataset, which is intended as a contribution to Make Data Count.

This only scratches the surface of what could be done. There are many datasets that I could not find in the literature (they may never have been cited). I also want to retrieve links between individual DNA barcodes and the papers that published them. Apart from context and metrics, I’m also interested in whether these papers might contain more detailed information about the sequences, such as geographic localities. In this way we could potentially enrich the BOLD database, as part of the “virtuous cycle” envisioned by David Schindel (Schindel and Page, 2024).

References

Page, R. (2025). Citations of datasets published by Barcode of Life Data Systems (BOLD) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15824274

Schindel, D. E., & Page, R. M. P. (2024). Creating Virtuous Cycles for DNA Barcoding: A Case Study in Science Innovation, Entrepreneurship, and Diplomacy. DNA Barcoding, 7–32. https://doi.org/10.1007/978-1-0716-3581-0_1

Young, M. R., deWaard, J. R., & Hebert, P. D. N. (2021). DNA barcodes enable higher taxonomic assignments in the Acari. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-95147-8

Zeng, T., Wu, L., Bratt, S., & Acuna, D. E. (2020). Assigning credit to scientific datasets using article citation networks. Journal of Informetrics, 14(2), 101013. https://doi.org/10.1016/j.joi.2020.101013


Thursday, June 05, 2025

A metabarcoding mess and the importance of just looking at the data

How to cite: Page, R. (2025). A metabarcoding mess and the importance of just looking at the data. https://doi.org/10.59350/q2v8n-wc488

Here I summarise a few posts on Bluesky where I raised concerns about some metabarcoding datasets that were highlighted by GBIF:

>3.4 million insect records based on DNA metabarcoding of bulk samples from #Sweden and #Madagascar have been mobilized to GBIF thanks to collaborative efforts of research institutions led by the #NaturhistoriskaRiksmuseet link

Looking at these datasets it’s clear that something is wrong.

Data

The datasets discussed are for CO1 Amplicon Sequence Variants from Madagascar, which are part of the Insect Biome Atlas project. The data is described in Miraldo et al. (https://doi.org/10.1038/s41597-025-05151-0). There are two datasets for Madagascar:

  • CO1 Amplicon Sequence Variants of leaf litter arthropod communities collected at Malaise traps from the Insect Biome Atlas project in Madagascar https://doi.org/10.15468/pad7pc
  • CO1 Amplicon Sequence Variants of bulk arthropod samples (mild lysis) collected with Malaise traps from the Insect Biome Atlas project in Madagascar https://doi.org/10.15468/6u5rum

In case the data changes in the future I’ve made snapshots of the two datasets and uploaded them to Zenodo doi:10.5281/zenodo.15599342. The files I downloaded (https://doi.org/10.15468/dl.kwjyjt and https://doi.org/10.15468/dl.2p3z5q) are the GBIF annotated archives, hence they include the mapping between the taxonomic names and GBIF’s backbone taxonomy.

Problem

In browsing the data on GBIF I noticed some striking distribution patterns: insects normally found in Europe and/or North America were also turning up in Madagascar, based solely on these metabarcoding datasets. For example, Helina impuncta.

Helina impuncta

Metabarcoding data can be a complicated beast, especially if you try to navigate the multiple databases that house metadata on the sampling program and the output of sequencing machines. For example, GBIF occurrence 5162479277 is linked to ENA record ERR12944764, which in turn has multiple identifier links:

| Study Accession | Sample Accession | Experiment Accession | Run Accession | Tax Id |
| --- | --- | --- | --- | --- |
| PRJEB61109 | SAMEA115499645 | ERX12317105 | ERR12944764 | 1234904 |

What’s nice about the GBIF datasets is that they wrap all this up into a single package that we can explore. BLASTing a few sequences in these datasets suggests that the identifications of these sequences were probably correct, so the source of the problematic maps lies elsewhere.

Lots of maps

I wrote a simple PHP script to read the GBIF dataset, aggregate the GBIF taxon ids (i.e., the GBIF taxa that the sequences were mapped to) and draw a map for each taxon (code is on GitHub). These maps use GBIF’s maps API to retrieve a tile (256 × 256 pixels) showing the distribution of each taxon on a global map (i.e., zoom level 0 on a tiled web map). I overlay that on a GBIF base map tile (see Base Map Tiles), and dump the output as HTML.
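The script itself is PHP, but the essential step is just composing URLs against the GBIF v2 maps API. Here is the same idea sketched in Python; the URL pattern (density tiles filtered by `taxonKey`, with `srs=EPSG:3857`) follows GBIF's maps API, while the bare-bones HTML output is my own simplification:

```python
# Sketch: one zoom-level-0 density tile per taxon, dumped as minimal HTML.
# The real script also overlays each tile on a GBIF base map tile.
def density_tile_url(taxon_key, z=0, x=0, y=0):
    """URL for a 256x256 occurrence density tile for one taxon."""
    return (f"https://api.gbif.org/v2/map/occurrence/density/{z}/{x}/{y}@1x.png"
            f"?taxonKey={taxon_key}&srs=EPSG:3857")

def taxon_map_html(taxon_keys):
    """Emit a crude HTML page: one global density map per taxon."""
    rows = [
        f'<div><img src="{density_tile_url(k)}" width="256" height="256"></div>'
        for k in taxon_keys
    ]
    return "\n".join(rows)
```

Crude, but scrolling through a page of these tiles is exactly how the Euro-Madagascar patterns jumped out.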

This is crude but gives a quick visual overview of the data. For the litter datasets there are a lot of these Euro-Madagascar distributions:

litter

For the malaise trap data the results look much more like what I’d expect, lots of taxa restricted to Madagascar.

malaise

But there are still examples of the problematic pattern mentioned above.

What happened?

In the paper describing the data there is a paragraph discussing contamination:

As part of data clean-up, it is usually advised to remove ASVs present in negative controls, or the maximum number of reads for those, from the entire dataset71. However, after careful inspection of our negative controls, we noticed that only a few ASVs were persistently showing up in control samples. The majority of ASVs seemed to be arthropod sequences that were present in the bulk samples, and also sporadically present in negative controls in relatively small numbers. This was presumably due to DNA spreading between samples through tiny droplets during sample processing, or to low-level of “index hopping”, leading to incorrect assignment of reads during sequencing, despite the use of double-unique indexes in library preparation72. link

The paper goes on to discuss possible examples of contamination. Looking at the results I suspect there has been a lot more contamination than the authors allow, especially for the litter dataset.

Summary

These results are preliminary, and I’ve contacted the authors of the paper to see if we can find out what happened. But for me the most obvious conclusions are:

  • Metabarcoding has the potential to generate a lot of spurious records that may negatively impact databases such as GBIF.
  • One of the great features of GBIF is that it enables you to simply look at the data. In an age of automated pipelines and big data I think visualisation is increasingly important. It’s often an easy way to discover that something is not as it should be.

References

Miraldo, A., Sundh, J., Iwaszkiewicz-Eggebrecht, E. et al. Data of the Insect Biome Atlas: a metabarcoding survey of the terrestrial arthropods of Sweden and Madagascar. Sci Data 12, 835 (2025). https://doi.org/10.1038/s41597-025-05151-0

Page, R. (2025). Snapshot of Insect Biome Atlas data for Madagascar from GBIF [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15599342


Friday, May 16, 2025

Tracking changes in DNA barcode BINs

Following on from releasing BOLD View I’ve started to explore how the classification of DNA barcodes changes over time. BOLD uses the RESL algorithm described in Ratnasingham & Hebert (2013, 2016) to cluster barcodes into “BINs”. As the number of DNA barcodes grows over time these clusters may change. For example, some clusters may increase in size as barcodes are added, and some clusters may be merged as sequences of intermediate similarity are found that link those BINs. Within the public-facing BOLD portal there is no way to see the history of a BIN (Meier et al., 2022), so I decided to explore this. I downloaded data packages from BOLD for the period 2022–2024, as well as the BARCODE 500K data for 2016. BOLD issues regular releases of its data: quarterly releases are persistent and receive a DOI, while more frequent releases don’t get a DOI and seem to disappear from the web site, but I have a copy of the release for 06-Sep-2024, which I used to create BOLD View.

The data packages I’ve used to infer version history are listed below.

| Dataset | DOI |
| --- | --- |
| iBOLD.31-Dec-2016 | 10.5883/dp-ibold.31-dec-2016 |
| BOLD_Public.30-Mar-2022 | 10.5883/dp-bold_public.30-mar-2022 |
| BOLD_Public.06-Jul-2022 | 10.5883/dp-bold_public.06-jul-2022 |
| BOLD_Public.28-Sep-2022 | 10.5883/dp-bold_public.28-sep-2022 |
| BOLD_Public.30-Dec-2022 | 10.5883/dp-bold_public.30-dec-2022 |
| BOLD_Public.31-Mar-2023 | 10.5883/dp-bold_public.31-mar-2023 |
| BOLD_Public.30-Jun-2023 | 10.5883/dp-bold_public.30-jun-2023 |
| BOLD_Public.29-Sep-2023 | 10.5883/dp-bold_public.29-sep-2023 |
| BOLD_Public.29-Dec-2023 | 10.5883/dp-bold_public.29-dec-2023 |
| BOLD_Public.29-Mar-2024 | 10.5883/dp-bold_public.29-mar-2024 |
| BOLD_Public.19-Jul-2024 | 10.5883/dp-bold_public.19-jul-2024 |
| BOLD_Public.06-Sep-2024 | no DOI |

Versioning

I am only interested in a few of the fields in the data, namely bin_uri, identification, identification_method, and identified_by. Note that field names can change between data packages, so we may have to translate field names, or assemble a field’s value from other fields (e.g., the taxonomic classification). Rather than store all the data I used tuple-versioning, so that we store values for processid and the various data fields, together with values for valid_from and valid_to. The first time a combination of values is found we set valid_from to the YYYY-MM-DD date of the corresponding data package, and valid_to to NULL. Note that we may have multiple barcodes for a given processid (e.g., for different genes), so we index on both processid and marker_code. We also compute an MD5 hash of the data for a barcode to enable fast lookup of a particular set of values. The hash is not sufficient to identify an edit as the same set of values may have more than one period of validity. For example, a barcode may be in one BIN, then move to another, then move back again.

When we load the first data package (iBOLD.31-Dec-2016) all rows in the database will have NULL values for valid_to. This signals that those values for the data are currently valid. We then add the remaining data packages from oldest to most recent. For each barcode, if the data for a barcode in the current package is the same as that already in the database (i.e., for which valid_to is NULL) we do nothing. But if the data has changed we do the following:

  • set valid_to for the most recent row to the YYYY-MM-DD date of the current data package
  • add a new row with valid_from set to the same date, and valid_to set to NULL.

At the end of this process we have a list of values for the selected fields for each barcode, together with the time span that those values were valid.
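The update step described above can be sketched in a few lines. An in-memory list stands in for the database table here, and the MD5 hash of the field values is used for the fast "has anything changed?" comparison:

```python
# Sketch of the tuple-versioning update. Each row stores the selected field
# values for a (processid, marker_code) key plus valid_from/valid_to.
import hashlib

rows = []  # stand-in for the database table

def value_hash(values):
    """MD5 over the field values, for quick change detection."""
    return hashlib.md5("|".join(values.values()).encode()).hexdigest()

def load_package(package_date, records):
    """records: {(processid, marker_code): {field: value, ...}}"""
    for key, values in records.items():
        current = next((r for r in rows
                        if r["key"] == key and r["valid_to"] is None), None)
        if current and current["hash"] == value_hash(values):
            continue  # unchanged since the last package: do nothing
        if current:
            current["valid_to"] = package_date  # close the old row
        rows.append({"key": key, "values": values,
                     "hash": value_hash(values),
                     "valid_from": package_date, "valid_to": None})
```

Loading packages from oldest to newest then yields, for each barcode, the sequence of value sets and the time span over which each was valid.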

Queries

There are two kinds of queries I’ve explored so far. The first is tracking the changes for an individual barcode, the other is the history of a BIN.

Barcode histories

Here is the history for XAF587-05

2022-03-30 - 2022-09-28

  • identification: Poanes hobomok
  • identified_by: Paul Hebert

2022-09-28 - 2024-07-19

  • identification: Lon hobomok
  • identified_by: Paul Hebert

2024-07-19 -

  • identification: Lon hobomok
  • identified_by: Paul D.N. Hebert

This example shows that we need to be careful when counting edits to a barcode. We could simply record these as changes in identification and identifier, but it is a little more complicated. Poanes hobomok and Lon hobomok are synonyms (Cong et al., 2019), so the taxonomic identification hasn’t changed, merely the name. In the absence of a single authoritative source of taxonomic names and synonyms I use TAXAMATCH-like rules to “stem” the species names (Boyle et al., 2013), so that if two values of identification have the same species epithet (taking into account a possible change in gender of the genus name) I treat them as changes in name, not identification. The other change is from “Paul Hebert” to “Paul D.N. Hebert”, which is clearly the same person. I compute the Levenshtein distance between values of identified_by and treat any distance > 5 as a different name (5 was chosen so that “Paul Hebert” and “Paul D.N. Hebert” would count as the same).
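Both rules are easy to sketch. The stemming below is a deliberately crude toy version of the epithet comparison (it only strips a few Latin gender endings), not the actual rules used:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def same_identifier(x, y, threshold=5):
    """Treat identifier names within the threshold as the same person."""
    return levenshtein(x, y) <= threshold

def epithet_stem(name):
    """Toy stem: species epithet minus a trailing Latin gender ending."""
    epithet = name.split()[-1].lower()
    for ending in ("us", "um", "a"):
        if epithet.endswith(ending):
            return epithet[: -len(ending)]
    return epithet
```

With these definitions, “Paul Hebert” and “Paul D.N. Hebert” are distance 5 apart (five inserted characters), so they count as the same identifier, and “Poanes hobomok” and “Lon hobomok” stem to the same epithet.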

BINs

For BINs I reconstruct the history by taking a BIN and finding all barcodes that have, at any point in time, been a member of that BIN. So far the best way I’ve come up with to visualise the changes in a BIN is to create a “storyline” (see Liu et al., 2013) where the composition of each BIN is shown at each time slice.
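Over tuple-versioned rows this query is straightforward to sketch. The rows below are invented, and `members_at` is a hypothetical helper, not BOLD-View code:

```python
# (processid, bin_uri, valid_from, valid_to); valid_to None = currently valid
rows = [
    ("A", "BOLD:X1", "2020-01-01", "2021-01-01"),
    ("A", "BOLD:X2", "2021-01-01", None),  # A moved from X1 into X2
    ("B", "BOLD:X2", "2020-01-01", None),
]

def members_at(rows, bin_uri, date):
    """Barcodes assigned to bin_uri at a given package date."""
    return sorted(
        p for p, b, f, t in rows
        if b == bin_uri and f <= date and (t is None or date < t)
    )

# every barcode that was EVER in the BIN, then its membership per time slice
ever_in = {p for p, b, f, t in rows if b == "BOLD:X2"}
slices = {d: members_at(rows, "BOLD:X2", d) for d in ("2020-06-01", "2021-06-01")}
```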

For example, here is the history of BIN BOLD:ABX0491, which contains barcodes identified as Rhamma, Rhamma anosma, and Rhamma bilix (Prieto et al., 2021).

The vertical columns are time slices, barcodes in the same BIN are grouped together in coloured rectangles, and the history of each barcode can be traced from left to right. You can see cases where barcodes have moved between BINs (BOLD:ABX0491 gobbled up two smaller BINs). There are also barcodes that were (for one time slice) not in any BIN.

This visualisation has been challenging to create; I ended up using Graphviz as implemented in GraphvizOnline (https://dreampuf.github.io/GraphvizOnline).
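For what it’s worth, here is one rough way such a storyline could be emitted as DOT (my own sketch, with made-up data, not the code actually used): one cluster per (time slice, BIN), with edges tracing each barcode from left to right.

```python
def storyline_dot(slices):
    """Emit Graphviz DOT: a cluster per (time slice, BIN), edges per barcode."""
    lines = ["digraph storyline {", "  rankdir=LR;"]
    prev_nodes = {}
    for i, (date, bins) in enumerate(sorted(slices.items())):
        nodes = {}
        for j, (bin_uri, members) in enumerate(sorted(bins.items())):
            lines.append(f'  subgraph cluster_{i}_{j} {{ label="{bin_uri} ({date})";')
            for p in members:
                node = f"{p}_{i}"          # one node per barcode per time slice
                nodes[p] = node
                lines.append(f'    {node} [label="{p}"];')
            lines.append("  }")
        for p, node in nodes.items():      # trace each barcode from the previous slice
            if p in prev_nodes:
                lines.append(f"  {prev_nodes[p]} -> {node};")
        prev_nodes = nodes
    lines.append("}")
    return "\n".join(lines)

slices = {
    "2022-03-30": {"BOLD:X1": ["A"], "BOLD:X2": ["B", "C"]},
    "2022-09-28": {"BOLD:X2": ["A", "B", "C"]},  # X2 has gobbled up X1
}
dot = storyline_dot(slices)
```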

Summary

This is still at an early stage, but it looks promising. The next step would be to incorporate it into BOLD-View. It might also be interesting to develop measures of the stability of barcode clustering based on how often members move around.

References

  • Boyle, B., Hopkins, N., Lu, Z., Raygoza Garay, J. A., Mozzherin, D., Rees, T., Matasci, N., Narro, M. L., Piel, W. H., Mckay, S. J., Lowry, S., Freeland, C., Peet, R. K., & Enquist, B. J. (2013). The taxonomic name resolution service: an online tool for automated standardization of plant names. BMC Bioinformatics, 14(1). https://doi.org/10.1186/1471-2105-14-16
  • Cong, Q., Zhang, J., Shen, J., & Grishin, N. V. (2019). Fifty new genera of Hesperiidae (Lepidoptera). Insecta Mundi, 2019, 0731. https://doi.org/10.5281/zenodo.3677235
  • Hebert, P., & Ratnasingham, S. (2016). Systems, methods, and computer program products for merging a new nucleotide or amino acid sequence into operational taxonomic units (United States Patent US20160103958A1). https://patents.google.com/patent/US20160103958A1
  • Liu, S., Wu, Y., Wei, E., Liu, M., & Liu, Y. (2013). StoryFlow: Tracking the Evolution of Stories. IEEE Transactions on Visualization and Computer Graphics, 19(12), 2436–2445. https://doi.org/10.1109/TVCG.2013.196
  • Meier, R., Blaimer, B.B., Buenaventura, E., Hartop, E., von Rintelen, T., Srivathsan, A. and Yeo, D. (2022), A re-analysis of the data in Sharkey et al.’s (2021) minimalist revision reveals that BINs do not deserve names, but BOLD Systems needs a stronger commitment to open science. Cladistics, 38: 264-275. https://doi.org/10.1111/cla.12489
  • Prieto, C., Faynel, C., Robbins, R., & Hausmann, A. (2021). Congruence between morphology-based species and Barcode Index Numbers (BINs) in Neotropical Eumaeini (Lycaenidae). PeerJ, 9, e11843. https://doi.org/10.7717/peerj.11843
  • Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. PLOS ONE, 8(7), e66213. https://doi.org/10.1371/journal.pone.0066213
