iPhylo: Using AI to understand a DNA barcoding mystery

Roderic D. M. Page

Tuesday, March 10, 2026

Using AI to understand a DNA barcoding mystery

How to cite: Page, R. (2026). Using AI to understand a DNA barcoding mystery. https://doi.org/10.59350/nbsfn-91m72

As I continue to add features to BOLD-View I keep coming across interesting cases where something seems not quite right with the data. This usually prompts further investigation, which typically means going down a rabbit hole. As an example, take barcode GMAEA6199-22, which was collected by Donald Hobern in Australia and is identified to order level as Strepsiptera (I thank Donald for this example).

This barcode falls within a cluster which contains sequences labelled as either strepsipteran (in many cases identified down to species, Elenchus varleyi) or hemipteran. Almost all the strepsipteran sequences are assigned to the BIN BOLD:ACH2898. The hemipteran sequences are not assigned to a BIN, even though they are essentially identical to the strepsipteran sequences.

A new feature I’ve added to BOLD-View’s display for a barcode is a map of all similar sequences, grouped by BINs. For this example the Strepsiptera BIN has a wide geographic range, including Central America and South Africa, as well as Australia (through GMAEA6199-22, which is a strepsipteran but not included in the BIN).

So, we have a widespread BIN, likely to be Elenchus varleyi, which has identical sequences to those labelled as hemipteran. What is going on?

Now if you know anything about Strepsiptera you’ll know that they have a pretty bonkers biology, being endoparasites of insects, and males and females have quite different life histories. But as an exercise I decided to ask Claude to see what it made of this situation. Here is the prompt I used.

I have an interesting DNA barcoding case I want you to look at. I am uploading part of a phylogeny for DNA barcodes for a strepsiteran Elenchus varleyi. Note that the barcodes mix sequences from Elenchus and Hemiptera. The barcodes come from samples in Australia, South Africa, and central America. I am uploading a Markdown file for a paper with the DOI 10.1071/it9890175. If you need to get information on species distributions then https://gbif.org is a good source. Given all this information, can you suggest explanations for why we have barcode sequences from Elenchus and Hemiptera, why only some sequences are assigned to a BIN, and why are these barcodes widely distributed.

I also uploaded the tree picture you see above, and the paper “Descriptions and biological notes of the Australian Elenchidae (Strepsiptera)” doi:10.1071/IT9890175, which described Elenchus varleyi (I got the DOI for the paper from another of my projects, BioNames).

The paper is behind a paywall, and isn't on SciHub, but is on [ResearchGate](https://www.researchgate.net/publication/248899086), so I downloaded the PDF and used Marker to convert it to Markdown, which has become the default language of choice for sending documents to AIs.

Claude thought for a bit and then came back with a summary that basically said:

The mixture of strepsipteran and hemipteran sequences is likely due to the “hemipteran” barcodes actually being for their strepsipteran parasites rather than the hemipterans themselves.
Given that one of the hosts of Elenchus varleyi is Sogatella kolophon, which is widespread, that would account for the occurrence of essentially identical DNA sequences across separate continents.

Claude noted that a single hemipteran can host multiple strepsipteran individuals, increasing the chance of mistakenly sequencing the parasite rather than the host. The picture below shows adult neotenic females ♀ of Elenchus varleyi in the abdomen of the hemipteran Sogatella kolophon (Fig. 14 from doi:10.1071/IT9890175).

Now, you and I could figure this out with some background biology and reading the paper, but the point here is that Claude could do it more or less automatically, which means we could potentally scale this to look at all “odd” or “anomalous” barcodes and get suggestions as to what has happened. Hence we have the possibility for large-scale database curation (see also “CurateGPT: A flexible language-model assisted biocuration tool” doi:10.48550/arXiv.2411.00046).
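As a rough illustration of what finding "odd" barcodes at scale might involve, here is a minimal Python sketch that flags clusters whose members carry more than one order-level label and lists the members that lack a BIN. The record layout, the cluster ids, and every process id and BIN other than GMAEA6199-22 and BOLD:ACH2898 are made up for the example; this is not BOLD's schema or how BOLD-View works.

```python
from collections import defaultdict

# Hypothetical records: (process_id, cluster_id, order_label, bin_uri or None).
# Only GMAEA6199-22 and BOLD:ACH2898 come from the post; the rest is invented.
records = [
    ("GMAEA6199-22", "c1", "Strepsiptera", None),
    ("EXAMPLE-001", "c1", "Strepsiptera", "BOLD:ACH2898"),
    ("EXAMPLE-002", "c1", "Hemiptera", None),
    ("EXAMPLE-003", "c2", "Hemiptera", "BOLD:EXAMPLE1"),
]


def flag_mixed_clusters(records):
    """Return clusters whose members carry more than one order label."""
    orders = defaultdict(set)     # cluster_id -> set of order labels
    unbinned = defaultdict(list)  # cluster_id -> process ids with no BIN
    for process_id, cluster_id, order_label, bin_uri in records:
        orders[cluster_id].add(order_label)
        if bin_uri is None:
            unbinned[cluster_id].append(process_id)

    flagged = {}
    for cluster_id, labels in orders.items():
        if len(labels) > 1:
            flagged[cluster_id] = {
                "orders": sorted(labels),
                "unbinned": sorted(unbinned[cluster_id]),
            }
    return flagged


flagged = flag_mixed_clusters(records)
print(flagged)
```

Each flagged cluster, together with its tree and any linked papers, could then be handed to an LLM with a prompt like the one above.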

That said, we need some things in place. It helps to have links to information on the species and sequences involved, e.g. what paper(s) described the species, what paper(s) published the sequences. In this case I used BioNames to find the DOI for the paper describing Elenchus varleyi, which was behind a paywall. Paywalls are likely to be an obstacle to this approach.

Out of curiosity I explored an alternative scenario, in part inspired by Plazi. Imagine that we can’t share the full text of the paper, but we do have access to the facts or claims the paper makes. Plazi doesn’t quite do this, it extracts images and the text of taxonomic treatments, but let’s imagine we could extract a series of triples from the paper, such as:

(Elenchus varleyi, parasitises, Sogatella kolophon)

Given a set of these we could potentially still infer the cause of the barcoding patterns we were seeing, but without needing access to the paywalled text (we’d need access to the text to extract the triples, but Claude wouldn’t need to see the text). Of course, this is not a new idea (Peter Murray-Rust has been advocating for this for a long time).
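The triple idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples, the `order` predicate and the helper names `objects` and `parasite_explains_label` are my own inventions for the example, and a real system would use identifiers rather than name strings.

```python
# Triples as plain (subject, predicate, object) tuples.
# The "parasitises" triple is the one quoted above; the two "order" triples
# are added for illustration and use an informal, made-up predicate name.
triples = {
    ("Elenchus varleyi", "parasitises", "Sogatella kolophon"),
    ("Elenchus varleyi", "order", "Strepsiptera"),
    ("Sogatella kolophon", "order", "Hemiptera"),
}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the set."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def parasite_explains_label(triples, parasite, order_label):
    """True if the parasite has a recorded host belonging to order_label."""
    for host in objects(triples, parasite, "parasitises"):
        if order_label in objects(triples, host, "order"):
            return True
    return False


# Could a "Hemiptera" label in an Elenchus varleyi cluster be a host label?
print(parasite_explains_label(triples, "Elenchus varleyi", "Hemiptera"))
```

A hard-coded rule like this is far cruder than what an LLM does with the same facts, but it shows that the triples alone carry enough information to suggest the host/parasite mix-up.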

But what triples should we extract from a given paper? Out of interest I asked Claude what triples it would need to generate the answer it gave me to the Strepsiptera/Hemiptera example. Here is what it gave me:

Parasite–host relationships:

(Elenchus varleyi, parasitises, Sogatella kolophon)
(Elenchus varleyi, parasitises, Toya drope)
(Elenchus varleyi, host_family, Delphacidae)
(Sogatella kolophon, is_majority_host_of, Elenchus varleyi)

Life cycle and tissue intimacy (critical for the contamination argument):

(Elenchus varleyi female, is_obligate_endoparasite_of, adult host)
(Elenchus varleyi female, life_strategy, neotenic/larviform)
(Elenchus varleyi female, spends_entire_life_in, host body)
(Elenchus varleyi male, develops_as, endoparasitic larva)
(Elenchus varleyi female, extrudes_only_in, adult hosts)

Superparasitism (strengthens contamination argument):

(Elenchus varleyi, exhibits, superparasitism)
(superparasitism, maximum_observed, six parasitoids per host)
(superparasitism, frequency, common in Brisbane)

Dispersal mechanism:

(Elenchus varleyi female, dispersed_by, macropterous adult hosts)
(Elenchus varleyi larvae, dispersed_by, macropterous adult hosts)
(Elenchus varleyi male pupa, extrudes_in, both nymphs and macropterous adults)

Taxonomic and geographic scope:

(Elenchus, distribution, cosmopolitan)
(Elenchus, only_parasitises, Delphacidae)
(Elenchus varleyi, type_locality, Brisbane Queensland)
(Elenchus varleyi, distribution, Queensland NSW Canberra Northern Territory)

Supporting observation (supporting wide distribution argument):

(Perkins 1907, observed, Elenchus from widely separated countries show very little difference)

Now I think that some of these triples we could dispense with. I asked Claude and it said the following would be enough:

(Elenchus varleyi, parasitises, Sogatella kolophon)
(Elenchus varleyi female, is_obligate_endoparasite_of, adult host)
(Elenchus varleyi, host_family, Delphacidae)
(Elenchus varleyi female, dispersed_by, macropterous adult hosts)
(Elenchus, distribution, cosmopolitan)

These are fairly simple statements that could likely be extracted from the paper. That raises issues of how we do that, and how we express those triples. I asked Claude for suggested vocabularies; it mentioned Darwin Core and the Relations Ontology, among others. I am congenitally averse to big, verbose ontologies so I’d prefer something lightweight, and maybe Darwin Core would be enough?
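To get a feel for the Darwin Core option, here is a small Python sketch that maps a triple onto terms from Darwin Core's ResourceRelationship class (`resourceID`, `relationshipOfResource`, `relatedResourceID`, `relationshipAccordingTo`). The function name `triple_to_dwc` is hypothetical, and putting taxon names where Darwin Core expects resource identifiers is a simplification I am making for the example, not something the standard recommends.

```python
def triple_to_dwc(subject, predicate, obj, according_to=None):
    """Map a (subject, predicate, object) triple to a flat record keyed by
    Darwin Core ResourceRelationship term names.

    Assumption: names stand in for identifiers to keep the sketch readable.
    """
    record = {
        "resourceID": subject,
        "relationshipOfResource": predicate,
        "relatedResourceID": obj,
    }
    if according_to is not None:
        # Source of the claim, here the paper the triple was extracted from.
        record["relationshipAccordingTo"] = according_to
    return record


record = triple_to_dwc(
    "Elenchus varleyi",
    "parasitises",
    "Sogatella kolophon",
    according_to="https://doi.org/10.1071/it9890175",
)
print(record)
```

Triples that are not relationships between two things, such as the life-history statements, do not fit this shape as neatly, which is one way to test whether Darwin Core alone would be enough.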

So, in summary, I’m encouraged by the way Claude suggested a plausible explanation for the pattern in the barcode tree, and that it might not always need access to full text to do so (although I suspect giving an LLM access to full text is likely to beat giving it a set of triples that might not encompass all the relevant information in the paper). This also gives me a further incentive to work on the problem of providing context for each barcode, especially the scientific papers that published the sequences, and the papers that published the taxonomic names.

But one problem still remains. How do we get all this information back into BOLD so that a user looking at these sequences knows what is going on, knows that "Hemiptera" doesn't mean "Hemiptera" in this case, and that what we are seeing is a case of a widespread insect host being infected by a widespread parasite, which was originally described from Australia? The ability to add annotations and third-party analyses will become crucial if people are to get the most out of DNA barcoding databases.

References

Caufield, H., Kroll, C., O’Neil, S. T., Reese, J. T., Joachimiak, M. P., Hegde, H., Harris, N. L., Krishnamurthy, M., McLaughlin, J. A., Smedley, D., Haendel, M. A., Robinson, P. N., & Mungall, C. J. (2024). CurateGPT: A flexible language-model assisted biocuration tool (arXiv:2411.00046). arXiv. https://doi.org/10.48550/arXiv.2411.00046
Kathirithamby, J. (1989). Descriptions and biological notes of the Australian Elenchidae (Strepsiptera). Invertebrate Taxonomy, 3(2), 175–195. https://doi.org/10.1071/it9890175
