iPhylo

Roderic D. M. Page

Friday, June 05, 2026

AI-assisted review of BIOSCAN-5M paper

The following is a conversation with Claude walking through the paper BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity (arXiv:2406.12723, doi:10.48550/arXiv.2406.12723). This is a big paper with lots of models and lots of jargon (as is typical of machine learning papers), and I wanted to make sense of what the authors actually did, and what the practical outcome is. My gut instinct was that the photos provided are going to be next to useless at identifying insect species, and that DNA sequences are much better at that task, so what was the point? What I got from this conversation is that the paper shows that you can improve the ability to identify insects based on images if you have a model trained on both images and DNA sequences together, even if the images themselves don't look all that promising.

Q: I want to make sense of this paper on BIOSCAN-5M.

What it is. A dataset, not a new model — over 5 million arthropod specimens
(98% insects), each described across multiple modalities, plus benchmark
experiments. It’s the bigger successor to BIOSCAN-1M (~1.1M insects).

Each specimen has:

A microscope image (originally 1024×768, plus cropped and 256px versions).
A DNA barcode — ~648 bp of the COI gene, stored as a nucleotide string.
Taxonomic labels across seven ranks (phylum → … → species).
Geographic data (country, province, lat/long) and a size estimate.

Two problems baked into the data:

Massively long-tailed. Diptera (flies) are ~50% of everything; 63% of
specimens are from Costa Rica; two-thirds of species have ≤6 samples.
Mostly unlabelled at fine ranks. Over 80% have no label below family —
reframed as an opportunity for self-supervised / semi-supervised learning.

The data splits partition by a “species set” idea:

Seen — proper species name; normal train/val/test.
Unseen — known genus, placeholder species; tests handling a new species.
Heldout — placeholder genus and species.
Unknown — no species label (~91%); goes to the pretrain split.

Splits are made at the barcode level so sequences don’t leak across train/test.
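
To make the "no leakage" idea concrete, here is a minimal sketch of a barcode-level split: every specimen sharing a barcode string lands in the same partition. Hashing the barcode to choose a partition, the `test_fraction` value, and the toy records are all assumptions for illustration, not the authors' actual procedure.

```python
import hashlib
from collections import defaultdict


def split_by_barcode(records, test_fraction=0.2):
    """Assign every specimen sharing a barcode to the same split.

    `records` is a list of (specimen_id, barcode) pairs. Hashing the
    barcode (an illustrative choice) makes the assignment deterministic,
    so identical sequences can never straddle train and test.
    """
    splits = defaultdict(list)
    for specimen_id, barcode in records:
        digest = hashlib.sha256(barcode.encode("ascii")).digest()
        bucket = digest[0] / 256  # pseudo-uniform value in [0, 1)
        split = "test" if bucket < test_fraction else "train"
        splits[split].append((specimen_id, barcode))
    return dict(splits)


# Made-up toy records: s1/s2 and s3/s5 share a barcode.
records = [
    ("s1", "ACGTAC"), ("s2", "ACGTAC"),
    ("s3", "TTGACA"), ("s4", "GGCATA"), ("s5", "TTGACA"),
]
splits = split_by_barcode(records)
train_barcodes = {b for _, b in splits.get("train", [])}
test_barcodes = {b for _, b in splits.get("test", [])}
print(train_barcodes & test_barcodes)  # no barcode appears in both splits
```

Splitting on specimen IDs instead would let two specimens with the same sequence end up on opposite sides, which is exactly the leak the barcode-level split avoids.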

The three experiments:

DNA classification (BarcodeBERT). A BERT-style model pretrained on
barcodes beat bigger general-purpose genomic models; closed-world species ID
~99% (BLAST alignment gets 99.78%). The real win was open-world genus
assignment for unseen species (~47%).
Zero-shot clustering. Cluster pretrained embeddings without fine-tuning.
DNA embeddings crushed image embeddings; combining them did worse than DNA
alone because the strong DNA signal got diluted.
Multimodal alignment (CLIBD). CLIP-style contrastive training puts images,
DNA, and text in one shared space, enabling cross-modal queries. Bigger data
improved coarse ranks; species stayed hard.

One-line takeaway: DNA barcodes are by far the most informative signal for
telling insect species apart, and the dataset’s value is a huge, realistically
messy, multimodal playground for the “millions of mostly-unknown species”
problem.

Q: I’d prepared a Markdown version with figures — can you use it?

(Clarification: the earlier summary came from fetching the arXiv abstract page,
which served the LaTeX source — no local file was read. The uploads folder was
empty at first; a ZIP was then uploaded.)

The ZIP turned out to be the v6 (published) version, with a few differences
from the arXiv text: the repo moved to bioscan-ml/BIOSCAN-5M, “BIOSCAN-CLIP” is
now CLIBD, and 1.2M specimens are labelled to genus/species. It contained the
full Markdown plus 35 figure images.

Q: (Figures) — the look-alikes, the geography, the clustering, retrieval.

Figure 18 (Diptera look-alikes): each row is one genus, three species side
by side (e.g. Agromyza albipennis / frontella / idaeiana) — near
indistinguishable in a dish photo. This is why image-only ID stalls at coarse
ranks and DNA is needed for species.
Figure 11 (geography): Costa Rica = 3,256,316 samples; next is South Africa
at 322,096 — a tenfold drop. One site dominates; a long tail of token coverage
elsewhere.
Figure 4 (zero-shot clustering AMI): image encoders slope down hard (DINO
~88% at class → ~10% at species); DNA encoders stay high (HyenaDNA ~97% → ~90%
at species/BIN). DNA wins at every rank and the gap explodes as ranks get finer.
(Correction: image encoders peak at the coarsest ranks and decline
monotonically — the older “best at family” claim came from a prior paper.)
Figure 9 (retrieval): colour-coded matches (green = same species, etc.).
DNA→DNA is almost all green; image→image picks up family/order-level matches.

Recurring theme: the DNA barcode carries almost all the species signal;
images are useful at coarse levels or as a queryable stand-in when DNA is absent.

Q: Show me Figure 2.

A 3×8 grid of original full-size specimen images, showing the diversity of the
collection (iridescent beetles, lacewings, ladybirds, wasps, ants, flies) — the
deliberate contrast to Figure 18’s look-alikes. Every specimen sits in a circular
dish on a pale background under controlled lighting, which is what makes the
cropping tool and pixel-count size estimate work. Telling a green beetle from a
yellow lacewing is the “easy” coarse-rank axis; the hard part only appears within
a genus.

Q: In Table 5, what are “query and key combinations,” and the “Aligned embeddings” ticks?

Query/key = the model works by retrieval, not direct classification. Keys
are a labelled reference database (each embedded); a query is the new specimen —
embed it, find the nearest key, copy that key’s label. “X-to-Y” = query type X,
keys type Y:

DNA-to-DNA — match a barcode against known barcodes.
Image-to-Image — match a photo against known photos.
Image-to-DNA — the cross-modal case; only possible because contrastive
training put images and DNA in one space. Baseline (no alignment) scores ~0%
here; after alignment it jumps to ~50–64% at genus — that delta is the point.
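
The query/key mechanism described above can be sketched in a few lines. This is an illustration only: the three-dimensional embeddings are invented, and cosine similarity is an assumed choice of "nearest".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_label(query, keys):
    """Return the label of the key embedding most similar to the query."""
    best_label, _ = max(
        ((label, cosine(query, emb)) for label, emb in keys),
        key=lambda pair: pair[1],
    )
    return best_label


# Keys: a labelled reference database (made-up embeddings).
keys = [
    ("Diptera", [0.9, 0.1, 0.0]),
    ("Coleoptera", [0.1, 0.9, 0.1]),
    ("Hymenoptera", [0.0, 0.2, 0.9]),
]
# Query: the embedding of a new specimen, from any modality.
query = [0.8, 0.3, 0.1]
predicted = retrieve_label(query, keys)
print(predicted)  # Diptera
```

Nothing in `retrieve_label` cares which encoder produced the query or the keys, which is why an Image-to-DNA query only becomes meaningful once training has placed both modalities in the same space.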

Aligned-embeddings ticks (Img / DNA / Txt) = which modalities were pulled
together during training (✓ = included). Each taxon block has four rows:
✗✗✗ baseline, ✓✓✓ at 1M, ✓✓✗ at 5M, ✓✓✓ at 5M. Reading ticks against numbers
gives the three claims: alignment helps hugely; more data helps; adding text
helps coarse ranks but slightly hurts species (only ~9% of records have species
labels).

Q: So images alone are poor at species level but improve when aligned with DNA?

Yes, with two refinements:

Alignment is at training time; you don’t need DNA at inference.
Image-to-Image species H.M. goes 16.6% (raw) → 67.5% (aligned). The DNA
reshaped the image space during training, then is gone at deployment.
"More accurate" ≠ “good” at species. 67.5% still trails DNA-to-DNA (~97%),
and unseen species lag seen (60.8 vs 75.9).

The ✓✓✗ row (image+DNA, no text) captures almost the whole gain, so DNA — not
text — is the driver of the image improvement.
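
As a quick check on how the headline figure relates to the seen/unseen numbers, assuming H.M. is the harmonic mean of the two accuracies (as the Table 5 caption states):

```python
def harmonic_mean(seen, unseen):
    """Harmonic mean of two accuracies, given in percent."""
    return 2 * seen * unseen / (seen + unseen)


hm = harmonic_mean(75.9, 60.8)
print(round(hm, 1))  # 67.5
```

The harmonic mean sits below the arithmetic mean (68.35 here), so a model cannot score well on H.M. by doing well on seen species alone.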

Q: Were unseen sequences used in any training?

Mostly no — by design. Unseen species go to val_unseen / test_unseen /
key_unseen; CLIBD trains on pretrain + train only. No unseen label and no
unseen barcode (seen/unseen are de-duplicated by barcode) is trained on.

The wildcard: pretrain is the unknown set, which “may truly belong in any
of the other three categories,” and barcode de-dup is described only for
seen/unseen. So an individual that is biologically an unseen species could sit
unlabelled in pretrain. Clean label hold-out; slightly leakier sequence
hold-out (the DNA encoder also had upstream BOLD pretraining).

Why seen/unseen differ though both are species-level: (1) exposure — seen
species shaped the embedding space; unseen require generalisation; (2)
partition composition — unseen partitions are smaller and skewed toward
Hymenoptera/Lepidoptera, so the macro-averages aren’t head-to-head. At coarse
ranks unseen sometimes scores higher; only at species does the exposure penalty
make unseen < seen.

Q: Why does DNA + images improve DNA-only querying, when images are poor at barcodes?

“Poor at classifying barcodes” and “useful as a training signal” measure
different things.

The contrastive loss pairs each specimen’s DNA embedding with that same
specimen’s image — there’s no DNA-to-DNA positive pair. So the DNA encoder’s
signal is “sit where the image says this specimen sits, away from others.” The
image is a shared coordinate frame. DNA-to-DNA improves partly because the space
gets reorganised around specimen identity (a retrieval-shaped objective)
instead of masked-language-modelling token statistics.

Image-to-DNA scoring ~20% at species is a point-to-point test; it says nothing
about whether images provide useful aggregate grouping during training. Images
are a fuzzy-but-correlated teacher: roughly right at coarse structure, vague at
species. Averaged over millions of pairs, that regularises the geometry; the DNA
fills in the fine detail the image couldn’t.

Caveats: no “DNA fine-tuned contrastively with no image partner” row exists,
so “image contribution” can’t be cleanly isolated from “contrastive fine-tuning
in general”; and pulling DNA toward the weaker image space can slightly cost
precision at the finest ranks.

Q: So to improve an image species-classifier, align image and DNA embeddings?

Yes — the practical takeaway. Pay the DNA cost once at training; deploy a better
image-only model. Evidence: Image-to-Image species 16.6% → 67.5%.

Three bounding conditions:

You need paired data (photo + barcode for the same specimen). Can’t
retrofit onto a photo-only archive.
The lift is real but the ceiling is low at species (~67%); alignment makes
images genuinely strong only at coarse ranks (low-90s).
Attribute carefully — much of the jump is “contrastive in-domain training
at all”; DNA is the best partner, not provably the sole cause.

Recipe: get a paired image+barcode set for your taxa, fine-tune the image encoder
contrastively against the DNA, then ship the image model alone. Pilot on your own
imaging conditions first.

Q: So it’s hard to generalise beyond the images this system collects?

Two separate questions:

Across species, same imaging setup: reasonably supported (that’s what
seen/unseen tests).
Across imaging conditions: essentially untested, and that’s the real gap.
Every specimen is a posed, dead arthropod in a standardised dish, fixed scale,
consistent lighting. The size features and cropping tool all depend on that. A
field photo (live insect, clutter, occlusion, unknown scale) is a different
distribution on every axis, and nothing in the benchmark measures the jump.

The alignment also transfers a specific signal — the lab-visible correlates of
species. The model learned “lab-visible correlates of species,” not “species.”
Strong recipe for specimen-imaging pipelines; speculative for in-the-wild
ID. Not a flaw — it’s a dataset/benchmark paper, upfront about standardised
imaging and geographic skew.

Q: They had geographic data (Fig 10) but didn’t use it — interesting.

Right — Figure 10 shows full geolocation (country/province/lat/long) per record,
yet none of the three experiments use it.

As a predictive feature it’s treacherous: 63% Costa Rica means a model would
learn collection base rates, not biology — inflated benchmarks, collapse on
deployment. A defensible reason to exclude it.
As scientific signal it’s exactly what you’d want: species have ranges;
geography can break ties between look-alikes that don’t co-occur, and powers
range-mapping / invasive detection — the dataset’s stated motivations.

Best read as a deliberate boundary (geography isn’t an encoder-friendly modality
for the contrastive story) rather than an oversight; the skew is flagged as a
limitation. A sharp follow-up: “does range data disambiguate look-alikes without
just memorising collection sites?”

Q: Could embeddings flag labelling errors (e.g. a bug inside a fly species)?

Yes — the zero-shot clustering AMI already measures label/structure agreement;
the disagreements are the candidate-error list. Three cases with different
signatures:

Misidentification — embedding sits inside cluster B but labelled A
(label–neighbourhood disagreement). Easiest to catch.
Wrong target sequenced (parasite/contaminant) — image says fly, DNA says
wasp (cross-modal disagreement). The multimodal angle is uniquely good here.
Genuine novelty — a coherent detached sub-cluster; a feature, not an error.

DNA gives a near-ground-truth reference (BIN system). A cheap first pass needs no
embeddings: flag specimens whose label disagrees with their BIN, or whose barcode
distance to conspecifics is abnormal (Figures 14–16). Embeddings earn their keep
for the cross-modal case. Caveats: image-based flags are noisy at species
(look-alikes); use a label-free off-the-shelf encoder to avoid circularity; treat
it as triage, not auto-correction.

Q: (Figures 14–16) — pairwise barcode distance distributions.

Box plots of within-group Damerau–Levenshtein (edit) distances:

Fig 14 (by class): medians ~150–200, long outlier tails (diverse classes
like Copepoda, Malacostraca).
Fig 15 (by order): medians come down and tighten.
Fig 16 (by species, 100 bins): the curve collapses toward zero almost
immediately — conspecific barcodes are nearly identical.

Monotonic story: class ≫ order ≫ species in internal distance, bottoming near
zero at species. That’s why barcoding works and why DNA crushed images at fine
ranks. The high-distance left-edge species bins and extreme outliers are the
actionable suspect residue.
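
For readers who want to reproduce this kind of plot on their own data, here is a minimal sketch of a within-group pairwise edit distance. It uses the optimal string alignment variant of Damerau–Levenshtein (insertions, deletions, substitutions, adjacent transpositions), which may differ in detail from the implementation behind Figures 14–16. The toy sequences are invented.

```python
from itertools import combinations
from statistics import median


def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def median_pairwise_distance(sequences):
    """Median edit distance over all pairs within one group."""
    return median(osa_distance(x, y) for x, y in combinations(sequences, 2))


# Made-up "conspecific" barcodes: two identical, one differing at one site.
species_group = ["ACGTAC", "ACGTAC", "ACGTAT"]
print(median_pairwise_distance(species_group))  # 1
```

The all-pairs loop is quadratic in group size, so for large groups you would sample pairs rather than enumerate them.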

Q: Can the “suspect list” be extracted easily?

Yes — the easiest version needs no sequences or distances, just a BIN ↔ species
cross-tabulation on two metadata columns:

One BIN → many species names = misID / inconsistent naming (the strongest
flag; the inverse of what curators caught manually).
One species → many BINs = cryptic complex, lumped label, or contaminant.

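import pandas as pd

# Load only the columns needed; drop records lacking a species name or a BIN.
df = pd.read_csv("BIOSCAN_5M_Insect_Dataset_metadata.csv",
                 usecols=["species", "dna_bin", "processid"]).dropna(
                 subset=["species", "dna_bin"])
# One species spread over many BINs: cryptic complex, lumped label, or contaminant.
bins_per_species = df.groupby("species")["dna_bin"].nunique().sort_values(ascending=False)
# One BIN carrying many species names: misidentification or inconsistent naming.
species_per_bin = df.groupby("dna_bin")["species"].nunique().sort_values(ascending=False)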

A second tier (Figure 16 signal) uses within-species edit distance, but
distance-to-consensus or BIN-count is far cheaper. Caveats: strip placeholder
names first; ignore singletons (require ≥3); it’s triage, not truth.
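
The caveats can be folded into the filter itself. The sketch below uses plain Python so it runs without the metadata CSV. Several things here are assumptions: treating any name containing a digit as a placeholder, reading "require ≥3" as "at least three named records in the BIN", and the invented species names and BIN identifiers in the toy rows.

```python
from collections import defaultdict


def has_digit(name):
    """Hypothetical placeholder test: any name containing a digit."""
    return any(ch.isdigit() for ch in name)


def suspect_bins(records, min_records=3, is_placeholder=has_digit):
    """Return BINs that carry more than one non-placeholder species name.

    `records` is an iterable of (species, dna_bin) pairs. BINs with fewer
    than `min_records` named records are ignored as too thin to judge.
    """
    species_in_bin = defaultdict(set)
    records_in_bin = defaultdict(int)
    for species, dna_bin in records:
        if not species or not dna_bin or is_placeholder(species):
            continue
        species_in_bin[dna_bin].add(species)
        records_in_bin[dna_bin] += 1
    return {
        dna_bin: sorted(names)
        for dna_bin, names in species_in_bin.items()
        if len(names) > 1 and records_in_bin[dna_bin] >= min_records
    }


# Made-up toy rows, not real BIOSCAN records.
rows = [
    ("Genus alpha", "BIN-A"), ("Genus alpha", "BIN-A"), ("Genus beta", "BIN-A"),
    ("Genus gamma", "BIN-B"), ("Genus gamma", "BIN-B"), ("Genus gamma", "BIN-B"),
    ("Genus sp. 1", "BIN-B"),                      # placeholder, skipped
    ("Genus alpha", "BIN-C"), ("Genus beta", "BIN-C"),  # only two records
]
flagged = suspect_bins(rows)
print(flagged)  # {'BIN-A': ['Genus alpha', 'Genus beta']}
```

The real placeholder convention should be checked against the dataset documentation before trusting the digit heuristic.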

(Note: the uploaded ZIP held the paper + figures, not the 5M-record metadata, so
this recipe is for the metadata CSV from the dataset page.)

Q: Summarise contrastive learning as used here.

Teach the encoders that a specimen’s modalities belong together and different
specimens belong apart — pull matched pairs close, push mismatched apart.

Per specimen, embed image and DNA. The positive pair is the same specimen’s
image+DNA (rewarded close); negatives are that image with other specimens’
DNA (pushed apart). Over millions of specimens, two independently-pretrained
encoders get dragged into one shared space where “near” = “biologically
similar,” regardless of modality — enabling cross-modal queries.

Supervision is structural, not labelled (positive = same physical
specimen), so it runs on unlabelled pretrain data.
Explains the earlier paradox: the DNA acts as a coordinate frame teaching the
image space a finer organisation than pixels gave it.
Loss is NT-Xent (InfoNCE family); big batches matter (here, 2000) because
more negatives = better contrast.
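
A toy version of the objective, for intuition only. This is a CLIP-style symmetric InfoNCE loss over one batch of image and DNA embeddings, written in plain Python. The temperature of 0.1, the two-dimensional embeddings, and the symmetric form are assumptions for the sketch, not details confirmed from the paper.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, dna_embs, temperature=0.1):
    """Symmetric InfoNCE: specimen i's image should match specimen i's DNA."""
    imgs = [normalise(v) for v in image_embs]
    dnas = [normalise(v) for v in dna_embs]
    logits = [[sum(a * b for a, b in zip(i, d)) / temperature for d in dnas]
              for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
dna_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
dna_shuffled = [dna_aligned[1], dna_aligned[2], dna_aligned[0]]

aligned = contrastive_loss(images, dna_aligned)
shuffled = contrastive_loss(images, dna_shuffled)
print(aligned < shuffled)  # True
```

Every other specimen in the batch acts as a negative, which is why a larger batch gives each positive pair more to be contrasted against.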

Versus the others: supervised maps input→label; masked-modelling SSL learns one
modality’s internal structure; contrastive learns relationships across samples
and modalities — the only one that can build a single image/DNA/text space.

Q: SSL — what is it actually learning?

SSL invents its own labels from the data. BarcodeBERT masks ~50% of the barcode
tokens and predicts them from context; the ground truth is just the hidden
letters. To win, the model must learn the statistical structure of barcodes —
which is what distinguishes lineages. (Image SSL does the analogue with pixels /
crops.)
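
The "invented labels" are easy to see in code. The sketch below tokenises a barcode into k-mers and hides half of them; the hidden tokens become the training targets. The non-overlapping tokenisation, `k=4`, the `[MASK]` string, and the 24-base fragment are illustrative assumptions rather than BarcodeBERT's exact settings.

```python
import random


def kmer_tokens(seq, k=4):
    """Split a sequence into non-overlapping k-mers, dropping any short tail."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_rate=0.5, seed=0):
    """Hide a fraction of tokens; return model inputs and the hidden targets."""
    rng = random.Random(seed)
    n_mask = round(len(tokens) * mask_rate)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = ["[MASK]" if i in masked_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets


barcode = "AACATTATACTTTATTTTCGGAAT"  # made-up 24-base fragment
tokens = kmer_tokens(barcode)
inputs, targets = mask_tokens(tokens)
print(inputs)
print(targets)
```

No taxonomic label appears anywhere: the only supervision is the sequence itself.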

The output isn’t a class — it’s an embedding, and training makes the
embedding space organised: similar sequences land near each other. Nobody
defined “species”; relatedness drives similarity, so taxonomic structure emerges.
This is why unlabelled data is good fuel, why zero-shot clustering works, and why
error detection works (structure built independently of possibly-wrong labels).

Two stages: SSL builds a label-free map of “what’s similar to what”; a small
supervised step names regions of it. The data-hungry part is unlabelled; labels
are spent only on naming.

Q: Is it like ordination on a sequence-distance matrix, just higher-dimensional?

Good analogy — same “sequence space” where proximity = similarity, and taxa
appear as clusters. The real difference isn’t dimensionality:

Ordination uses a fixed, hand-chosen distance (edit distance, K2P…),
treating every position by the same rule.
SSL learns a context-aware distance — discovering which positions and
combinations carry information — and embodies it in a reusable function you
can project new sequences into (an ordination is a one-off layout).
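
To show what a "fixed, hand-chosen distance" looks like, here is a minimal sketch of the Kimura two-parameter (K2P) distance, d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transition and transversion differences. It assumes the two sequences are already aligned and of equal length; the toy sequences are invented.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(a, b):
    """K2P distance between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(a, b):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")  # one transition in ten sites
print(round(d, 4))  # 0.1116
```

Every site is weighted by the same rule, whatever its position or neighbours. That uniformity is the thing a learned embedding is free to relax.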

Smaller distinctions: the transformer doesn’t need alignment (k-mer tokens +
learned position); the encoder generalises to unseen sequences immediately; and
the learned-embedding framing extends across modalities (one shared space), which
a single distance matrix can’t. The two would agree on coarse blobs and diverge
at the species boundary — exactly where a learned metric earns its keep.

Native sizes differ: ViT-B and BarcodeBERT are 768-d, BERT-small is 512-d
(and the paper notes HyenaDNA 128-d, NT 512-d elsewhere). They were pretrained
independently.

Alignment across different sizes uses a projection head per encoder — a small
learned layer mapping each native output to one common dimension (standard CLIP
design). So matching the final dimension is required; matching the encoders
is not.

But same size ≠ multimodal. Equal-length vectors only let you compute a
similarity; they don’t make it meaningful. Table 5’s ✗✗✗ baseline has
compatible sizes yet Image-to-DNA scores ~0% — the spaces were never aligned.
What makes it multimodal is the contrastive training that organises those
vectors into one shared, cross-comparable space (the 0% → ~90% jump). Dimension-
matching is plumbing; alignment is the thing. Corollary: you can add a modality
just by giving it an encoder + projection — no need to touch the others.

Q: What do “linear probe” and “1NN” probes mean (Table 4)?

Three ways to evaluate the same pretrained encoder, differing in how much you
may change it:

Fine-tuned — train the whole encoder + head on the task. Best accuracy
(~99%), but can’t tell whether pretraining or fine-tuning earned it.
Linear probe — freeze the encoder, train only a linear layer. Tests
whether the frozen embeddings are linearly separable → “how good are the raw
embeddings?” (DNABERT-S shines: 95.5%.)
1NN probe — no training at all; classify by nearest embedding. Purest test
of the embedding geometry.

Crucially the columns aren’t the same task: Fine-tuned and Linear probe are
closed-world seen-species; 1NN is open-world (“Unseen: Genus”) — assign an
unseen species’ barcode to the right genus, where you can’t train, so 1NN is
the only fair tool. That’s where the authors’ model leads dramatically (47.0% vs
~18–23%), because fine-tuned accuracy is saturated (BLAST alone hits 99.78%) and
doesn’t discriminate. Each probe strips away more help, so each is a stricter
test of what SSL baked in. (Table 6 uses the same three probes to pick the 8-4-4
architecture.)

Q: (Table 5 in full.)

Multimodal retrieval top-1 macro accuracy (%) on the test set. Each taxon block:
✗✗✗ baseline, then ✓✓✓ 1M, ✓✓✗ 5M, ✓✓✓ 5M. H.M. = harmonic mean of seen/unseen.

Taxon	Rec.	Img/DNA/Txt	DNA→DNA H.M.	Img→Img H.M.	Img→DNA H.M.
Order	—	✗✗✗	96.8	80.2	4.6
Order	5M	✓✓✓	100	96.8	94.4
Family	—	✗✗✗	91.2	53.8	0.4
Family	5M	✓✓✓	99.7	93.0	82.7
Genus	—	✗✗✗	86.2	32.9	0.0
Genus	5M	✓✓✓	99.2	85.0	56.1
Species	—	✗✗✗	79.8	16.6	0.0
Species	5M	✓✓✗	97.0	67.5	22.0
Species	5M	✓✓✓	97.0	67.1	20.3

(Condensed to H.M. columns; the full table also lists separate Seen/Unseen
values per group.) Three reads: ✗✗✗ → trained (Img→DNA 0 → high); 1M → 5M (more
data helps); ✓✓✗ vs ✓✓✓ at 5M (text helps coarse, slightly hurts species).

Q: Summary — DNA beats photos of dead insects under ring lights?

Faithful, and not cynical. DNA→DNA sits at 97–100% across ranks; image→image tops
out in the 60s at species; the cross-modal trick helps because DNA lends signal
to images, not vice versa.

Two refinements:

“DNA beats photos” is almost a tautology — a barcode is designed to separate
species. The interesting part is how far images can be dragged up the ranks
by alignment (usable at order/family).
DNA wins on accuracy but loses on cost and scale. Photos are cheap and
scalable; sequencing isn’t. The contribution is a principled way to spend the
expensive modality to make the cheap one better.

Bumper sticker: for telling species apart, sequence beats picture, and the
clever bit is using the sequence to teach the picture.

Q: Generous conclusion — a DNA+image model identifies new image samples better than image-only?

First half exactly right: identifying insect images improves if you train on
paired images + DNA for the same specimen.

One slip: DNA is a training ingredient, not a runtime input. In the
photo-ID scenario (Img→Img) there’s no DNA at inference — the improvement is baked
into the image encoder’s weights, then DNA is discarded. So you ship an
image-only model that was trained with DNA alignment. You never sequence the
new sample; you pay the DNA cost once on the training set.

(If you did have DNA for new specimens, you’d just run DNA→DNA at ~97% and skip
the photo — so the runtime-multimodal case isn’t really “identify new photos.”)

Precise version:

Two footnotes: the comparison is against a raw image encoder, not an
identically-fine-tuned image-only one, so the DNA-specific contribution isn’t
cleanly isolated; and “new sample” holds within the same imaging setup — a field
photo is an untested distribution.

Q: Do the authors themselves clearly state that conclusion?

They state the building block, but more cautiously and more narrowly than the
packaged conclusion — and they never frame it as “identify previously unseen
images.” Our version is a fair extrapolation, not a liftable sentence.

The closest the paper comes is in the §4.4 Results: the full model “improves
classification accuracy for same-modality queries and enables cross-modality
queries. By aligning to DNA, our image embeddings are able to capture finer
details.” That clause endorses the mechanism we reached. But three gaps remain:

"Same-modality queries," not “unseen images.” They claim Image-to-Image
accuracy rises (and the unseen numbers do rise in Table 5), but never distil it
into a headline like “DNA+image training improves identification of new images.”
That synthesis is ours.
No deployment claim. The “throw the DNA away at inference / never sequence
the new sample” framing — the practically important part — is true of their
setup but left implicit. They describe a retrieval system, not a field tool.
Baseline is the raw encoder. They compare the aligned model to “the initial
pretrained encoders before multimodal contrastive learning,” i.e. a pretrained
but unaligned ViT, not an image-only model fine-tuned the same way. So “DNA
specifically causes the gain” stays our careful inference, not their stated result.

The formal Conclusion (§5) is more reserved still — dataset-paper language: it
lists “multimodal classification combining DNA, images, and taxonomic labels” as a
benchmark “demonstrating real-world applicability,” and stops. No sentence singles
out “DNA improves image classification” as a takeaway.

Net: they clearly state the building block (aligning to DNA lets image
embeddings capture finer detail and raises same-modality accuracy, including on
unseen species); they do not state the packaged conclusion (DNA+image training
yields a better image-only identifier for new samples, no sequencing at
deployment). The pieces and the mechanism are theirs; the finished sentence is the
reader’s.

Q: Do they state anything like it in the abstract?

No — the abstract is even more neutral than the body. Its only sentence on the
multimodal experiment: the contrastive learning “yields a general shared
embedding space enabling taxonomic classification using multiple types of
information and modalities.”

That’s a capability claim — the dataset enables a shared space and cross-modal
classification. It says nothing about improvement, nothing comparing DNA to
images, nothing about one modality teaching the other, and nothing about unseen
images. Even the “aligning to DNA → finer image detail” line from §4.4 doesn’t
make it into the abstract.

The claim weakens as you move outward toward what most people read:

Table 5 / Results (innermost): concrete — image accuracy rises, DNA
alignment sharpens image embeddings, reported on seen and unseen species.
§5 Conclusion: softer — multimodal classification as a benchmark
“demonstrating real-world applicability.”
Abstract (outermost): softest — just “enabling taxonomic classification
using multiple modalities.”

This is the genre, not evasion: a dataset paper advertises capability and leaves
interpretation of the numbers to the reader.

A single honest line for citation: the paper demonstrates that contrastively
aligning images with DNA produces a shared space in which same-modality image
retrieval improves over the raw encoder (including on unseen species) — but it
does not claim, in the abstract or conclusion, that DNA-trained models are a
better image-only identifier for new samples. That packaging is a defensible
inference from Table 5, not their stated conclusion.

Wednesday, May 13, 2026

A new way to view the Tree of Life

One of the grand challenges of comparative biology is to assemble the [“tree of life”](https://en.wikipedia.org/wiki/Tree_of_life_(biology)), a diagram that connects all species in a single structure (let’s leave aside for now the question of whether a tree is actually the best representation). My goal here is to outline a way of navigating the tree of life, specifically the Open Tree of Life.

Given a tree with some 2 million species, the obvious question is how can we visualise it? There are several projects that can accommodate trees of this size, such as Vienne’s LifeMap, Rosindell’s OneZoom, and Taxonium. Each of these viewers is impressive in its own way, but in my opinion each has problems. LifeMap treats the tree as a static structure in 2D space and uses tiles to enable the user to zoom in and out in the same way we navigate a digital map. Because trees are mostly empty space it is easy to get lost. OneZoom uses an almost hypnotic fractal tree layout, coupled with zooming in and out - a similar approach to LifeMap but with a different way to render the tree. It is fun, but the fractal pattern distorts aspects of the tree. Taxonium takes a different approach: the complete tree is rendered in 2D and is uniformly zoomed on the y-axis, stretching it out.

None of these projects has felt satisfactory to me. They often don’t use the screen area efficiently, labels can be hard to read, and they treat tree visualisation as simply scaling or stretching a fixed layout. Open Tree of Life itself has a viewer that tries a different approach to showing the tree, collapsing various nodes, but it feels clunky in comparison to the other viewers. This is a pity, because the Open Tree of Life is a fascinating project, a supertree that is regularly(ish) updated with new phylogenies, and which links to evidence for each node in that tree.

For a while I’ve been exploring a method called summary trees to display large trees, such as taxonomic classifications (based on work by Karloff and Shirley). The key feature of a summary tree is that you collapse a tree to a specified number of nodes (or leaves), which means you can ensure that the tree fits into your display space, and hence that all labels are legible. The trick is to figure out what nodes to collapse. I’ve used the approach of Libin et al. that partitions a tree based on a score given to each node.

This is a nice idea, but if you fit the tree of life into a browser window say, 30 lines high, then how do you see the rest of the tree? One approach would be to treat growing the tree as a form of zooming, so that one level of zoom would grow the tree to twice the size, and so on, and you would then have to pan to see the whole tree. I think this has potential for individual phylogenies, but for really big trees you just end up getting lost.

Instead, what if you clicked on a node in the tree and that node became the root of a new tree that you could explore, and that tree would be guaranteed to fit in your window? So you browse through the tree, making different parts fan out or collapse as needed.

This seemed appealing, but animating the transition between trees felt rather beyond my programming skills… so I asked ChatGPT and Claude for help. Part of the challenge of problem solving is understanding what the actual problem is. ChatGPT introduced me to the idea of a “transition scene” where you have the before and after trees, and you compute how one transforms into the other. Claude Code made this a reality, and now I could smoothly navigate around the tree. Obviously, starting at the root of the whole tree every time would get tedious, so I added a simple search tool to find a node in the tree to start from.

So we have the notion of collapsing a tree to a given size (summary trees), a way to decide what nodes to collapse (a combination of a scoring scheme and a priority queue), and we use transition scenes to move between trees. You can see the result of all this here: https://iphylo.org/ott-viewer.
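
To give a flavour of the "scoring scheme plus priority queue" idea, here is a minimal sketch that grows a summary of a tree until a display budget is reached. It is not the code behind the viewer, nor the algorithm of Karloff and Shirley or of Libin et al. The score (number of descendant leaves), the function names, and the toy tree are all assumptions for illustration.

```python
import heapq


def count_leaves(tree, node):
    """Number of leaves below a node (a leaf counts as 1)."""
    children = tree.get(node, [])
    if not children:
        return 1
    return sum(count_leaves(tree, child) for child in children)


def summary_leaves(tree, root, max_leaves):
    """Expand the highest-scoring nodes first while the display budget allows.

    Returns the nodes to draw as tips, in left-to-right order. Internal
    nodes that remain in the list are shown collapsed.
    """
    frontier = [root]
    heap = [(-count_leaves(tree, root), root)]  # max-heap via negated score
    while heap:
        _, node = heapq.heappop(heap)
        children = tree.get(node, [])
        if not children:
            continue
        if len(frontier) - 1 + len(children) > max_leaves:
            continue  # expanding this node would overflow the window
        position = frontier.index(node)
        frontier[position:position + 1] = children
        for child in children:
            if tree.get(child):
                heapq.heappush(heap, (-count_leaves(tree, child), child))
    return frontier


# Toy tree: each key maps to its list of children.
tree = {
    "Life": ["Bacteria", "Eukaryota"],
    "Eukaryota": ["Plants", "Fungi", "Animals"],
    "Animals": ["Insects", "Vertebrates"],
    "Insects": ["Flies", "Beetles", "Wasps"],
    "Vertebrates": ["Birds", "Mammals"],
}
print(summary_leaves(tree, "Life", 4))
# ['Bacteria', 'Plants', 'Fungi', 'Animals']
```

Clicking on a collapsed node such as "Animals" then amounts to calling `summary_leaves` again with that node as the root, so the new tree is again guaranteed to fit the window.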

Having got a browseable tree working, the next issue is how do you go “back”, and what does “going back” even mean? We can wire up the browser’s back button to take you back to the previous tree, but I wanted something more. I’d come across a paper that described “Hoptrees”, which show your navigation history not as a simple linear list of where you have been, but arranged as a tree. This felt like a natural fit for navigating the tree of life, and hence above the tree you will see your navigation history as a simplified version of the larger tree.

As always there is more that could be done, but this feels like a natural stopping point. The tree browser works, and when I use it I spend less time thinking about the interface and more about the relationships in the tree, and that feels as it should be.

References

Brooks, M., West, J. D., Aragon, C. R., & Bergstrom, C. T. (2013). Hoptrees: Branching History Navigation for Hierarchies. In P. Kotzé, G. Marsden, G. Lindgaard, J. Wesson, & M. Winckler (Eds), Human-Computer Interaction – INTERACT 2013 (pp. 316–333). Springer. https://doi.org/10.1007/978-3-642-40477-1_20

Karloff, H., & Shirley, K. E. (2013). Maximum Entropy Summary Trees. Computer Graphics Forum, 32(3pt1), 71–80. https://doi.org/10.1111/cgf.12094

Libin, P., Vanden Eynden, E., Incardona, F., Nowé, A., Bezenchek, A., EucoHIV Study Group, Sönnerborg, A., Vandamme, A.-M., Theys, K., & Baele, G. (2017). PhyloGeoTool: Interactively exploring large phylogenies in an epidemiological context. Bioinformatics, 33(24), 3993–3995. https://doi.org/10.1093/bioinformatics/btx535

Page, R. D. (2012). Space, time, form: Viewing the Tree of Life. Trends in Ecology & Evolution, 27(2), 113–120.

Sanderson, T. (2022). Taxonium, a web-based tool for exploring large phylogenetic trees. eLife, 11, e82392. https://doi.org/10.7554/eLife.82392

De Vienne, D. M. (2016). Lifemap: Exploring the Entire Tree of Life. PLOS Biology, 14(12), e2001624. https://doi.org/10.1371/journal.pbio.2001624

Wong, Y., & Rosindell, J. (2022). Dynamic visualisation of million‐tip trees: The OneZoom project. Methods in Ecology and Evolution, 13(2), 303–313. https://doi.org/10.1111/2041-210X.13766

Monday, May 04, 2026

Alpha shapes and DNA barcoding

How to cite: Page, R. (2026). Alpha shapes and DNA barcoding. https://doi.org/10.59350/qx8j9-vam77

DNA barcoding generates a lot of specimen data with geographical coordinates (see for example Guest post: response to “Putting GenBank Data on the Map”). The question naturally arises: “how accurate are those coordinates?”.

Browsing the BOLD database using BOLD View I often come across sequences whose coordinates are labelled “Coordinates from country centroid”, so these may bear little relation to where the specimen was actually collected. But how can we assess the accuracy of other coordinates?

Inspired by a 2008 Flickr blog post The Shape of Alpha I decided to create plots of the distribution of geotagged specimens in the BOLD database, grouped by geographic level. For example, we could aggregate all points labelled as being from the country “India”, then subset those into points labelled as being from various regions within India, and so on down the geographic hierarchy implied by country, province, etc. Rather than plot all the points, I decided to summarise them using the same approach Flickr used: we enclose the points in an alpha shape. Below are examples for India.

The two maps differ in how closely the curve fits the points, which is determined by the value of alpha (α) used to compute the shape. The smaller the value the tighter the fit. The first map used α=0.3 and is fairly coarse; with α=0.1 the alpha shape skirts around Bangladesh, and is hence a better representation of the boundary of India.
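
As a rough illustration of how α controls the fit, here is a minimal pure-Python sketch of the classic “empty disc” test for the boundary edges of an alpha shape. It is not the code behind BOLD View (which relies on the GIS features in Postgres); the function name `alpha_shape_edges`, the toy points, and the treatment of coordinates as planar are all assumptions made for illustration. In this sketch α is the radius of the disc, so smaller values give a tighter fit.

```python
from itertools import combinations
from math import dist, sqrt


def alpha_shape_edges(points, alpha):
    """Return index pairs (i, j) that lie on the alpha shape boundary.

    An edge is kept if a disc of radius `alpha` can be placed through both
    end points without any other point falling strictly inside it.
    This brute-force version is O(n^3), so it only suits small examples.
    """
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        d = dist(points[i], points[j])
        if d == 0 or d > 2 * alpha:
            continue  # a disc of this radius cannot touch both points
        # Distance from the edge midpoint to each candidate disc centre.
        h = sqrt(max(0.0, alpha * alpha - (d / 2) ** 2))
        mx, my = (x1 + x2) / 2, (y1 + y2) / 2
        ux, uy = (y1 - y2) / d, (x2 - x1) / d  # unit normal to the edge
        for cx, cy in ((mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)):
            others = (p for k, p in enumerate(points) if k not in (i, j))
            if all(dist((cx, cy), p) >= alpha - 1e-9 for p in others):
                edges.add((i, j))
                break
    return edges


# A square with one point forming a notch in the bottom side (toy data).
points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 0.5)]
coarse = alpha_shape_edges(points, alpha=10.0)
tight = alpha_shape_edges(points, alpha=1.2)
```

With the large radius the outline is just the bounding square; with the smaller one it dips into the notch at point 4, much as the α=0.1 shape skirts around Bangladesh.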

The original Flickr blog post was showing how well geotagged photographs on Flickr were tracing out geographical areas. From my perspective, one reason to make these maps is to spot problematic records. For example, the map for Tasmania looks a bit strange. There are records on the Australian mainland, and Lord Howe and Macquarie Islands that clearly aren’t from “Tasmania”. Maybe the coordinates are wrong, maybe the placename is wrong? Either way, we now have some records to investigate.

This project is live on the BOLD View web site, it was mostly written using Claude Code, making use of the GIS features in Postgres. It is an example of how easy AI tools make it to do some quick exploration of an idea (in this case, something inspired by a blog post that is nearly twenty years old).

Wednesday, March 18, 2026

SimpleMappr is dead, long live SimpleMappr?

How to cite: Page, R. (2026). SimpleMappr is dead, long live SimpleMappr? https://doi.org/10.59350/20dk7-8ns92

David Shorthouse, perhaps best known for his fabulous Bionomia project, is also the author of SimpleMappr, a web site for generating publication-ready species distribution maps. These maps have appeared in many publications, and also pop up in iNaturalist.

David has announced that SimpleMappr will be turned off. Obviously not an easy decision for him to take, and sadly yet more evidence of the fragility of a lot of taxonomic infrastructure (as seen in the struggles of both BHL and TAXACOM).

I don’t use SimpleMappr, but I know that lots of people do, and so I wondered how easy it would be to create a new version (based on David’s code) that could be hosted either on a central site or on people’s own computers. The short answer is that it is “easy”, so long as your definition of “easy” includes (a) getting Claude Code to do the bulk of the work, and (b) ignoring most of the more sophisticated features of David’s app.

Long story short, I have a (somewhat) working version of SimpleMappr running on a cloud server at https://simplemappr.cloud, source code here.

As with my previous post, this project involved forking the original code, asking Claude to read it, and sketch out a way to move it to a more robust setting, in this case using Docker containers. Early days, but I am delighted how easy (for various values of “easy”) it is to breathe new life into old projects.

Using AI to revive a macOS app to preview GIS files

How to cite: Page, R. (2026). Using AI to revive a macOS app to preview GIS files. https://doi.org/10.59350/rb118-6m142

About a decade ago, when I was working with GIS files such as shapefiles, there was a nice QuickLook plugin for Macs called GISLook that would show you the corresponding map as an icon.

macOS keeps evolving, and as a result apps become obsolete unless they are continually updated. For small, solo developer projects, this often means the app no longer works. If the code is open source, as it is in this case, then potentially somebody can come along and revive the project. But realistically this can be a daunting prospect. I last wrote native macOS code about two decades ago, and a lot has changed.

Indeed, a lot has changed. With tools such as Claude Code, it is possible to point an AI at an old repository and, in effect, say, “build this, but for today’s Macs”.

In this case, I cloned the original repo, asked Claude to take a look, and then created a new repo rdmpage/gis-quicklook and Claude got to work. Of the original code, only the core file reading functions survive, the interface code has all gone. But after about a day’s messing about, I have a new app that has even more functionality because it supports the GeoJSON format as well.

You can get the app from the GitHub repo. Note that because it isn’t from the App Store you will need to run a one-time command in Terminal to get it to work:

xattr -cr GISLookApp.app

Here are four GIS files and their preview icons.

I should thank the original author, Bernie Jenny at Monash University in Melbourne. If you are at all interested in maps, globes, and cartography, you should look at his web page, it has some very cool stuff. I couldn’t have done this project without his open source (GPL 3.0) code.

Nor could I have done it without Claude Code. The level of debugging involved in this project was insane, there were log files flying past, Finder cache rebuilds, numerous dead ends and subtle “gotchas”, never mind the obstacle of learning how to support custom Finder icons and previews on a modern Mac.

This is the larger point, AI makes it possible, at least in principle, to look at an old, abandoned project, perhaps targeting an out of date API, and have a realistic chance of reviving it. That is a real game changer, made possible by a combination of open source and state of the art AI.

Tuesday, March 10, 2026

Using AI to understand a DNA barcoding mystery

How to cite: Page, R. (2026). Using AI to understand a DNA barcoding mystery. https://doi.org/10.59350/nbsfn-91m72

As I continue to add features to BOLD-View I keep coming across interesting cases where something seems not quite right with the data. Typically this prompts further investigation, which typically means going down a rabbit hole. As an example, take barcode GMAEA6199-22 which was collected by Donald Hobern in Australia and is identified to order level as Strepsiptera (I thank Donald for this example).

This barcode falls within a cluster which contains sequences labelled as either strepsipteran (in many cases identified down to species, Elenchus varleyi) or hemipteran. Almost all the strepsipteran sequences are assigned to a BIN BOLD:ACH2898. The hemipteran sequences are not assigned to a BIN, even though they are essentially identical to the strepsipteran sequences.

A new feature I’ve added to BOLD View’s display for a barcode is a map of all similar sequences, grouped by BINs. For this example the Strepsiptera BIN has a wide geographic range, including central America and South Africa, as well as Australia (through GMAEA6199-22 which is a strepsipteran but not included in the BIN).

So, we have a widespread BIN, likely to be Elenchus varleyi, which has identical sequences to those labelled as hemipteran. What is going on?

Now if you know anything about Strepsiptera you’ll know that they have a pretty bonkers biology, being endoparasites of insects, and male and females have quite different life histories. But as an exercise I decided to ask Claude to see what it made of this situation. Here is the prompt I used.

I also uploaded the tree picture you see above, and the paper “Descriptions and biological notes of the Australian Elenchidae (Strepsiptera)” doi:10.1071/IT9890175 which described Elenchus varleyi (I got the DOI for the paper from another of my projects, BioNames).

The paper is behind a paywall, and isn't on SciHub, but is on [ResearchGate](https://www.researchgate.net/publication/248899086), so I downloaded the PDF and used Marker to convert it to Markdown, which has become the default language of choice for sending documents to AIs.

Claude thought for a bit and then came back with a summary that basically said:

the mixture of strepsipteran and hemipteran sequences is likely due to the “hemipteran” barcodes actually being for their strepsipteran parasites rather than the hemipterans themselves.
given that one of the hosts of Elenchus varleyi is Sogatella kolophon, which is widespread, that would account for the occurrence of essentially identical DNA sequences across separate continents.

Claude noted that a single hemipteran can host multiple strepsipteran individuals, increasing the chance of mistakenly sequencing the parasite rather than the host. The picture below shows adult neotenic females ♀ of Elenchus varleyi in the abdomen of the hemipteran Sogatella kolophon (Fig. 14 from doi:10.1071/IT9890175).

Now, you and I could figure this out with some background biology and reading the paper, but the point here is that Claude could do it more or less automatically, which means we could potentally scale this to look at all “odd” or “anomalous” barcodes and get suggestions as to what has happened. Hence we have the possibility for large-scale database curation (see also “CurateGPT: A flexible language-model assisted biocuration tool” doi:10.48550/arXiv.2411.00046).

That said, we need some things in place. It helps to have links to information on the species and sequences involved, e.g. what paper(s) described the species, what paper(s) published the sequences. In this case I used BioNames to find the DOI for the paper describing Elenchus varleyi which was behind a paywall. That is likely to be an obstacle to this approach.

Out of curiosity I explored an alternative scenario, in part inspired by Plazi. Imagine that we can’t share the full text of the paper, but we do have access to the facts or claims the paper makes. Plazi doesn’t quite do this, it extracts images and the text of taxonomic treatments, but let’s imagine we could extract a series of triples from the paper, such as:

(Elenchus varleyi, parasitises, Sogatella kolophon)

Given a set of these we could potentially still infer the cause of the barcoding patterns we were seeing, but without needing access to the paywalled text (we’d need access to the text to extract the triples, but Claude wouldn’t need to see the text). Of course, this is not a new idea (Peter Murray-Rust has been advocating for this for a long time).

But what triples should we extract from a given paper? Out of interest I asked Claude what triples it would need to generate the answer it gave me to the Strepsiptera/Hemiptera example. Here is what it gave me:

Parasite–host relationships:

(Elenchus varleyi, parasitises, Sogatella kolophon)
(Elenchus varleyi, parasitises, Toya drope)
(Elenchus varleyi, host_family, Delphacidae)
(Sogatella kolophon, is_majority_host_of, Elenchus varleyi)

Life cycle and tissue intimacy (critical for the contamination argument):

(Elenchus varleyi female, is_obligate_endoparasite_of, adult host)
(Elenchus varleyi female, life_strategy, neotenic/larviform)
(Elenchus varleyi female, spends_entire_life_in, host body)
(Elenchus varleyi male, develops_as, endoparasitic larva)
(Elenchus varleyi female, extrudes_only_in, adult hosts)

Superparasitism (strengthens contamination argument):

(Elenchus varleyi, exhibits, superparasitism)
(superparasitism, maximum_observed, six parasitoids per host)
(superparasitism, frequency, common in Brisbane)

Dispersal mechanism:

(Elenchus varleyi female, dispersed_by, macropterous adult hosts)
(Elenchus varleyi larvae, dispersed_by, macropterous adult hosts)
(Elenchus varleyi male pupa, extrudes_in, both nymphs and macropterous adults)

Taxonomic and geographic scope:

(Elenchus, distribution, cosmopolitan)
(Elenchus, only_parasitises, Delphacidae)
(Elenchus varleyi, type_locality, Brisbane Queensland)
(Elenchus varleyi, distribution, Queensland NSW Canberra Northern Territory)

Supporting observation (supporting wide distribution argument):

(Perkins 1907, observed, Elenchus from widely separated countries show very little difference)

Now I think that some of these triples we could dispense with. I asked Claude and it said the following would be enough:

(Elenchus varleyi, parasitises, Sogatella kolophon)
(Elenchus varleyi female, is_obligate_endoparasite_of, adult host)
(Elenchus varleyi, host_family, Delphacidae)
(Elenchus varleyi female, dispersed_by, macropterous adult hosts)
(Elenchus, distribution, cosmopolitan)

These are fairly simple statements that could likely be extracted from the paper. That raises issues of how we do that, and how we express those triples. I asked Claude for suggested vocabularies, it mentioned Darwin Core and the Relations Ontology, among others. I am congenitally averse to big, verbose ontologies so I’d prefer something lightweight, and maybe Darwin Core would be enough?
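
To give a feel for how such a reduced set of triples might be used without the full text, here is a minimal Python sketch. It is only an illustration under stated assumptions: the triples are the five listed above stored as plain tuples, `ORDER_OF` is a hypothetical lookup from family to order that a real system would take from a taxonomy, and `host_orders_in_cluster` is a made-up helper, not anything Claude or BOLD View actually runs.

```python
# The reduced set of triples from the paper, as (subject, predicate, object).
TRIPLES = [
    ("Elenchus varleyi", "parasitises", "Sogatella kolophon"),
    ("Elenchus varleyi female", "is_obligate_endoparasite_of", "adult host"),
    ("Elenchus varleyi", "host_family", "Delphacidae"),
    ("Elenchus varleyi female", "dispersed_by", "macropterous adult hosts"),
    ("Elenchus", "distribution", "cosmopolitan"),
]

# Hypothetical family -> order lookup; assumed to come from a taxonomy.
ORDER_OF = {"Delphacidae": "Hemiptera"}


def objects(triples, subject, predicate):
    """Return every object asserted for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def host_orders_in_cluster(triples, parasite, cluster_orders, order_of=ORDER_OF):
    """List orders in a barcode cluster that match the parasite's known hosts.

    A non-empty result suggests that sequences labelled with those orders
    may really be the parasite sequenced from inside its host.
    """
    families = objects(triples, parasite, "host_family")
    host_orders = {order_of[f] for f in families if f in order_of}
    return sorted(host_orders & set(cluster_orders))


suspects = host_orders_in_cluster(
    TRIPLES, "Elenchus varleyi", ["Strepsiptera", "Hemiptera"]
)
```

For the cluster described above this flags the “Hemiptera” labels as candidates for host/parasite mix-ups, which is the explanation Claude arrived at from the full text.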

So, in summary, I’m encouraged by the way Claude suggested a plausible explanation for the pattern in the barcode tree, and that it might not always need access to full text to do so (although I suspect giving an LLM access to full text is likely to beat giving it a set of triples that might not encompass all the relevant information in the paper). This also gives me a further incentive to work on the problem of providing context for each barcode, especially the scientific papers that published the sequences, and the papers that published the taxonomic names.

But one problem still remains. How do we get all this information back into BOLD so that a user looking at these sequences knows what is going on, knows that "Hemiptera" doesn't mean "Hemiptera" in this case, and that what we are seeing is a case of a widespread insect host being infected by a widespread parasite, which was originally described from Australia? The ability to add annotations and third-party analyses will become crucial if people are to get the most out of DNA barcoding databases.

References

Caufield, H., Kroll, C., O’Neil, S. T., Reese, J. T., Joachimiak, M. P., Hegde, H., Harris, N. L., Krishnamurthy, M., McLaughlin, J. A., Smedley, D., Haendel, M. A., Robinson, P. N., & Mungall, C. J. (2024). CurateGPT: A flexible language-model assisted biocuration tool (arXiv:2411.00046). arXiv. https://doi.org/10.48550/arXiv.2411.00046

Kathirithamby, J. (1989). Descriptions and biological notes of the Australian Elenchidae (Strepsiptera). Invertebrate Taxonomy, 3(2), 175–195. https://doi.org/10.1071/it9890175

Sunday, February 15, 2026

GBIF Geocoder: using GBIF to find places on a map

How to cite: Page, R. (2026). GBIF Geocoder: using GBIF to find places on a map https://doi.org/10.59350/7g6pt-3mz06

I’ve relaunched a “toy” tool that I made a while ago to help geocode localities using GBIF. Geocoding converts a text string, such as “Cambodia: Ratanakiri Province” into latitude and longitude coordinates. For some reason, the biodiversity community typically refers to this as “georeferencing”, which is usually defined as locating an image of a map (see Wikipedia entry for georeferencing, and Allmaps for some great examples).

You can try GBIF Geocoder at https://rdmpage.github.io/gbif-geocoder/. Code is available on GitHub at https://github.com/rdmpage/gbif-geocoder.

The idea behind the “GBIF Geocoder” is that GBIF has a huge number of geocoded specimens, and hence if you are looking for coordinates for a locality there is a good chance that somebody has already found them. So, all we need to do is search GBIF for specimens with localities that match the place you are trying to geocode. I created a version of this tool in 2018, mentioning it in a blog post GBIF at 1 billion - what’s next?, and wrote it up in a short note in bioRxiv Geocoding genomic databases using GBIF.
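
The general idea can be sketched in a few lines of Python. This is not the code behind GBIF Geocoder (which is HTML and Javascript), just a hedged illustration: it assumes the GBIF occurrence search endpoint with its `q`, `hasCoordinate` and `limit` parameters, and that results carry `decimalLatitude` and `decimalLongitude` fields. The function names and the sample response are invented, and no network call is made here.

```python
from collections import Counter
from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(locality, limit=50):
    """Build a GBIF occurrence search URL for records that have coordinates."""
    params = {"q": locality, "hasCoordinate": "true", "limit": limit}
    return GBIF_SEARCH + "?" + urlencode(params)


def most_common_coordinates(response, decimals=2):
    """Count rounded (lat, lon) pairs in a GBIF-style search response."""
    counts = Counter()
    for record in response.get("results", []):
        lat = record.get("decimalLatitude")
        lon = record.get("decimalLongitude")
        if lat is None or lon is None:
            continue  # skip records without usable coordinates
        counts[(round(lat, decimals), round(lon, decimals))] += 1
    return counts.most_common()


url = build_search_url("Cambodia: Ratanakiri Province")

# Invented response with made-up coordinates, shaped like GBIF's JSON.
sample = {
    "results": [
        {"decimalLatitude": 13.7401, "decimalLongitude": 107.0002},
        {"decimalLatitude": 13.7399, "decimalLongitude": 106.9998},
        {"decimalLatitude": 14.5, "decimalLongitude": 107.5},
        {"locality": "record without coordinates"},
    ]
}
ranked = most_common_coordinates(sample)
```

Rounding before counting is a crude way to let near-identical coordinates vote together, so the most frequently used point for a locality floats to the top.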

The original version was hosted on Glitch, a wonderful platform where people could create pretty much anything using HTML and Javascript. Glitch is no more, so I’ve finally got around to rebuilding it, inspired by this post on Bluesky by Tapani Hopkins:

The original project used node.js, whereas I wanted something simple using just HTML and Javascript so it could be hosted using GitHub pages (or, indeed, on any other static hosting platform). I fired up Claude Code to help me with the port. I continue to be amazed at just how much fun this style of coding is, and the power of the tools. I make requests and suggestions, and Claude will fire up an instance of Google Chrome to check that the code works. I think a key feature of this style of programming is that it can reduce that initial hurdle when you know you need to make changes, and may even have made notes to yourself about what needs to be done, but there is still the initial tedium of reworking old code to work with a new platform, i.e., Googling questions, re-reading GitHub docs, etc. Instead, I get to focus on what I want to do, namely revive an old tool that I think people may find useful.

Wednesday, November 19, 2025

Model Context Protocol (MCP) and triple stores: natural language queries for knowledge graphs

Some quick notes based on experiments with Model Context Protocol (MCP) and [Claude](https://claude.ai).

Model Context Protocol (MCP) is all the rage right now, and I’ve been slow to take a look. Kingsley Idehen recently wrote The Semantic Web Project Didn’t Fail — It Was Waiting for AI (The Yin of its Yang) where he argued that Large Language Models (LLMs) provide (finally) a user-friendly way to query triple stores (i.e., knowledge graphs). Instead of simply presenting users with an empty SPARQL query box, we can now formulate a query in natural language and have AI convert that into SPARQL.

That eases the challenge of learning a new query language, but it gets better. MCP enables us to connect an AI with another service. It acts a bit like a broker. You tell the AI what you want to do, the AI talks to the MCP server to figure out how to do what you want, gets the results, then converts them into a natural language (or other format) result that you can use. Hence you can have a conversation with a knowledge graph!

There are examples of MCP servers that speak SPARQL, such as MCP Server SPARQL by Eric Zhu. Since I mostly program in PHP (gasp) version 7 (gasp) I ended up asking ChatGPT to help write a simple MCP server. There then followed a dance between ChatGPT and Claude where ChatGPT would very confidently declare that the code was done, and Claude would get increasingly exasperated that I appeared to be trying to do something that wasn’t working. I eventually had to tell Claude to back the f**k off with its snarky comments and maybe be more helpful in its messages. Eventually I got a simple server up and running.
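
To show roughly what such a server has to do, here is a minimal Python sketch of the request handling (my own server is written in PHP, so this is not that code). It assumes MCP's JSON-RPC 2.0 messages with the `tools/list` and `tools/call` methods, and leaves out the initialisation handshake and the transport entirely. The tool name `sparql_query` and the stubbed `run_sparql` function are hypothetical; a real server would send the query to a SPARQL endpoint such as Oxigraph.

```python
import json


def run_sparql(query):
    """Hypothetical stand-in for an HTTP request to a SPARQL endpoint."""
    return {"head": {"vars": []}, "results": {"bindings": []}}


# The single tool this sketch advertises to the AI client.
TOOLS = [
    {
        "name": "sparql_query",
        "description": "Run a SPARQL query against the knowledge graph.",
        "inputSchema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
]


def handle(request, sparql=run_sparql):
    """Answer one JSON-RPC request (already parsed into a dict)."""
    method = request.get("method")
    params = request.get("params", {})
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call" and params.get("name") == "sparql_query":
        data = sparql(params["arguments"]["query"])
        # Tool output goes back to the AI as text for it to interpret.
        result = {"content": [{"type": "text", "text": json.dumps(data)}]}
    else:
        error = {"code": -32601, "message": "Method not found"}
        return {"jsonrpc": "2.0", "id": request.get("id"), "error": error}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}


listing = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
```

The AI reads the tool description, writes the SPARQL itself, and sends it via `tools/call`; the server's only job is to run the query and hand back the results.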

The code php-mcp-server is very basic, but supports SPARQL queries running on an instance of Oxigraph that runs on my Mac. For example, I can ask:

and Claude will respond:

It will also show me the SPARQL queries it makes to find this information.

This feels like a game changer. The MCP server I’ve written is incredibly crude, but I can now start to query a knowledge graph about DNA barcodes and associated literature in plain English, and get back useful results.

What I really want to do is combine this with details on the actual papers (for example, lists of specimens sequenced, whether they are type specimens, where the samples were collected from, etc.) as a way to help curate databases such as BOLD. I recently released BOLD View (see blog post BOLD View: exploring DNA barcodes) to make it easier to explore DNA barcode data, and I’m fascinated by how much scope there is for curation to add taxonomic identifications, geographic location, etc.

To make this curation easier I’ve started to assemble a knowledge graph linking barcodes, GenBank sequences, and taxonomic names to the associated scientific literature, with the ultimate goal of being able to ask: “given this barcode that lacks a proper scientific name, is there anything in the published literature that can tell me what it actually is?”. The idea of being able to literally ask that question using a combination of an AI and an MCP server is very exciting.
