iPhylo: AI-assisted review of BIOSCAN-5M paper

Roderic D. M. Page

Friday, June 05, 2026

The following is a conversation with Claude walking through the paper BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity (arXiv:2406.12723, doi:10.48550/arXiv.2406.12723). This is a big paper with lots of models and lots of jargon (as is typical of machine learning papers), and I wanted to make sense of what the authors actually did, and what the practical outcome is. My gut instinct was that the photos provided were going to be next to useless for identifying insect species, and that DNA sequences are much better at that task, so what was the point? What I got from this conversation is that the paper shows that you can improve the ability to identify insects from images if you have a model trained on both images and DNA sequences together, even if the images themselves don't look all that promising.

Q: I want to make sense of this paper on BIOSCAN-5M.

What it is. A dataset, not a new model — over 5 million arthropod specimens
(98% insects), each described across multiple modalities, plus benchmark
experiments. It’s the bigger successor to BIOSCAN-1M (~1.1M insects).

Each specimen has:

A microscope image (originally 1024×768, plus cropped and 256px versions).
A DNA barcode — ~648 bp of the COI gene, stored as a nucleotide string.
Taxonomic labels across seven ranks (phylum → … → species).
Geographic data (country, province, lat/long) and a size estimate.

Two problems baked into the data:

Massively long-tailed. Diptera (flies) are ~50% of everything; 63% of
specimens are from Costa Rica; two-thirds of species have ≤6 samples.
Mostly unlabelled at fine ranks. Over 80% have no label below family —
reframed as an opportunity for self-supervised / semi-supervised learning.

The data splits partition by a “species set” idea:

Seen — proper species name; normal train/val/test.
Unseen — known genus, placeholder species; tests handling a new species.
Heldout — placeholder genus and species.
Unknown — no species label (~91%); goes to the pretrain split.

Splits are made at the barcode level so sequences don’t leak across train/test.
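
One way to picture a barcode-level split: every specimen that shares an identical barcode string is routed to the same partition, so no sequence can sit in both train and test. The following is a minimal sketch, assuming a hash-based assignment. The helper name `split_for_barcode`, the 80/10/10 fractions and the toy records are all illustrative assumptions, not the authors' procedure.

```python
import hashlib

def split_for_barcode(barcode, fractions=(("train", 0.8), ("val", 0.1), ("test", 0.1))):
    """Map a barcode string to a partition name, deterministically.

    Hypothetical helper: the fractions are illustrative, not the paper's.
    """
    digest = hashlib.sha256(barcode.encode("ascii")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    u = int(digest[:8], 16) / 16**8
    cumulative = 0.0
    for name, frac in fractions:
        cumulative += frac
        if u < cumulative:
            return name
    return fractions[-1][0]  # guard against floating-point round-off

# Toy records: (specimen id, barcode). Specimens s1 and s2 share a barcode.
records = [("s1", "ACGTAC"), ("s2", "ACGTAC"), ("s3", "TTGACA"), ("s4", "GGCATA")]
assignment = {sid: split_for_barcode(bc) for sid, bc in records}
```

Because the partition depends only on the barcode string, duplicates of a sequence always land together, which is the property the split is meant to guarantee.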

The three experiments:

DNA classification (BarcodeBERT). A BERT-style model pretrained on
barcodes beat bigger general-purpose genomic models; closed-world species ID
~99% (BLAST alignment gets 99.78%). The real win was open-world genus
assignment for unseen species (~47%).
Zero-shot clustering. Cluster pretrained embeddings without fine-tuning.
DNA embeddings crushed image embeddings; combining them did worse than DNA
alone because the strong DNA signal got diluted.
Multimodal alignment (CLIBD). CLIP-style contrastive training puts images,
DNA, and text in one shared space, enabling cross-modal queries. Bigger data
improved coarse ranks; species stayed hard.

One-line takeaway: DNA barcodes are by far the most informative signal for
telling insect species apart, and the dataset’s value is a huge, realistically
messy, multimodal playground for the “millions of mostly-unknown species”
problem.

Q: I’d prepared a Markdown version with figures — can you use it?

(Clarification: the earlier summary came from fetching the arXiv abstract page,
which served the LaTeX source — no local file was read. The uploads folder was
empty at first; a ZIP was then uploaded.)

The ZIP turned out to be the v6 (published) version, with a few differences
from the arXiv text: the repo moved to bioscan-ml/BIOSCAN-5M, “BIOSCAN-CLIP” is
now CLIBD, and 1.2M specimens are labelled to genus/species. It contained the
full Markdown plus 35 figure images.

Q: (Figures) — the look-alikes, the geography, the clustering, retrieval.

Figure 18 (Diptera look-alikes): each row is one genus, three species side
by side (e.g. Agromyza albipennis / frontella / idaeiana), nearly
indistinguishable in a dish photo. This is why image-only ID stalls at coarse
ranks and DNA is needed for species.
Figure 11 (geography): Costa Rica = 3,256,316 samples; next is South Africa
at 322,096 — a tenfold drop. One site dominates; a long tail of token coverage
elsewhere.
Figure 4 (zero-shot clustering AMI): image encoders slope down hard (DINO
~88% at class → ~10% at species); DNA encoders stay high (HyenaDNA ~97% → ~90%
at species/BIN). DNA wins at every rank and the gap explodes as ranks get finer.
(Correction: image encoders peak at the coarsest ranks and decline
monotonically — the older “best at family” claim came from a prior paper.)
Figure 9 (retrieval): colour-coded matches (green = same species, etc.).
DNA→DNA is almost all green; image→image picks up family/order-level matches.

Recurring theme: the DNA barcode carries almost all the species signal;
images are useful at coarse levels or as a queryable stand-in when DNA is absent.

Q: Show me Figure 2.

A 3×8 grid of original full-size specimen images, showing the diversity of the
collection (iridescent beetles, lacewings, ladybirds, wasps, ants, flies) — the
deliberate contrast to Figure 18’s look-alikes. Every specimen sits in a circular
dish on a pale background under controlled lighting, which is what makes the
cropping tool and pixel-count size estimate work. Telling a green beetle from a
yellow lacewing is the “easy” coarse-rank axis; the hard part only appears within
a genus.

Q: In Table 5, what are “query and key combinations,” and the “Aligned embeddings” ticks?

Query/key = the model works by retrieval, not direct classification. Keys
are a labelled reference database (each embedded); a query is the new specimen —
embed it, find the nearest key, copy that key’s label. “X-to-Y” = query type X,
keys type Y:

DNA-to-DNA — match a barcode against known barcodes.
Image-to-Image — match a photo against known photos.
Image-to-DNA — the cross-modal case; only possible because contrastive
training put images and DNA in one space. Baseline (no alignment) scores ~0%
here; after alignment it jumps to ~50–64% at genus — that delta is the point.
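
The query/key idea can be sketched in a few lines: embed the query, find the most similar key, copy its label. This is a toy illustration only. The 3-d vectors are made up, and cosine similarity is an assumption here (it is the usual choice for CLIP-style spaces, but the paper's exact metric is not quoted above).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_label(query, keys):
    """Return the label of the key embedding most similar to the query.

    keys: list of (embedding, label) pairs, i.e. the labelled reference set.
    """
    _, best_label = max(keys, key=lambda kv: cosine(query, kv[0]))
    return best_label

# Toy 3-d embeddings (made up); real ones come from the image or DNA encoder.
keys = [
    ([0.9, 0.1, 0.0], "Diptera"),
    ([0.1, 0.9, 0.1], "Coleoptera"),
    ([0.0, 0.2, 0.9], "Hymenoptera"),
]
query = [0.8, 0.2, 0.1]
predicted = retrieve_label(query, keys)
```

Nothing in the function cares which encoder produced the query or the keys, which is why cross-modal retrieval only becomes meaningful once both encoders write into the same aligned space.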

Aligned-embeddings ticks (Img / DNA / Txt) = which modalities were pulled
together during training (✓ = included). Each taxon block has four rows:
✗✗✗ baseline, ✓✓✓ at 1M, ✓✓✗ at 5M, ✓✓✓ at 5M. Reading ticks against numbers
gives the three claims: alignment helps hugely; more data helps; adding text
helps coarse ranks but slightly hurts species (only ~9% of records have species
labels).

Q: So images alone are poor at species level but improve when aligned with DNA?

Yes, with two refinements:

Alignment is at training time; you don’t need DNA at inference.
Image-to-Image species H.M. goes 16.6% (raw) → 67.5% (aligned). The DNA
reshaped the image space during training, then is gone at deployment.
“More accurate” ≠ “good” at species. 67.5% still trails DNA-to-DNA (~97%),
and unseen species lag seen (60.8 vs 75.9).

The ✓✓✗ row (image+DNA, no text) captures almost the whole gain, so DNA — not
text — is the driver of the image improvement.
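
The H.M. figures quoted above are harmonic means of the seen and unseen accuracies. As a quick arithmetic check, the two species-level Image-to-Image values (75.9 seen, 60.8 unseen) do combine to the 67.5% headline. The function name is just for illustration.

```python
def harmonic_mean(seen, unseen):
    """Harmonic mean of two accuracies (percentages)."""
    if seen + unseen == 0:
        return 0.0
    return 2 * seen * unseen / (seen + unseen)

# Species-level Image-to-Image accuracies quoted above.
hm = harmonic_mean(75.9, 60.8)
print(round(hm, 1))  # 67.5
```

The harmonic mean sits closer to the lower of the two numbers than an arithmetic mean would, so a model cannot hide weak unseen-species performance behind strong seen-species performance.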

Q: Were unseen sequences used in any training?

Mostly no — by design. Unseen species go to val_unseen / test_unseen /
key_unseen; CLIBD trains on pretrain + train only. Neither unseen labels nor
unseen barcodes are trained on (seen/unseen are de-duplicated by barcode).

The wildcard: pretrain is the unknown set, which “may truly belong in any
of the other three categories,” and barcode de-dup is described only for
seen/unseen. So an individual that is biologically an unseen species could sit
unlabelled in pretrain. Clean label hold-out; slightly leakier sequence
hold-out (the DNA encoder also had upstream BOLD pretraining).

Why seen/unseen differ though both are species-level: (1) exposure — seen
species shaped the embedding space; unseen require generalisation; (2)
partition composition — unseen partitions are smaller and skewed toward
Hymenoptera/Lepidoptera, so the macro-averages aren’t head-to-head. At coarse
ranks unseen sometimes scores higher; only at species does the exposure penalty
make unseen < seen.

Q: Why does DNA + images improve DNA-only querying, when images are poor at barcodes?

“Poor at classifying barcodes” and “useful as a training signal” measure
different things.

The contrastive loss pairs each specimen’s DNA embedding with that same
specimen’s image — there’s no DNA-to-DNA positive pair. So the DNA encoder’s
signal is “sit where the image says this specimen sits, away from others.” The
image is a shared coordinate frame. DNA-to-DNA improves partly because the space
gets reorganised around specimen identity (a retrieval-shaped objective)
instead of masked-language-modelling token statistics.

Image-to-DNA scoring ~20% at species is a point-to-point test; it says nothing
about whether images provide useful aggregate grouping during training. Images
are a fuzzy-but-correlated teacher: roughly right at coarse structure, vague at
species. Averaged over millions of pairs, that regularises the geometry; the DNA
fills in the fine detail the image couldn’t.

Caveats: no “DNA fine-tuned contrastively with no image partner” row exists,
so “image contribution” can’t be cleanly isolated from “contrastive fine-tuning
in general”; and pulling DNA toward the weaker image space can slightly cost
precision at the finest ranks.

Q: So to improve an image species-classifier, align image and DNA embeddings?

Yes — the practical takeaway. Pay the DNA cost once at training; deploy a better
image-only model. Evidence: Image-to-Image species 16.6% → 67.5%.

Three bounding conditions:

You need paired data (photo + barcode for the same specimen). Can’t
retrofit onto a photo-only archive.
The lift is real but the ceiling is low at species (~67%); alignment makes
images genuinely strong only at coarse ranks (low-90s).
Attribute carefully — much of the jump is “contrastive in-domain training
at all”; DNA is the best partner, not provably the sole cause.

Recipe: get a paired image+barcode set for your taxa, fine-tune the image encoder
contrastively against the DNA, then ship the image model alone. Pilot on your own
imaging conditions first.

Q: So it’s hard to generalise beyond the images this system collects?

Two separate questions:

Across species, same imaging setup: reasonably supported (that’s what
seen/unseen tests).
Across imaging conditions: essentially untested, and that’s the real gap.
Every specimen is a posed, dead arthropod in a standardised dish, fixed scale,
consistent lighting. The size features and cropping tool all depend on that. A
field photo (live insect, clutter, occlusion, unknown scale) is a different
distribution on every axis, and nothing in the benchmark measures the jump.

The alignment also transfers a specific signal: the model learned the
“lab-visible correlates of species,” not “species” itself.
Strong recipe for specimen-imaging pipelines; speculative for in-the-wild
ID. Not a flaw — it’s a dataset/benchmark paper, upfront about standardised
imaging and geographic skew.

Q: They had geographic data (Fig 10) but didn’t use it — interesting.

Right — Figure 10 shows full geolocation (country/province/lat/long) per record,
yet none of the three experiments use it.

As a predictive feature it’s treacherous: 63% Costa Rica means a model would
learn collection base rates, not biology — inflated benchmarks, collapse on
deployment. A defensible reason to exclude it.
As scientific signal it’s exactly what you’d want: species have ranges;
geography can break ties between look-alikes that don’t co-occur, and powers
range-mapping / invasive detection — the dataset’s stated motivations.

Best read as a deliberate boundary (geography isn’t an encoder-friendly modality
for the contrastive story) rather than an oversight; the skew is flagged as a
limitation. A sharp follow-up: “does range data disambiguate look-alikes without
just memorising collection sites?”

Q: Could embeddings flag labelling errors (e.g. a bug inside a fly species)?

Yes — the zero-shot clustering AMI already measures label/structure agreement;
the disagreements are the candidate-error list. Three cases with different
signatures:

Misidentification — embedding sits inside cluster B but labelled A
(label–neighbourhood disagreement). Easiest to catch.
Wrong target sequenced (parasite/contaminant) — image says fly, DNA says
wasp (cross-modal disagreement). The multimodal angle is uniquely good here.
Genuine novelty — a coherent detached sub-cluster; a feature, not an error.

DNA gives a near-ground-truth reference (BIN system). A cheap first pass needs no
embeddings: flag specimens whose label disagrees with their BIN, or whose barcode
distance to conspecifics is abnormal (Figures 14–16). Embeddings earn their keep
for the cross-modal case. Caveats: image-based flags are noisy at species
(look-alikes); use a label-free off-the-shelf encoder to avoid circularity; treat
it as triage, not auto-correction.
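
The first case (label–neighbourhood disagreement) could be triaged roughly as follows: for each specimen, compare its label with the majority label among its nearest neighbours in embedding space. This is a sketch under stated assumptions. The function name, `k=3`, Euclidean distance and the toy 2-d points are all illustrative; real embeddings would come from a label-free encoder as noted above.

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def flag_label_disagreements(embeddings, labels, k=3):
    """Return indices whose label differs from the majority label of their
    k nearest neighbours (the specimen itself is excluded)."""
    flagged = []
    for i, e in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        nearest = sorted(others, key=lambda j: euclidean(e, embeddings[j]))[:k]
        majority, _ = Counter(labels[j] for j in nearest).most_common(1)[0]
        if majority != labels[i]:
            flagged.append(i)
    return flagged

# Two tight toy clusters; index 3 sits in the first cluster but is labelled "B".
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
]
labels = ["A", "A", "A", "B", "B", "B", "B", "B"]
suspects = flag_label_disagreements(embeddings, labels)
```

The output is a list of candidates for a human to inspect, in keeping with the triage-not-auto-correction caveat.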

Q: (Figures 14–16) — pairwise barcode distance distributions.

Box plots of within-group Damerau–Levenshtein (edit) distances:

Fig 14 (by class): medians ~150–200, long outlier tails (diverse classes
like Copepoda, Malacostraca).
Fig 15 (by order): medians come down and tighten.
Fig 16 (by species, 100 bins): the curve collapses toward zero almost
immediately — conspecific barcodes are nearly identical.

Monotonic story: class ≫ order ≫ species in internal distance, bottoming near
zero at species. That’s why barcoding works and why DNA crushed images at fine
ranks. The high-distance left-edge species bins and extreme outliers are the
actionable suspect residue.
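
To make the within-group distance idea concrete, here is a small sketch that computes the median pairwise edit distance inside a group of sequences. It uses the restricted (optimal string alignment) form of Damerau–Levenshtein, which may not match the exact variant behind the figures, and the short toy strings stand in for real ~648 bp barcodes.

```python
from itertools import combinations
from statistics import median

def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    insertions, deletions, substitutions and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def median_within_group_distance(seqs):
    """Median of all pairwise distances in a group (needs at least 2 sequences)."""
    return median(osa_distance(a, b) for a, b in combinations(seqs, 2))

# Toy groups: a tight "species" and a looser "order" (made-up 8-base strings).
groups = {
    "species_x": ["ACGTACGT", "ACGTACGT", "ACGTACGA"],
    "order_y": ["ACGTACGT", "TTGTACCA", "ACCAACGG"],
}
medians = {name: median_within_group_distance(seqs) for name, seqs in groups.items()}
```

The dynamic programme is quadratic in sequence length and the pairing is quadratic in group size, which is why the cheaper BIN-based checks are the more practical first pass on millions of records.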

Q: Can the “suspect list” be extracted easily?

Yes — the easiest version needs no sequences or distances, just a BIN ↔ species
cross-tabulation on two metadata columns:

One BIN → many species names = misID / inconsistent naming (the strongest
flag; the inverse of what curators caught manually).
One species → many BINs = cryptic complex, lumped label, or contaminant.

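import pandas as pd
# Keep only records that have both a species name and a BIN.
df = pd.read_csv("BIOSCAN_5M_Insect_Dataset_metadata.csv",
                 usecols=["species", "dna_bin", "processid"]).dropna(
                 subset=["species", "dna_bin"])
# One species -> many BINs (cryptic complex, lumped label, or contaminant).
bins_per_species = df.groupby("species")["dna_bin"].nunique().sort_values(ascending=False)
# One BIN -> many species names (misID / inconsistent naming).
species_per_bin = df.groupby("dna_bin")["species"].nunique().sort_values(ascending=False)
# Counts above 1 form the suspect list.
suspects = species_per_bin[species_per_bin > 1]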

A second tier (Figure 16 signal) uses within-species edit distance, but
distance-to-consensus or BIN-count is far cheaper. Caveats: strip placeholder
names first; ignore singletons (require ≥3); it’s triage, not truth.

(Note: the uploaded ZIP held the paper + figures, not the 5M-record metadata, so
this recipe is for the metadata CSV from the dataset page.)

Q: Summarise contrastive learning as used here.

Teach the encoders that a specimen’s modalities belong together and different
specimens belong apart — pull matched pairs close, push mismatched apart.

Per specimen, embed image and DNA. The positive pair is the same specimen’s
image+DNA (rewarded close); negatives are that image with other specimens’
DNA (pushed apart). Over millions of specimens, two independently-pretrained
encoders get dragged into one shared space where “near” = “biologically
similar,” regardless of modality — enabling cross-modal queries.

Supervision is structural, not labelled (positive = same physical
specimen), so it runs on unlabelled pretrain data.
Explains the earlier paradox: the DNA acts as a coordinate frame teaching the
image space a finer organisation than pixels gave it.
Loss is NT-Xent (InfoNCE family); big batches matter (here, 2000) because
more negatives = better contrast.
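
The pull-together / push-apart objective can be written down compactly. Below is a minimal sketch of a CLIP-style symmetric InfoNCE loss over one batch, where row i of each list is the same specimen. The temperature of 0.1 and the toy 2-d vectors are assumptions, and the paper's exact NT-Xent formulation may differ in detail.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalise(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(image_embs, dna_embs, temperature=0.1):
    """Symmetric InfoNCE: row i of each list belongs to the same specimen."""
    imgs = [normalise(v) for v in image_embs]
    dnas = [normalise(v) for v in dna_embs]
    n = len(imgs)
    # sims[i][j] = scaled cosine similarity of image i and DNA j.
    sims = [[dot(imgs[i], dnas[j]) / temperature for j in range(n)] for i in range(n)]

    def row_loss(logits, target):
        # Cross-entropy of a softmax over logits, computed stably.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_sum - logits[target]

    img_to_dna = sum(row_loss(sims[i], i) for i in range(n)) / n
    dna_to_img = sum(row_loss([sims[i][j] for i in range(n)], j) for j in range(n)) / n
    return (img_to_dna + dna_to_img) / 2

# Toy batch: matched pairs point the same way; shuffling breaks the pairing.
images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
dna_matched = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
dna_shuffled = [dna_matched[1], dna_matched[2], dna_matched[0]]
loss_matched = info_nce(images, dna_matched)
loss_shuffled = info_nce(images, dna_shuffled)
```

Every other specimen in the batch acts as a negative, so a larger batch gives each positive pair more to be contrasted against.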

Versus the others: supervised maps input→label; masked-modelling SSL learns one
modality’s internal structure; contrastive learns relationships across samples
and modalities — the only one that can build a single image/DNA/text space.

Q: SSL — what is it actually learning?

SSL invents its own labels from the data. BarcodeBERT masks ~50% of the barcode
tokens and predicts them from context; the ground truth is just the hidden
letters. To win, the model must learn the statistical structure of barcodes —
which is what distinguishes lineages. (Image SSL does the analogue with pixels /
crops.)
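
The "invent labels from the data" step might look roughly like this: tokenise a barcode, hide about half the tokens, and keep the hidden ones as the prediction targets. This is only a sketch. The non-overlapping 4-mer tokenisation, the `[MASK]` symbol and the helper names are assumptions, and real masked-language-model training usually adds further details.

```python
import random

MASK = "[MASK]"

def kmer_tokens(seq, k=4):
    """Split a barcode into non-overlapping k-mers (k=4 is an assumption)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def mask_tokens(tokens, ratio=0.5, seed=0):
    """Hide a fraction of tokens; return (masked input, {position: hidden token})."""
    rng = random.Random(seed)
    n_mask = round(len(tokens) * ratio)
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return masked, targets

barcode = "ACGTTGCAATCGGATCCTAGGCTA"  # toy 24-base string, not a real COI barcode
tokens = kmer_tokens(barcode)
masked, targets = mask_tokens(tokens)
```

The model sees `masked` and is scored on recovering `targets`; no taxonomic label appears anywhere in the loop.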

The output isn’t a class — it’s an embedding, and training makes the
embedding space organised: similar sequences land near each other. Nobody
defined “species”; relatedness drives similarity, so taxonomic structure emerges.
This is why unlabelled data is good fuel, why zero-shot clustering works, and why
error detection works (structure built independently of possibly-wrong labels).

Two stages: SSL builds a label-free map of “what’s similar to what”; a small
supervised step names regions of it. The data-hungry part is unlabelled; labels
are spent only on naming.

Q: Is it like ordination on a sequence-distance matrix, just higher-dimensional?

Good analogy — same “sequence space” where proximity = similarity, and taxa
appear as clusters. The real difference isn’t dimensionality:

Ordination uses a fixed, hand-chosen distance (edit distance, K2P…),
treating every position by the same rule.
SSL learns a context-aware distance — discovering which positions and
combinations carry information — and embodies it in a reusable function you
can project new sequences into (an ordination is a one-off layout).

Smaller distinctions: the transformer doesn’t need alignment (k-mer tokens +
learned position); the encoder generalises to unseen sequences immediately; and
the learned-embedding framing extends across modalities (one shared space), which
a single distance matrix can’t. The two would agree on coarse blobs and diverge
at the species boundary — exactly where a learned metric earns its keep.
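
For contrast with a learned metric, here is what a fixed, hand-chosen distance looks like: the Kimura 2-parameter (K2P) distance, d = -½ ln((1 − 2P − Q)·√(1 − 2Q)), where P and Q are the proportions of transition and transversion differences. The sketch assumes pre-aligned, equal-length sequences, and the toy strings are illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent (saturated).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

d_same = k2p_distance("ACGT", "ACGT")
d_one_transition = k2p_distance("AAAA", "GAAA")
```

Every site is treated by the same two-rate rule, whatever its context, which is exactly the rigidity the learned embedding is free to drop.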

Native sizes differ: ViT-B and BarcodeBERT are 768-d, BERT-small is 512-d
(and the paper notes HyenaDNA 128-d, NT 512-d elsewhere). They were pretrained
independently.

Alignment across different sizes uses a projection head per encoder — a small
learned layer mapping each native output to one common dimension (standard CLIP
design). So matching the final dimension is required; matching the encoders
is not.
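
A toy version of the projection-head plumbing, assuming a plain linear layer per encoder. The random weights stand in for learned ones, and the common dimension of 8 is purely illustrative (the shared dimension actually used is not given above).

```python
import random

def make_projection(in_dim, out_dim, seed):
    """Random weight matrix standing in for a learned linear projection head."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weights, vector):
    """Apply the linear map: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

COMMON_DIM = 8  # illustrative; not the paper's value

# Native sizes quoted above: image and DNA encoders 768-d, text encoder 512-d.
image_head = make_projection(768, COMMON_DIM, seed=1)
dna_head = make_projection(768, COMMON_DIM, seed=2)
text_head = make_projection(512, COMMON_DIM, seed=3)

image_vec = project(image_head, [0.01] * 768)
dna_vec = project(dna_head, [0.02] * 768)
text_vec = project(text_head, [0.03] * 512)
```

After projection all three vectors have the same length and can be compared, but with untrained weights the comparison carries no meaning yet.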

But same size ≠ multimodal. Equal-length vectors only let you compute a
similarity; they don’t make it meaningful. Table 5’s ✗✗✗ baseline has
compatible sizes yet Image-to-DNA scores ~0% — the spaces were never aligned.
What makes it multimodal is the contrastive training that organises those
vectors into one shared, cross-comparable space (in Table 5, Image-to-DNA goes
from 4.6% to 94.4% at order, and from 0.0% to 56.1% at genus).
Dimension-matching is plumbing; alignment is the thing. Corollary: you can add
a modality just by giving it an encoder + projection, with no need to touch the
others.

Q: What do “linear probe” and “1NN” probes mean (Table 4)?

Three ways to evaluate the same pretrained encoder, differing in how much you
may change it:

Fine-tuned — train the whole encoder + head on the task. Best accuracy
(~99%), but can’t tell whether pretraining or fine-tuning earned it.
Linear probe — freeze the encoder, train only a linear layer. Tests
whether the frozen embeddings are linearly separable → “how good are the raw
embeddings?” (DNABERT-S shines: 95.5%.)
1NN probe — no training at all; classify by nearest embedding. Purest test
of the embedding geometry.

Crucially the columns aren’t the same task: Fine-tuned and Linear probe are
closed-world seen-species; 1NN is open-world (“Unseen: Genus”) — assign an
unseen species’ barcode to the right genus, where you can’t train, so 1NN is
the only fair tool. That’s where the authors’ model leads dramatically (47.0% vs
~18–23%), because fine-tuned accuracy is saturated (BLAST alone hits 99.78%) and
doesn’t discriminate. Each probe strips away more help, so each is a stricter
test of what SSL baked in. (Table 6 uses the same three probes to pick the 8-4-4
architecture.)

Q: (Table 5 in full.)

Multimodal retrieval top-1 macro accuracy (%) on the test set. Each taxon block:
✗✗✗ baseline, then ✓✓✓ 1M, ✓✓✗ 5M, ✓✓✓ 5M. H.M. = harmonic mean of seen/unseen.

Taxon	Records	Aligned (Img/DNA/Txt)	DNA→DNA H.M.	Img→Img H.M.	Img→DNA H.M.
Order	—	✗✗✗	96.8	80.2	4.6
Order	5M	✓✓✓	100	96.8	94.4
Family	—	✗✗✗	91.2	53.8	0.4
Family	5M	✓✓✓	99.7	93.0	82.7
Genus	—	✗✗✗	86.2	32.9	0.0
Genus	5M	✓✓✓	99.2	85.0	56.1
Species	—	✗✗✗	79.8	16.6	0.0
Species	5M	✓✓✗	97.0	67.5	22.0
Species	5M	✓✓✓	97.0	67.1	20.3

(Condensed to the H.M. columns and a subset of rows; the full table also lists
separate Seen/Unseen values per group and the 1M rows.) Three reads: ✗✗✗ →
trained (Img→DNA 0 → high); 1M → 5M (more data helps); ✓✓✗ vs ✓✓✓ at 5M (text
helps coarse, slightly hurts species).

Q: Summary — DNA beats photos of dead insects under ring lights?

Faithful, and not cynical. DNA→DNA sits at 97–100% across ranks; image→image tops
out in the 60s at species; the cross-modal trick helps because DNA lends signal
to images, not vice versa.

Two refinements:

“DNA beats photos” is almost a tautology — a barcode is designed to separate
species. The interesting part is how far images can be dragged up the ranks
by alignment (usable at order/family).
DNA wins on accuracy but loses on cost and scale. Photos are cheap and
scalable; sequencing isn’t. The contribution is a principled way to spend the
expensive modality to make the cheap one better.

Bumper sticker: for telling species apart, sequence beats picture, and the
clever bit is using the sequence to teach the picture.

Q: Generous conclusion — a DNA+image model identifies new image samples better than image-only?

First half exactly right: identifying insect images improves if you train on
paired images + DNA for the same specimen.

One slip: DNA is a training ingredient, not a runtime input. In the
photo-ID scenario (Img→Img) there’s no DNA at inference — the improvement is baked
into the image encoder’s weights, then DNA is discarded. So you ship an
image-only model that was trained with DNA alignment. You never sequence the
new sample; you pay the DNA cost once on the training set.

(If you did have DNA for new specimens, you’d just run DNA→DNA at ~97% and skip
the photo — so the runtime-multimodal case isn’t really “identify new photos.”)

Precise version:

Training an image encoder jointly with paired DNA produces a better image-only
identifier than training on images alone — so for cheaply identifying new
insect photos, the DNA-aligned image model is the better tool, and you never
have to sequence the new samples.

Two footnotes: the comparison is against a raw image encoder, not an
identically-fine-tuned image-only one, so the DNA-specific contribution isn’t
cleanly isolated; and “new sample” holds within the same imaging setup — a field
photo is an untested distribution.

Q: Do the authors themselves clearly state that conclusion?

They state the building block, but more cautiously and more narrowly than the
packaged conclusion — and they never frame it as “identify previously unseen
images.” Our version is a fair extrapolation, not a liftable sentence.

The closest the paper comes is in the §4.4 Results: the full model “improves
classification accuracy for same-modality queries and enables cross-modality
queries. By aligning to DNA, our image embeddings are able to capture finer
details.” That clause endorses the mechanism we reached. But three gaps remain:

“Same-modality queries,” not “unseen images.” They claim Image-to-Image
accuracy rises (and the unseen numbers do rise in Table 5), but never distil it
into a headline like “DNA+image training improves identification of new images.”
That synthesis is ours.
No deployment claim. The “throw the DNA away at inference / never sequence
the new sample” framing — the practically important part — is true of their
setup but left implicit. They describe a retrieval system, not a field tool.
Baseline is the raw encoder. They compare the aligned model to “the initial
pretrained encoders before multimodal contrastive learning,” i.e. a pretrained
but unaligned ViT, not an image-only model fine-tuned the same way. So “DNA specifically
causes the gain” stays our careful inference, not their stated result.

The formal Conclusion (§5) is more reserved still — dataset-paper language: it
lists “multimodal classification combining DNA, images, and taxonomic labels” as a
benchmark “demonstrating real-world applicability,” and stops. No sentence singles
out “DNA improves image classification” as a takeaway.

Net: they clearly state the building block (aligning to DNA lets image
embeddings capture finer detail and raises same-modality accuracy, including on
unseen species); they do not state the packaged conclusion (DNA+image training
yields a better image-only identifier for new samples, no sequencing at
deployment). The pieces and the mechanism are theirs; the finished sentence is the
reader’s.

Q: Do they state anything like it in the abstract?

No — the abstract is even more neutral than the body. Its only sentence on the
multimodal experiment: the contrastive learning “yields a general shared
embedding space enabling taxonomic classification using multiple types of
information and modalities.”

That’s a capability claim — the dataset enables a shared space and cross-modal
classification. It says nothing about improvement, nothing comparing DNA to
images, nothing about one modality teaching the other, and nothing about unseen
images. Even the “aligning to DNA → finer image detail” line from §4.4 doesn’t
make it into the abstract.

The claim weakens as you move outward toward what most people read:

Table 5 / Results (innermost): concrete — image accuracy rises, DNA
alignment sharpens image embeddings, reported on seen and unseen species.
§5 Conclusion: softer — multimodal classification as a benchmark
“demonstrating real-world applicability.”
Abstract (outermost): softest — just “enabling taxonomic classification
using multiple modalities.”

This is the genre, not evasion: a dataset paper advertises capability and leaves
interpretation of the numbers to the reader.

A single honest line for citation: the paper demonstrates that contrastively
aligning images with DNA produces a shared space in which same-modality image
retrieval improves over the raw encoder (including on unseen species) — but it
does not claim, in the abstract or conclusion, that DNA-trained models are a
better image-only identifier for new samples. That packaging is a defensible
inference from Table 5, not their stated conclusion.
