Wednesday, March 18, 2026

SimpleMappr is dead, long live SimpleMappr?

David Shorthouse, perhaps best known for his fabulous Bionomia project, is also the author of SimpleMappr, a web site for generating publication-ready species distribution maps. These maps have appeared in many publications, and also pop up in iNaturalist.

David has announced that SimpleMappr will be turned off. Obviously not an easy decision for him to take, and sadly yet more evidence of the fragility of a lot of taxonomic infrastructure (as seen in the struggles of both BHL and TAXACOM).

I don’t use SimpleMappr, but I know that lots of people do, and so I wondered how easy it would be to create a new version (based on David’s code) that could be hosted either on a central site or on people’s own computers. The short answer is that it is “easy”, so long as your definition of “easy” includes (a) getting Claude Code to do the bulk of the work, and (b) ignoring most of the more sophisticated features of David’s app.

Long story short, I have a (somewhat) working version of SimpleMappr running on a cloud server at https://simplemappr.cloud, source code here.

As with my previous post, this project involved forking the original code, asking Claude to read it and sketch out a way to move it to a more robust setting, in this case using Docker containers. Early days, but I am delighted how easy (for various values of “easy”) it is to breathe new life into old projects.

Written with StackEdit.

Using AI to revive a macOS app to preview GIS files

About a decade ago, when I was working with GIS files such as shapefiles, there was a nice QuickLook plugin for Macs called GISLook that would show you the corresponding map as an icon.

macOS keeps evolving, and as a result apps become obsolete unless they are continually updated. For small, solo-developer projects, this often means the app no longer works. If the code is open source, as it is in this case, then potentially somebody can come along and revive the project. But realistically this can be a daunting prospect. I last wrote native macOS code about two decades ago, and a lot has changed.

Indeed, a lot has changed. With tools such as Claude Code, it is possible to point an AI at an old repository and, in effect, say, “build this, but for today’s Macs”.

In this case, I cloned the original repo, asked Claude to take a look, and then created a new repo rdmpage/gis-quicklook, and Claude got to work. Of the original code, only the core file-reading functions survive; the interface code has all gone. But after about a day’s messing about, I have a new app that has even more functionality, because it supports the GeoJSON format as well.

You can get the app from the GitHub repo. Note that because it isn’t from the App Store you will need to run a one-time command in Terminal to get it to work:

xattr -cr GISLookApp.app

Here are four GIS files and their preview icons.

I should thank the original author, Bernie Jenny at Monash University in Melbourne. If you are at all interested in maps, globes, and cartography, you should look at his web page; it has some very cool stuff. I couldn’t have done this project without his open source (GPL 3.0) code.

Nor could I have done it without Claude Code. The level of debugging involved in this project was insane: log files flying past, Finder cache rebuilds, numerous dead ends and subtle “gotchas”, never mind the obstacle of learning how to support custom Finder icons and previews on a modern Mac.

This is the larger point: AI makes it possible, at least in principle, to look at an old, abandoned project, perhaps targeting an out-of-date API, and have a realistic chance of reviving it. That is a real game changer, made possible by a combination of open source and state-of-the-art AI.

Written with StackEdit.

Tuesday, March 10, 2026

Using AI to understand a DNA barcoding mystery

As I continue to add features to BOLD-View I keep coming across interesting cases where something seems not quite right with the data. Typically this prompts further investigation, which means going down a rabbit hole. As an example, take barcode GMAEA6199-22, which was collected by Donald Hobern in Australia and is identified to order level as Strepsiptera (I thank Donald for this example).

This barcode falls within a cluster which contains sequences labelled as either strepsipteran (in many cases identified down to species, Elenchus varleyi) or hemipteran. Almost all the strepsipteran sequences are assigned to a BIN, BOLD:ACH2898. The hemipteran sequences are not assigned to a BIN, even though they are essentially identical to the strepsipteran sequences.

A new feature I’ve added to BOLD-View’s display for a barcode is a map of all similar sequences, grouped by BINs. For this example the Strepsiptera BIN has a wide geographic range, including Central America and South Africa, as well as Australia (through GMAEA6199-22, which is a strepsipteran but not included in the BIN).

So, we have a widespread BIN, likely to be Elenchus varleyi, which has identical sequences to those labelled as hemipteran. What is going on?

Now, if you know anything about Strepsiptera you’ll know that they have a pretty bonkers biology, being endoparasites of insects, and males and females have quite different life histories. But as an exercise I decided to ask Claude to see what it made of this situation. Here is the prompt I used.

I have an interesting DNA barcoding case I want you to look at. I am uploading part of a phylogeny for DNA barcodes for a strepsiteran Elenchus varleyi. Note that the barcodes mix sequences from Elenchus and Hemiptera. The barcodes come from samples in Australia, South Africa, and central America. I am uploading a Markdown file for a paper with the DOI 10.1071/it9890175. If you need to get information on species distributions then https://gbif.org is a good source. Given all this information, can you suggest explanations for why we have barcode sequences from Elenchus and Hemiptera, why only some sequences are assigned to a BIN, and why are these barcodes widely distributed.

I also uploaded the tree picture you see above, and the paper “Descriptions and biological notes of the Australian Elenchidae (Strepsiptera)” doi:10.1071/IT9890175, which described Elenchus varleyi (I got the DOI for the paper from another of my projects, BioNames).

The paper is behind a paywall, and isn’t on Sci-Hub, but is on [ResearchGate](https://www.researchgate.net/publication/248899086), so I downloaded the PDF and used Marker to convert it to Markdown, which has become the default format of choice for sending documents to AIs.

Claude thought for a bit and then came back with a summary that basically said:

  • the mixture of strepsipteran and hemipteran sequences is likely due to the “hemipteran” barcodes actually being from their strepsipteran parasites rather than the hemipterans themselves.
  • given that one of the hosts of Elenchus varleyi is Sogatella kolophon, which is widespread, that would account for the occurrence of essentially identical DNA sequences across separate continents.

Claude noted that a single hemipteran can host multiple strepsipteran individuals, increasing the chance of mistakenly sequencing the parasite rather than the host. The picture below shows adult neotenic females ♀ of Elenchus varleyi in the abdomen of the hemipteran Sogatella kolophon (Fig. 14 from doi:10.1071/IT9890175).

Now, you and I could figure this out with some background biology and reading the paper, but the point here is that Claude could do it more or less automatically, which means we could potentially scale this to look at all “odd” or “anomalous” barcodes and get suggestions as to what has happened. Hence we have the possibility of large-scale database curation (see also “CurateGPT: A flexible language-model assisted biocuration tool” doi:10.48550/arXiv.2411.00046).

That said, we need some things in place. It helps to have links to information on the species and sequences involved, e.g. what paper(s) described the species, what paper(s) published the sequences. In this case I used BioNames to find the DOI for the paper describing Elenchus varleyi which was behind a paywall. That is likely to be an obstacle to this approach.

Out of curiosity I explored an alternative scenario, in part inspired by Plazi. Imagine that we can’t share the full text of the paper, but we do have access to the facts or claims the paper makes. Plazi doesn’t quite do this (it extracts images and the text of taxonomic treatments), but let’s imagine we could extract a series of triples from the paper, such as:

  • (Elenchus varleyi, parasitises, Sogatella kolophon)

Given a set of these we could potentially still infer the cause of the barcoding patterns we were seeing, but without needing access to the paywalled text (we’d need access to the text to extract the triples, but Claude wouldn’t need to see the text). Of course, this is not a new idea (Peter Murray-Rust has been advocating for this for a long time).
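To make this concrete, here is a minimal sketch (hypothetical code, not part of any existing pipeline) of how a handful of extracted triples could be stored and queried in Python. The triples themselves are facts from Kathirithamby (1989); the `query` helper is purely illustrative.

```python
# Hypothetical sketch: a tiny in-memory "triple store" for facts
# extracted from a paper. The facts come from doi:10.1071/IT9890175;
# the code and function names are invented for illustration.

triples = {
    ("Elenchus varleyi", "parasitises", "Sogatella kolophon"),
    ("Elenchus varleyi", "host_family", "Delphacidae"),
    ("Elenchus", "distribution", "cosmopolitan"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching any combination of subject, predicate, object."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# What do we know about the hosts of Elenchus varleyi?
hosts = query(subject="Elenchus varleyi", predicate="parasitises")
```

An LLM given only a fact base like this (rather than the paywalled full text) could, in principle, still assemble the parasite-contamination explanation.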

But what triples should we extract from a given paper? Out of interest I asked Claude what triples it would need to generate the answer it gave me to the Strepsiptera/Hemiptera example. Here is what it gave me:
Parasite–host relationships:

  • (Elenchus varleyi, parasitises, Sogatella kolophon)
  • (Elenchus varleyi, parasitises, Toya drope)
  • (Elenchus varleyi, host_family, Delphacidae)
  • (Sogatella kolophon, is_majority_host_of, Elenchus varleyi)

Life cycle and tissue intimacy (critical for the contamination argument):

  • (Elenchus varleyi female, is_obligate_endoparasite_of, adult host)
  • (Elenchus varleyi female, life_strategy, neotenic/larviform)
  • (Elenchus varleyi female, spends_entire_life_in, host body)
  • (Elenchus varleyi male, develops_as, endoparasitic larva)
  • (Elenchus varleyi female, extrudes_only_in, adult hosts)

Superparasitism (strengthens contamination argument):

  • (Elenchus varleyi, exhibits, superparasitism)
  • (superparasitism, maximum_observed, six parasitoids per host)
  • (superparasitism, frequency, common in Brisbane)

Dispersal mechanism:

  • (Elenchus varleyi female, dispersed_by, macropterous adult hosts)
  • (Elenchus varleyi larvae, dispersed_by, macropterous adult hosts)
  • (Elenchus varleyi male pupa, extrudes_in, both nymphs and macropterous adults)

Taxonomic and geographic scope:

  • (Elenchus, distribution, cosmopolitan)
  • (Elenchus, only_parasitises, Delphacidae)
  • (Elenchus varleyi, type_locality, Brisbane Queensland)
  • (Elenchus varleyi, distribution, Queensland NSW Canberra Northern Territory)

Supporting observation (supporting wide distribution argument):

  • (Perkins 1907, observed, Elenchus from widely separated countries show very little difference)

Now I think that we could dispense with some of these triples. I asked Claude, and it said the following would be enough:

  • (Elenchus varleyi, parasitises, Sogatella kolophon)
  • (Elenchus varleyi female, is_obligate_endoparasite_of, adult host)
  • (Elenchus varleyi, host_family, Delphacidae)
  • (Elenchus varleyi female, dispersed_by, macropterous adult hosts)
  • (Elenchus, distribution, cosmopolitan)

These are fairly simple statements that could likely be extracted from the paper. That raises the issues of how we do that, and how we express those triples. I asked Claude for suggested vocabularies; it mentioned Darwin Core and the Relations Ontology, among others. I am congenitally averse to big, verbose ontologies, so I’d prefer something lightweight, and maybe Darwin Core would be enough?
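As a sketch of what a lightweight Darwin Core approach might look like, the parasite–host triple could be expressed using Darwin Core’s ResourceRelationship terms (`resourceID`, `relationshipOfResource`, `relatedResourceID`, `relationshipAccordingTo` are real Darwin Core terms; the record below and the helper function are hypothetical):

```python
# Hypothetical sketch: expressing the triple
# (Elenchus varleyi, parasitises, Sogatella kolophon)
# as a Darwin Core ResourceRelationship record. The term names are
# genuine Darwin Core terms; the record content is illustrative.

relationship = {
    "dwc:resourceID": "Elenchus varleyi",
    "dwc:relationshipOfResource": "parasitises",
    "dwc:relatedResourceID": "Sogatella kolophon",
    "dwc:relationshipAccordingTo": "doi:10.1071/IT9890175",
}

def as_triple(record):
    """Collapse a ResourceRelationship record back to a bare triple."""
    return (
        record["dwc:resourceID"],
        record["dwc:relationshipOfResource"],
        record["dwc:relatedResourceID"],
    )
```

One attraction of this shape is that the provenance (`relationshipAccordingTo`) travels with each fact, so claims can be traced back to the paper that made them.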

So, in summary, I’m encouraged by the way Claude suggested a plausible explanation for the pattern in the barcode tree, and that it might not always need access to full text to do so (although I suspect giving an LLM access to full text is likely to beat giving it a set of triples that might not encompass all the relevant information in the paper). This also gives me a further incentive to work on the problem of providing context for each barcode, especially the scientific papers that published the sequences, and the papers that published the taxonomic names.

But one problem still remains. How do we get all this information back into BOLD, so that a user looking at these sequences knows what is going on, knows that “Hemiptera” doesn’t mean “Hemiptera” in this case, and that what we are seeing is a case of a widespread insect host being infected by a widespread parasite, which was originally described from Australia? The ability to add annotations and third-party analyses will become crucial if people are to get the most out of DNA barcoding databases.

References

  • Caufield, H., Kroll, C., O’Neil, S. T., Reese, J. T., Joachimiak, M. P., Hegde, H., Harris, N. L., Krishnamurthy, M., McLaughlin, J. A., Smedley, D., Haendel, M. A., Robinson, P. N., & Mungall, C. J. (2024). CurateGPT: A flexible language-model assisted biocuration tool (arXiv:2411.00046). arXiv. https://doi.org/10.48550/arXiv.2411.00046
  • Kathirithamby, J. (1989). Descriptions and biological notes of the Australian Elenchidae (Strepsiptera). Invertebrate Taxonomy, 3(2), 175–195. https://doi.org/10.1071/it9890175

Written with StackEdit.

Sunday, February 15, 2026

GBIF Geocoder: using GBIF to find places on a map

How to cite: Page, R. (2026). GBIF Geocoder: using GBIF to find places on a map https://doi.org/10.59350/7g6pt-3mz06

I’ve relaunched a “toy” tool that I made a while ago to help geocode localities using GBIF. Geocoding converts a text string, such as “Cambodia: Ratanakiri Province”, into latitude and longitude coordinates. For some reason, the biodiversity community typically refers to this as “georeferencing”, a term that usually means aligning an image of a map with geographic coordinates (see the Wikipedia entry for georeferencing, and Allmaps for some great examples).

You can try GBIF Geocoder at https://rdmpage.github.io/gbif-geocoder/. Code is available on GitHub at https://github.com/rdmpage/gbif-geocoder.

The idea behind the “GBIF Geocoder” is that GBIF has a huge number of geocoded specimens, and hence if you are looking for coordinates for a locality there is a good chance that somebody has already found them. So, all we need to do is search GBIF for specimens with localities that match the place you are trying to geocode. I created a version of this tool in 2018, mentioning it in a blog post GBIF at 1 billion - what’s next?, and wrote it up in a short note in bioRxiv Geocoding genomic databases using GBIF.
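The underlying idea can be sketched in a few lines: query GBIF’s occurrence search API for georeferenced records whose text matches the locality, and reuse their coordinates. The endpoint and the `q`, `hasCoordinate`, `decimalLatitude`, and `decimalLongitude` names are real parts of the GBIF API; the helper functions and the sample response below are invented for illustration (the actual GBIF Geocoder is written in JavaScript and works differently in detail).

```python
# Sketch of the GBIF Geocoder idea: full-text search of GBIF
# occurrences restricted to records that have coordinates.

from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"

def search_url(locality, limit=20):
    """Build a GBIF occurrence search URL restricted to georeferenced records."""
    params = {"q": locality, "hasCoordinate": "true", "limit": limit}
    return f"{GBIF_SEARCH}?{urlencode(params)}"

def extract_coordinates(response):
    """Pull (latitude, longitude, locality) tuples out of a search response."""
    return [
        (r["decimalLatitude"], r["decimalLongitude"], r.get("locality", ""))
        for r in response.get("results", [])
        if "decimalLatitude" in r and "decimalLongitude" in r
    ]

# A live call would look something like:
#   import json, urllib.request
#   with urllib.request.urlopen(search_url("Ratanakiri Province")) as f:
#       hits = extract_coordinates(json.load(f))

# Illustrative (made-up) response, so the sketch runs offline:
sample = {"results": [{"decimalLatitude": 13.7, "decimalLongitude": 107.1,
                       "locality": "Cambodia: Ratanakiri Province"}]}
coords = extract_coordinates(sample)
```

In other words, every georeferenced specimen in GBIF becomes a candidate answer to “where is this place?”, which is exactly the crowdsourced shortcut the tool exploits.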

The original version was hosted on Glitch, a wonderful platform where people could create pretty much anything using HTML and JavaScript. Glitch is no more, so I’ve finally got around to rebuilding it, inspired by this post on Bluesky by Tapani Hopkins:

Next quest: figure out from this map where "La Maboke" was. Searching for "Maboke, Central African Republic" turned out not to work like I'd hoped for. 😅 Though perhaps I should save the recipe. 🍲🐟


— Tapani Hopkins (@tapani-hopkins.bsky.social) Feb 14, 2026 at 13:00

The original project used node.js, whereas I wanted something simple using just HTML and JavaScript, so it could be hosted using GitHub Pages (or, indeed, on any other static hosting platform). I fired up Claude Code to help me with the port. I continue to be amazed at just how much fun this style of coding is, and at the power of the tools. I make requests and suggestions, and Claude will fire up an instance of Google Chrome to check that the code works. I think a key feature of this style of programming is that it reduces the initial hurdle when you know you need to make changes, and may even have made notes to yourself about what needs to be done, but face the initial tedium of reworking old code to work with a new platform (Googling questions, re-reading GitHub docs, etc.). Instead, I get to focus on what I want to do, namely reviving an old tool that I think people may find useful.

Written with StackEdit.