Recently I’ve been working with a master’s student, Maja Nagler, on a project using machine learning to identify images of Lepidoptera. This has been something of an adventure, as I am new to machine learning and have only minimal experience with the Python programming language. So what could possibly go wrong?
The inspiration for this project comes from (a) using iNaturalist’s machine learning to help identify pictures I take with their app, and (b) exploring DNA barcoding data, which includes a wealth of images of specimens linked to DNA sequences (see the gallery in GBIF), specimens that are presumably reliably identified (by their barcodes). So, could we use the DNA barcode images to build models to identify specimens? Is it possible to use models already trained on citizen science data, or do we need custom models trained on specimens? Can models trained on museum specimens be used to identify living specimens?
To answer this we’ve started simple, using the iNaturalist 2018 competition as a starting point. There is code on GitHub for an entry in that challenge, and the challenge data is available, so the idea was to take that code and model and see how well they work on DNA barcode images.
That was the plan. I ran into a slew of Python-related issues involving out-of-date code, broken dependencies, and problems running on a MacBook. Python is, well, a mess. I know there are ways to “tame” the mess, but I’m amazed that anyone gets anything done in machine learning given how temperamental the tools are.
Another consideration is that machine learning is computationally intensive and typically runs on PCs with NVIDIA GPUs, which Macs don’t have. However, Apple’s newer Macs provide Metal Performance Shaders (MPS), which does speed things up. But getting everything to work together was a nightmare: this is a field full of obscure incantations, bugs, and fixes. I describe some of the things I went through in the README for the repository. Note that this code is really a safety net. Maja is working on a more recent model (using Google’s Colab); I just wanted to make sure that we had a backup in place in case my notion that this ML stuff would be “easy” turned out to be, um, wrong.
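For anyone attempting the same setup, the device selection itself is only a few lines. Below is a minimal sketch (my paraphrase, not the repository’s actual code) of how PyTorch can be pointed at MPS, falling back to CUDA or the CPU:

```python
import torch

# Prefer Apple's Metal Performance Shaders (MPS) backend when available,
# otherwise fall back to CUDA (NVIDIA) or the plain CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

# The model and each batch of tensors must then be moved to the device
# explicitly, e.g. model.to(device) and images.to(device).
```

One of the “obscure incantations” worth knowing: not every operation is implemented on MPS yet, and setting the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 tells PyTorch to quietly run those operations on the CPU instead of crashing.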
Long story short, everything now works. Because our focus is Lepidoptera (moths and butterflies), I ended up subsetting the original challenge dataset to include just those taxa, which gives 1234 species. This is obviously a small number, but it means we can train a reasonable model in less than a week (ML is really, really computationally expensive).
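To give a flavour of the subsetting step, the challenge annotations are COCO-style JSON, so filtering to Lepidoptera is a short script. This is a sketch only: the file names are illustrative, and it assumes each category record carries an “order” field, as the iNat2018 category metadata does.

```python
import json

# Load the iNat2018 training annotations (COCO-style JSON).
with open("train2018.json") as f:
    data = json.load(f)

# Keep only categories whose taxonomic order is Lepidoptera.
lep_ids = {c["id"] for c in data["categories"] if c.get("order") == "Lepidoptera"}
print(f"{len(lep_ids)} Lepidoptera categories")

# Restrict the annotations and images to that subset.
annotations = [a for a in data["annotations"] if a["category_id"] in lep_ids]
image_ids = {a["image_id"] for a in annotations}

subset = {
    "images": [i for i in data["images"] if i["id"] in image_ids],
    "annotations": annotations,
    "categories": [c for c in data["categories"] if c["id"] in lep_ids],
}

with open("train2018_leps.json", "w") as f:
    json.dump(subset, f)
```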
There is still lots to do, but I want to share a small result. After training the model on Lepidoptera from the iNaturalist 2018 dataset, I ran a small number of images from the DNA barcode dataset through it. The results are encouraging. For example, for Junonia villida all the barcoded specimens were either correctly identified (green) or had the correct species among the top three hits (orange) (the code outputs the top three hits for each image). So a model trained on citizen science images of (mostly) living specimens can identify museum specimens.
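For concreteness, here is a toy sketch of the top-three logic and the green/orange bookkeeping. The model and data below are placeholders standing in for the trained network and the preprocessed barcode images, not the real pipeline.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins: in the real pipeline `model` is the trained
# classifier and `batch` a tensor of preprocessed specimen images.
num_species = 1234
model = nn.Linear(3 * 224 * 224, num_species)  # dummy classifier
batch = torch.randn(4, 3 * 224 * 224)          # four fake images
true_labels = torch.tensor([0, 1, 2, 3])       # fake ground truth

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(batch), dim=1)
    top_probs, top_ids = probs.topk(3, dim=1)  # top three hits per image

# "Green" = true species is the top hit; "orange" = it appears somewhere
# in the top three (but not first).
green = top_ids[:, 0] == true_labels
orange = (top_ids == true_labels.unsqueeze(1)).any(dim=1) & ~green
print(green, orange)
```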
For other species the results are not so great, but they are still interesting. For example, for Junonia orithya quite a few images are not correctly identified (red). Looking at the images, specimens photographed ventrally are going to be a problem (a ventral view is unlikely to be a common angle for photographs of living specimens), and specimens with scale grids and QR codes are unlikely to be seen in the wild(!).
An obvious thing to do would be to train a model on DNA barcode specimens and see how well it identifies citizen science images (and Maja will be doing just that). If that works well, it would suggest that there is scope for expanding models for identifying live insects to include museum specimen images (and vice versa); see also Towards a digital natural history museum.
It is early days, there is still lots of work to do, and deadlines are pressing, but I’m looking forward to seeing how Maja’s project evolves. Perhaps the pain of Python, PyTorch, MPS, etc. will all be worth it.