iPhylo: Sub-second searching of millions of DNA barcodes using a vector database

Roderic D. M. Page

Friday, July 28, 2023

Sub-second searching of millions of DNA barcodes using a vector database

How to cite: Page, R. (2023). Sub-second searching of millions of DNA barcodes using a vector database. https://doi.org/10.59350/qkn8x-mgz20

Recently I’ve been messing about with DNA barcodes. I’m junior author with David Schindel on forthcoming book chapter Creating Virtuous Cycles for DNA Barcoding: A Case Study in Science Innovation, Entrepreneurship, and Diplomacy, and I’ve blogged about Adventures in machine learning: iNaturalist, DNA barcodes, and Lepidoptera. One thing I’ve always wanted is a simple way to explore DNA barcodes both geographically and phylogenetically. I’ve made various toys (e.g., Notes on next steps for the million DNA barcodes map and DNA barcode browser) but one big challenge has been search.

The goal is to be able to do is take a DNA sequence and search the DNA barcode database for barcodes that are similar to that sequence, then build a phylogenetic tree for the results. And I want this to be fast. The approach I used in my :“DNA barcode browser” was to use Elasticsearch and index the DNA sequences as n-grams (=k-mers). This worked well for small numbers of sequences, but when I tried this for millions of sequences things got very slow, typically it took around eight seconds for a search to complete. This is about the same as BLAST on my laptop for the same dataset. These sort of search times are simply too slow, hence I put this work on the back burner. That is, until I started exploring vector databvases.

Vector databases, as the name suggests, store vectors, that is, arrays of numbers. Many of the AI sites currently gaining attention use vector databases. For example, chat bots based on ChatGPT are typically taking text, converting it to an “embedding” (a vector), then searching in a database for similar vectors which, hopefully, correspond to documents that are related to the original query (see ChatGPT, semantic search, and knowledge graphs).

The key step is to convert the thing you are interested in (e.g., text, or an image) into an embedding, which is a vector of fixed length that encodes information about the thing. In the case of DNA sequences one way to do this is to use k-mers. These are short, overlapping fragments of the DNA sequence (see This is what phylodiversity looks like). In the case of k-mers of length 5 the embedding is a vector of the frequencies of the 4⁵ = 1,024 different k-mers for the letters A, C, G, and T.

But what do we do with these vectors? This is where the vector database comes in. Search in a vector database is essentially a nearest-neighbour search - we want to find vectors that are similar to our query vector. There has been a lot of cool research on this problem (which is now highly topical because of the burgeoning interest in machine learning), and not only are there vector databases, but tools to add this functionality to existing databases.

So, I decided to experiment. I grabbed a copy of PostgreSQL (not a database I’d used before), added the pgvector extension, then created a database with over 9 million DNA barcodes. After a bit of faffing around, I got it to work (code still needs cleaning up, but I will release something soon).

So far the results are surprisingly good. If I enter a nucleotide sequence, such as JF491468 (Neacomys sp. BOLD:AAA7034 voucher ROM 118791) and search for the 100 most similar sequences I get back 100 Neacomys sequences in 0.14 seconds(!). I can then take the vectors for each of those sequences (i.e., the array of k-mer frequencies), compute a pairwise distance matrix, then build a phylogeny (in PAUP,* naturally).

Searches this rapid mean we can start to interactively explore large databases of DNA barcodes, as well as quickly take new, unknown sequences and ask “have we seen this before?”

As a general tool this approach has limitations. Vector databases have a limit on the size of vector they can handle, so k-mers much larger than 5 will not be feasible (unless the vectors are sparse in the sense that not all k-mers actually occur). Also it’s not clear to me how much this approach succeeds because of the nature of barcode data. Typically barcodes are either very similar to each other (i.e., from the same species), or they are quite different (the famous “barcode gap”). This may have implications for the success of nearest neighbour searching.

Still early days, but so far this has been a revelation, and opens up some interesting possibilities for how we could explore and interact with DNA barcodes.

Written with StackEdit.