Following on from releasing BOLD View I’ve started to explore how the classification of DNA barcodes changes over time. BOLD uses the RESL algorithm described in Ratnasingham & Hebert (2013) and Hebert & Ratnasingham (2016) to cluster barcodes into “BINs”. As the number of DNA barcodes grows over time these clusters may change. For example, some clusters may increase in size as barcodes are added, and some clusters may be merged as sequences of intermediate similarity are found that link those BINs. Within the public-facing BOLD portal there is no way to see the history of a BIN (Meier et al., 2022), so I decided to explore this. I downloaded data packages from BOLD for the period 2022-2024, as well as the BARCODE 500K data for 2016. BOLD issues regular releases of its data; quarterly releases are persistent and receive a DOI. More frequent releases don’t get a DOI and seem to disappear from the web site, but I have a copy of the release for 06-Sep-2024, which I used to create BOLD View.
The data packages I’ve used to infer version history are listed below.
Dataset | DOI |
---|---|
iBOLD.31-Dec-2016 | 10.5883/dp-ibold.31-dec-2016 |
BOLD_Public.30-Mar-2022 | 10.5883/dp-bold_public.30-mar-2022 |
BOLD_Public.06-Jul-2022 | 10.5883/dp-bold_public.06-jul-2022 |
BOLD_Public.28-Sep-2022 | 10.5883/dp-bold_public.28-sep-2022 |
BOLD_Public.30-Dec-2022 | 10.5883/dp-bold_public.30-dec-2022 |
BOLD_Public.31-Mar-2023 | 10.5883/dp-bold_public.31-mar-2023 |
BOLD_Public.30-Jun-2023 | 10.5883/dp-bold_public.30-jun-2023 |
BOLD_Public.29-Sep-2023 | 10.5883/dp-bold_public.29-sep-2023 |
BOLD_Public.29-Dec-2023 | 10.5883/dp-bold_public.29-dec-2023 |
BOLD_Public.29-Mar-2024 | 10.5883/dp-bold_public.29-mar-2024 |
BOLD_Public.19-Jul-2024 | 10.5883/dp-bold_public.19-jul-2024 |
BOLD_Public.06-Sep-2024 | no DOI |
Versioning
I am only interested in a few of the fields in the data, namely bin_uri, identification, identification_method, and identified_by. Note that field names can change between data packages, so we may have to translate field names, or assemble a field’s value from other fields (e.g., taxonomic classification). Rather than store all the data I used tuple versioning, so that we store values for processid and the various data fields, together with values for valid_from and valid_to. The first time a combination of values is found we set valid_from to the YYYY-MM-DD date of the corresponding data package, and valid_to to NULL. Note that we may have multiple barcodes for a given processid (e.g., for different genes) so we index on both processid and marker_code. We also compute an MD5 hash of the data for a barcode to enable fast lookup of a particular set of values. The hash is not sufficient to identify an edit as the same set of values may have more than one period of validity. For example, a barcode may be in one BIN, then move to another, then move back again.
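The hashing step can be sketched in a few lines of Python. This is only an illustration: the function name row_hash, the field order, and the choice of separator are assumptions, not a description of the code actually used.

```python
import hashlib

# The four tracked fields named in the text; the order is an arbitrary choice.
FIELDS = ["bin_uri", "identification", "identification_method", "identified_by"]

def row_hash(record: dict) -> str:
    """Return an MD5 hex digest of the tracked fields of one barcode record."""
    # Missing or None values become empty strings. The unit separator (\x1f)
    # keeps ("ab", "c") distinct from ("a", "bc") when the values are joined.
    parts = ["" if record.get(f) is None else str(record[f]) for f in FIELDS]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Two records with the same values for these fields get the same digest, which makes it cheap to find a particular set of values, but (as noted above) says nothing about when those values were valid.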
When we load the first data package (iBOLD.31-Dec-2016) all rows in the database will have NULL values for valid_to. This signals that those values for the data are currently valid. We then add the remaining data packages from oldest to most recent. For each barcode, if the data in the current package is the same as that already in the database (i.e., the row for which valid_to is NULL) we do nothing. But if the data has changed we do the following:
- set valid_to for the most recent row to the YYYY-MM-DD date of the current data package
- add a new row with valid_from set to the same date, and valid_to set to NULL.
At the end of this process we have a list of values for the selected fields for each barcode, together with the time span that those values were valid.
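To make the bookkeeping concrete, here is a minimal sketch of the update rule in Python with SQLite. It is an illustration under assumptions rather than the code behind BOLD View: the table name barcode_version, the functions create_table and add_package, and the use of SQLite are all hypothetical. For simplicity it compares field values directly instead of using a hash, and it does not deal with barcodes that disappear from a later package.

```python
import sqlite3

FIELDS = ("bin_uri", "identification", "identification_method", "identified_by")

def create_table(conn):
    """Create the (hypothetical) version table, indexed on processid + marker_code."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS barcode_version (
            processid TEXT NOT NULL,
            marker_code TEXT NOT NULL,
            bin_uri TEXT,
            identification TEXT,
            identification_method TEXT,
            identified_by TEXT,
            valid_from TEXT NOT NULL,  -- YYYY-MM-DD date of the data package
            valid_to TEXT              -- NULL means currently valid
        )""")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_barcode "
                 "ON barcode_version (processid, marker_code)")

def add_package(conn, package_date, records):
    """Merge one data package (an iterable of dicts); call oldest package first."""
    for rec in records:
        key = (rec["processid"], rec["marker_code"])
        new_values = tuple(rec.get(f) for f in FIELDS)
        # Fetch the currently valid row for this barcode, if there is one.
        current = conn.execute(
            "SELECT rowid, bin_uri, identification, identification_method, "
            "identified_by FROM barcode_version "
            "WHERE processid = ? AND marker_code = ? AND valid_to IS NULL",
            key).fetchone()
        if current is not None and tuple(current[1:]) == new_values:
            continue  # nothing changed, so do nothing
        if current is not None:
            # Close the old version at the date of the current package.
            conn.execute("UPDATE barcode_version SET valid_to = ? WHERE rowid = ?",
                         (package_date, current[0]))
        # Open a new version that is valid from the same date.
        conn.execute("INSERT INTO barcode_version VALUES (?, ?, ?, ?, ?, ?, ?, NULL)",
                     key + new_values + (package_date,))
    conn.commit()
```

Calling add_package once per data package, in date order, leaves one row per distinct period of validity for each barcode.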
Queries
There are two kinds of queries I’ve explored so far. The first is tracking the changes for an individual barcode, the other is the history of a BIN.
Barcode histories
Here is the history for XAF587-05
2022-03-30 - 2022-09-28
- identification: Poanes hobomok
- identified_by: Paul Hebert
2022-09-28 - 2024-07-19
- identification: Lon hobomok
- identified_by: Paul Hebert
2024-07-19 -
- identification: Lon hobomok
- identified_by: Paul D.N. Hebert
This example shows that we need to be careful when counting edits to a barcode. We could simply record these as changes in identification and identifier, but it is a little more complicated. Poanes hobomok and Lon hobomok are synonyms (Cong et al., 2019), so we’ve not changed the taxonomic identification, merely the name. In the absence of a single authoritative source of taxonomic names and synonyms I use TAXMATCH-like rules to “stem” the species names (Boyle et al., 2013), so that if two values of identification have the same species epithet (taking into account possible change in gender of the genus name) I treat these as changes in name, not identification. The other change is from “Paul Hebert” to “Paul D.N. Hebert”, which is clearly the same person. I compute the Levenshtein distance between values of identified_by and treat any value > 5 as a different name (5 was chosen so that “Paul Hebert” and “Paul D.N. Hebert” would be treated as the same).
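Both comparisons can be sketched roughly as follows. The Levenshtein function is the standard dynamic-programming algorithm. The epithet stemming is a deliberately crude stand-in for TAXMATCH-like rules: the list of endings and the helper names (epithet_stem, same_taxon_name, same_identifier) are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

def same_identifier(a: str, b: str, threshold: int = 5) -> bool:
    """Treat two identified_by values as the same person if distance <= threshold."""
    return levenshtein(a, b) <= threshold

# Crude, illustrative list of gendered Latin endings (NOT the real TAXMATCH rules).
GENDER_ENDINGS = ("us", "um", "is", "a", "e")

def epithet_stem(name: str) -> str:
    """Return the last word of a name with one common gendered ending removed."""
    parts = name.split()
    if len(parts) < 2:
        return ""  # genus-only (or empty) identification has no epithet
    epithet = parts[-1].lower()
    for ending in GENDER_ENDINGS:
        if epithet.endswith(ending) and len(epithet) > len(ending) + 2:
            return epithet[: -len(ending)]
    return epithet

def same_taxon_name(a: str, b: str) -> bool:
    """True if both names have an epithet and the stemmed epithets match."""
    sa, sb = epithet_stem(a), epithet_stem(b)
    return sa != "" and sa == sb

print(levenshtein("Paul Hebert", "Paul D.N. Hebert"))     # 5, so same person
print(same_taxon_name("Poanes hobomok", "Lon hobomok"))   # True: a change of name only
```

Because this sketch simply takes the last word as the epithet, trinomials and names with trailing qualifiers would need extra handling.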
BINs
For BINs I reconstruct the history by taking a BIN and finding all barcodes that have, at any point in time, been a member of that BIN. So far the best way I’ve come up with to visualise the changes in a BIN is to create a “storyline” (see Liu et al., 2013) where the composition of each BIN is shown at each time slice.
For example, here is the history of BIN BOLD:ABX0491, which contains barcodes identified as Rhamma, Rhamma anosma, and Rhamma bilix (Prieto et al., 2021).
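As a rough sketch of that lookup, assume the versioned rows live in a SQLite table called barcode_version with columns processid, marker_code, bin_uri, valid_from and valid_to (the table name and both function names below are hypothetical). The first function pulls every version of every barcode that was ever in the BIN; the second works out which BIN each of those barcodes was in at a given time slice, treating each period of validity as running from valid_from up to, but not including, valid_to.

```python
import sqlite3

def bin_history(conn, bin_uri):
    """All versions of every barcode that has ever been a member of bin_uri."""
    return conn.execute(
        "SELECT v.processid, v.marker_code, v.bin_uri, v.valid_from, v.valid_to "
        "FROM barcode_version AS v "
        "JOIN (SELECT DISTINCT processid, marker_code FROM barcode_version "
        "      WHERE bin_uri = ?) AS m "
        "  ON m.processid = v.processid AND m.marker_code = v.marker_code "
        "ORDER BY v.processid, v.marker_code, v.valid_from",
        (bin_uri,)).fetchall()

def composition_at(rows, date):
    """Map (processid, marker_code) -> bin_uri for rows valid on an ISO date."""
    result = {}
    for processid, marker_code, bin_uri, valid_from, valid_to in rows:
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if valid_from <= date and (valid_to is None or date < valid_to):
            result[(processid, marker_code)] = bin_uri
    return result
```

Calling composition_at once per data package date gives the columns of the storyline.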
The vertical columns are time slices, barcodes in the same BIN are grouped together in coloured rectangles, and the history of each barcode can be traced from left to right. You can see cases where barcodes have moved between BINs (BOLD:ABX0491 gobbled up two smaller BINs). There are also barcodes that were (for one time slice) not in any BIN.
This visualisation has been challenging to create. I ended up using Graphviz, as implemented at https://dreampuf.github.io/GraphvizOnline.
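One way to get a storyline-like picture out of Graphviz is to emit DOT text with one cluster per BIN per time slice and an edge linking each barcode to itself in the next slice. The sketch below does that, assuming the input is a dict of the form {date: {barcode: bin_uri or None}}. The function name storyline_dot and this input shape are illustrative assumptions, and the resulting layout is simply whatever dot produces, not necessarily the layout of the figure above.

```python
def storyline_dot(slices):
    """Build DOT text from {date: {barcode: bin_uri or None}}."""
    dates = sorted(slices)  # ISO dates sort chronologically as strings
    out = ["digraph storyline {", "  rankdir=LR;", "  node [shape=box];"]
    for i, date in enumerate(dates):
        # Group the barcodes present in this time slice by BIN.
        groups = {}
        for barcode, bin_uri in slices[date].items():
            groups.setdefault(bin_uri, []).append(barcode)
        for n, bin_uri in enumerate(sorted(groups, key=str)):
            nodes = [f'    "{b}@{date}" [label="{b}"];'
                     for b in sorted(groups[bin_uri])]
            if bin_uri is None:
                out.extend(nodes)  # barcodes with no BIN sit outside any cluster
            else:
                # Graphviz draws a box around subgraphs named "cluster...".
                out.append(f"  subgraph cluster_{i}_{n} {{")
                out.append(f'    label="{bin_uri} ({date})";')
                out.extend(nodes)
                out.append("  }")
    # Link each barcode to itself in the next time slice, where it is present.
    for d1, d2 in zip(dates, dates[1:]):
        for b in sorted(set(slices[d1]) & set(slices[d2])):
            out.append(f'  "{b}@{d1}" -> "{b}@{d2}";')
    out.append("}")
    return "\n".join(out)
```

The returned string can be pasted into an online Graphviz editor or passed to the dot command-line tool.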
Summary
This is still early stages, but it looks promising. The next step would be to incorporate it into BOLD View. It might also be interesting to develop measures of stability of barcode clustering based on how often members move around.
References
- Boyle, B., Hopkins, N., Lu, Z., Raygoza Garay, J. A., Mozzherin, D., Rees, T., Matasci, N., Narro, M. L., Piel, W. H., Mckay, S. J., Lowry, S., Freeland, C., Peet, R. K., & Enquist, B. J. (2013). The taxonomic name resolution service: an online tool for automated standardization of plant names. BMC Bioinformatics, 14(1). https://doi.org/10.1186/1471-2105-14-16
- Cong, Q., Zhang, J., Shen, J., & Grishin, N. V. (2019). Fifty new genera of Hesperiidae (Lepidoptera). Insecta Mundi, 2019, 0731. https://doi.org/10.5281/zenodo.3677235
- Hebert, P., & Ratnasingham, S. (2016). Systems, methods, and computer program products for merging a new nucleotide or amino acid sequence into operational taxonomic units (United States Patent US20160103958A1). https://patents.google.com/patent/US20160103958A1
- Liu, S., Wu, Y., Wei, E., Liu, M., & Liu, Y. (2013). StoryFlow: Tracking the Evolution of Stories. IEEE Transactions on Visualization and Computer Graphics, 19(12), 2436–2445. https://doi.org/10.1109/TVCG.2013.196
- Meier, R., Blaimer, B. B., Buenaventura, E., Hartop, E., von Rintelen, T., Srivathsan, A., & Yeo, D. (2022). A re-analysis of the data in Sharkey et al.’s (2021) minimalist revision reveals that BINs do not deserve names, but BOLD Systems needs a stronger commitment to open science. Cladistics, 38, 264–275. https://doi.org/10.1111/cla.12489
- Prieto, C., Faynel, C., Robbins, R., & Hausmann, A. (2021). Congruence between morphology-based species and Barcode Index Numbers (BINs) in Neotropical Eumaeini (Lycaenidae). PeerJ, 9, e11843. https://doi.org/10.7717/peerj.11843
- Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. PLOS ONE, 8(7), e66213. https://doi.org/10.1371/journal.pone.0066213