iPhylo

Roderic D. M. Page

Thursday, June 05, 2025

A metabarcoding mess and the importance of just looking at the data

How to cite: Page, R. (2025). A metabarcoding mess and the importance of just looking at the data. https://doi.org/10.59350/q2v8n-wc488

Here I summarise a few posts on Bluesky where I raised concerns about some metadabarcoding datasets that were highlighted by GBIF:

Looking at these datasets it’s clear that something is wrong.

Data

The datasets discussed are for CO1 Amplicon Sequence Variants from Madagascar, which are part of the Insect Biome Atlas project. The data is described in Miraldo et al. https://doi.org/10.1038/s41597-025-05151-0. There are two datasets for Madagascar:

CO1 Amplicon Sequence Variants of leaf litter arthropod communities collected at Malaise traps from the Insect Biome Atlas project in Madagascar https://doi.org/10.15468/pad7pc
CO1 Amplicon Sequence Variants of bulk arthropod samples (mild lysis) collected with Malaise traps from the Insect Biome Atlas project in Madagascar https://doi.org/10.15468/6u5rum

In case the data changes in the future I’ve made snapshots of the two datasets and uploaded them to Zenodo doi:10.5281/zenodo.15599342. The files I downloaded (https://doi.org/10.15468/dl.kwjyjt and https://doi.org/10.15468/dl.2p3z5q) are the GBIF annotated archives, hence they include the mapping between the taxonomic names and GBIF’s backbone taxonomy.

Problem

In browsing the data on GBIF I noticed some striking distribution patterns: insects normally found in Europe and/or North America were also turning up in Madagascar, based solely on these metabarcoding datasets. For example, Helina impuncta.

Helina impuncta

Metadata barcoding data can be a complicated beast, especially if you try and navigate the multiple databases that house metadata on the sampling program and the output of sequencing machines. For example, GBIF occurrence 5162479277 is linked to ENA record ERR12944764 which in turn has multiple identifier links:

Study Accession	Sample Accession	Experiment Accession	Run Accession	Tax Id
PRJEB61109	SAMEA115499645	ERX12317105	ERR12944764	1234904

What’s nice about the GBIF datasets that they wrap all this up into a single package that we can explore. BLASTing a few sequences in these datasets suggests that the identifications of these sequences were probably correct, so the source of the problematic maps lies elsewhere.

Lots of maps

I wrote a simple PHP script to read the GBIF dataset, aggregate the GBIF taxon ids (i.e., the GBIF taxa that the sequences were mapped to) and draw a map for each taxon (code is on GitHub) . These maps use GBIF’s maps API to retrieve a tile (256 x 256 pixels) showing the distribution of each taxon on a global map (i.e., zoom level 0 on a tiled web map). I overlay that on a GBIF base map tile (see Base Map Tiles), and dump the output as HTML.

This is crude but gives a quick visual overview of the data. For the litter datasets there are a lot of these Euro-Madagascar distributions:

litter

For the malaise trap data the results look much more like what I’d expect, lots of taxa restricted to Madagascar.

malaise

But there are still examples of the problematic pattern mentioned above.

What happened?

In the paper describing the data there is a paragraph discussing contamination:

The paper goes on to discuss possible examples of contamination. Looking at the results I suspect there has been a lot more contamination than the authors allow, especially for the litter dataset.

Summary

These results are preliminary, and I’ve contacted the authors of the paper to see if we can find out what happened. But for me the most obvious conclusions are:

Metabarcoding has the potential to generate a lot of spurious records that may negatively impact databases such as GBIF.
One of the great features of GBIF is that it enables you to simply look at the data. In an age of automated pipelines and big data I think visualisation is increasingly important. It’s often an easy way to discover that something is not as it should be.

References

Miraldo, A., Sundh, J., Iwaszkiewicz-Eggebrecht, E. et al. Data of the Insect Biome Atlas: a metabarcoding survey of the terrestrial arthropods of Sweden and Madagascar. Sci Data 12, 835 (2025). https://doi.org/10.1038/s41597-025-05151-0

Page, R. (2025). Snapshot of Insect Biome Atlas data for Madagascar from GBIF [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15599342

Friday, May 16, 2025

Tracking changes in DNA barcode BINs

Following on from releasing BOLD View I’ve started to explore how the classifcation of DNA barcodes changes over time. BOLD uses the RESL algorithm described in Ratnasingham & Hebert (2013, 2016) to cluster barcodes into “BINs”. As the number of DNA barcodes grows over time these clusters may change. For example, some clusters may increase in size as barcodes are added, and some clusters may be merged as sequences of intermediate similarity are found that link those BINs. Within the public-facing BOLD portal there is no way to see the history of a BIN (Meier et al., 2022), so I decided to explore this. I downloaded of data packages from BOLD for the period 2022-2024, as well as the BARCODE 500K data for 2016. BOLD issues regular releases of its data, querterly releases are persistent and received a DOI. More regular releases don’t get a DOI and seem to disappear from the web site, but I have a copy of the release for 06-Sep-2024, which I used to create BOLD View.

The data packages I’ve used to infer version history are listed below.

Dataset	DOI
iBOLD.31-Dec-2016	10.5883/dp-ibold.31-dec-2016
BOLD_Public.30-Mar-2022	10.5883/dp-bold_public.30-mar-2022
BOLD_Public.06-Jul-2022	10.5883/dp-bold_public.06-jul-2022
BOLD_Public.28-Sep-2022	10.5883/dp-bold_public.28-sep-2022
BOLD_Public.30-Dec-2022	10.5883/dp-bold_public.30-dec-2022
BOLD_Public.31-Mar-2023	10.5883/dp-bold_public.31-mar-2023
BOLD_Public.30-Jun-2023	10.5883/dp-bold_public.30-jun-2023
BOLD_Public.29-Sep-2023	10.5883/dp-bold_public.29-sep-2023
BOLD_Public.29-Dec-2023	10.5883/dp-bold_public.29-dec-2023
BOLD_Public.29-Mar-2024	10.5883/dp-bold_public.29-mar-2024
BOLD_Public.19-Jul-2024	10.5883/dp-bold_public.19-jul-2024
BOLD_Public.06-Sep-2024	no DOI

Versioning

I am only interested in a few of the fields in the data, namely ,bin_uri, identification, identification_method, and identified_by. Note that field names can change between data packages, so we may have to translate field names, or assemble a field’s value from other fields (e.g., taxonomic classification). Rather than store all the data I used Tuple-versioning , so that we store values for processid and the various data fields, together values for valid_from and valid_to. The first time a combination of values is found we set valid_from to the YYYY-MM-DD date of the corresponding data package, and valid_to to NULL. Note that we may have multiple barcodes for a given processid (e.g., for different genes) so we index on both processid and marker_code. We also compute a MD5 hash of the data for a barcode to enable fast lookup of a particular set of values. The hash is not sufficient to identify an edit as the same set of values may have more than one period of validity. For example, a barcode may be in one BIN, then move to another, then move back again.

When we load the first data package (iBOLD.31-Dec-2016) all rows in the database will have NULL values for valid_to. This signals that those values for the data are currently valid. We then add the remaining data packages from oldest to most recent. For each barcode, if the data for a barcode in the current package is the same as that already in the database (i.e., for which valid_to is NULL) we do nothing. But if the data has changed we do the following:

set valid_to for the most recent row to the YYYY-MM-DD data of the current data package
add a new row with valid_from set to the same date, and valid_to set to NULL.

At the end of this process we have a list of values for the selected fields for each barcode, together with the time span that those values were valid.

Queries

There are two kinds of queries I’ve explored so far. The first is tracking the changes for an individual barcode, the other is the history of a BIN.

Barcode histories

Here is the history for XAF587-05

2022-03-30 - 2022-09-28

identification: Poanes hobomok
identified_by: Paul Hebert

2022-09-28 - 2024-07-19

identification: Lon hobomok
identified_by: Paul Hebert

2024-07-19 -

identification: Lon hobomok
identified_by: Paul D.N. Hebert

This examples shows that we need to be careful when counting edits to a barcode. We could simply record these as changes in identification and identifier, but is a little more complicated. Poanes hobomok and Lon hobomok are synonyms (Cong et al., 2019), so we’ve not changed the taxonomic identification, merely the name. In the absence of a single authoritative source of taxonomic names and synonyms I use TAXMATCH-like rules to “stem” the species names (Boyle, 2013), so that if two values of identification have the same species epithet (taking into account possible change in gender of the genus name) I treat these as changes in name, not identification. The other change is from “Paul Hebert” to “Paul D.N. Hebert”, which is clearly the same person. I compute the Levenshtein distance between values of identified_by and treat any value > 5 as a different name (5 was chosen so that “Paul Hebert” to “Paul D.N. Hebert” would be the same).

BINs

For BINs reconstruct the history by taking a BIN and finding all barcodes that have, at any point in time, been a member of that BIN. So far the best way I’ve come with to visualise the changes in a BIN is to create a “storyline” (see Liu et al., 2013) where the composition of each BIN is shown at each timeslice.

For example, here is the history of BIN BOLD:ABX0491 which contains barcocdes identifiers as Rhamma, Rhamma anosma, and Rhamma bilix (Prieto, et al. 2021).

The vertical columns are time slices, barcodes in the same BIN are grouped together in coloured rectangles, and the history of each barcode can be traced from left to right. You can see cases where barcodes have moved between BINs (BOLD:ABX0491 gobbled up two smaller BINs). There are also barcodes that were (for one time slice) not in any BIN.

This visualisation has been challenging to create, I ended up using # Graphviz as implememted in (https://dreampuf.github.io/GraphvizOnline).

Summary

This is still early stages, but it looks promising. The next step would be to incorporate it into BOLD View. It might also be interetsing to develop measures of stability of barcode clustering based on how often members move around.

References

Boyle, B., Hopkins, N., Lu, Z., Raygoza Garay, J. A., Mozzherin, D., Rees, T., Matasci, N., Narro, M. L., Piel, W. H., Mckay, S. J., Lowry, S., Freeland, C., Peet, R. K., & Enquist, B. J. (2013). The taxonomic name resolution service: an online tool for automated standardization of plant names. BMC Bioinformatics, 14(1). https://doi.org/10.1186/1471-2105-14-16
Cong, Q., Zhang, J., Shen, J., & Grishin, N. V. (2019). Fifty new genera of Hesperiidae (Lepidoptera). Insecta Mundi, 2019, 0731. https://doi.org/10.5281/zenodo.3677235
Hebert, P., & Ratnasingham, S. (2016). Systems, methods, and computer program products for merging a new nucleotide or amino acid sequence into operational taxonomic units (United States Patent US20160103958A1). [https://patents.google.com/patent/US20160103958A1)
Liu, S., Wu, Y., Wei, E., Liu, M., & Liu, Y. (2013). StoryFlow: Tracking the Evolution of Stories. IEEE Transactions on Visualization and Computer Graphics, 19(12), 2436–2445. https://doi.org/10.1109/TVCG.2013.196
Meier, R., Blaimer, B.B., Buenaventura, E., Hartop, E., von Rintelen, T., Srivathsan, A. and Yeo, D. (2022), A re-analysis of the data in Sharkey et al.’s (2021) minimalist revision reveals that BINs do not deserve names, but BOLD Systems needs a stronger commitment to open science. Cladistics, 38: 264-275. https://doi.org/10.1111/cla.12489
Prieto, C., Faynel, C., Robbins, R., & Hausmann, A. (2021). Congruence between morphology-based species and Barcode Index Numbers (BINs) in Neotropical Eumaeini (Lycaenidae). PeerJ, 9, e11843. https://doi.org/10.7717/peerj.11843
Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. PLOS ONE, 8(7), e66213. https://doi.org/10.1371/journal.pone.0066213

Friday, April 11, 2025

Future interfaces for the Biodiversity Heritage Library

On Wednesday this week (April 9th, 2025) I gave a talk entitled “Future interface(s) for BHL” (the slides are on FigShare) at BHL Day 2025. My goal was to introduce “BHL-Light”, an exploration of an alternative interface to the Biodiversity Heritage Library (BHL). As some readers may already know, BHL is coming to a crossroads, and so this presentation felt a bit more urgent than my usual “here’s yet another web site I made”.

BHL-Light

BHL-Light is my attempt to explore other ways of navigating BHL. The current interface is somewhat dated, and I wanted to start from scratch and see what might be possible to create, even for someone with my somewhat limited skills. BHL-Light has only a very small subset of BHL’s content, I’m putting scalability issues to one side so that I can have some fun.

The tech (TL;DR BHL was not harmed in the making of this)

Under the hood, BHL-Light stores BHL metadata, OCR text, and layout information as JSON documents in CouchDB (one of my favourite databases for exploring new ideas).

BHL serves its images from Internet Archive, which is not always available. BHL recently uploaded images to AWS, but the images there are not currently viewable on the web. So I ended up creating my own image server. I used Hetzner’s S3-compatible object storage for the image files, added imgproxy to resize images as needed, and finally put all this behind a Cloudflare CDN to speed up image delivery (and reduce traffic to the S3 store, which becomes a real consideration when one is paying for all of this).

To view BHL content (e.g., books, journal volumes) I wrote my own viewer, modelled loosely on Google Books. I expressly wanted to avoid IIIF because I find IIIF viewers a terrible way to view documents, and for me BHL is all about the text.

The web site itself is a few PHP scripts to glue everything together, and I’ve tried to avoid using Javascript unless absolutely necessary. HTML + CSS is really powerful these days, so you can do a lot without resorting to Javascript.

Tour

In building BHL-Light I’ve wanted a simple interface to concentrate on displaying content as much as possible. I also wanted a cleaner interface, one that is responsive (AKA "mobile friendly").

BHL has some extraordinary content. It has works both old and new.

Text

The viewer I built makes it easy to scroll through an item, and also makes text selectable (something you currently can’t do in BHL. This means you can interact with text in the browser, such as using Google Chrome to translate part of the text.

It also opens up the possibility of annotation using Hypothes.is.

Geotagging and maps

I also demonstrated pages that had been geotagged. These tags can be extracted and used to create an interactive map.

I still haven’t decided on the best way to interact with the map. For example, should we use the map to search for content geographically, or should we search for content and display those results on a map, or both? I ran out of time to resolve this, so for now if you click on the map you see a H3 hexagon that encloses where you click. The idea is that then the page would display BHL content within that area. Other idea ideas include something like Frankenplace or JournalMap.

Document layout

For me one of the most exciting areas for the future is adding document layout information to BHL content, such that not only can we identify articles, but figures, tables, references, etc. In this way BHL could finally offer something akin to what Plazi can deliver: structured text about species. This has seemed a challenging task, but recent AI developments have been a game changer. In particular, Datalab have released powerful and simple-to-use tools that do a very good job of retrieving document structure from scanned pages. I have started to use this on BHL content and display the results on BHL-Light. For example, Datalab makes it almost trivial to identify and extract figures from scanned pages. Below is a comparsion of document layout for a page as inferred from a born-digital PDF by Plazi, and the same page in BHL where it is simply an image, but Datalab's methods have inferred which bits are text, figures, captions, etc.

One unexpected consequence of building my own image server (see above) is that the task of displaying figures by cropping page images becomes almost trivial. This idea was inspired in part by Smits et al.’s approach of cropping Internet Archive images.

What’s next?

There is much to do. BHL-Light is missing many features. It doesn’t make it easy to find content such as articles found by BioStor, the project I started over a decade ago to find articles in BHL. Search is rudimentary at best, and I haven’t tackled taxonomic names yet (but have ideas for this).

For me BHL-Light is a fun way to explore BHL, and its development has made me even more aware of all the work done to create the current and maintain the BHL portal. Apart from being a play thing for me, I am curious as to whether BHL-Light might be a way to have “BHL-mini” portals, rather like GBIF hosted portals. In this way, we could have views of BHL focused on a particular taxon, institution, person, etc., or localised by language and/or country. Perhaps we could de-extinct past projects such as BHL-Europe?

References

Page, R.D. Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics 12, 187 (2011). https://doi.org/10.1186/1471-2105-12-187

Page, Roderic (2025). Future interface(s) for BHL. figshare. Presentation. https://doi.org/10.6084/m9.figshare.28777868.v1

Smits, T., Warner, B., Fyfe, P., & Lee, B. C. G. (2025). A Fully-Searchable Multimodal Dataset of the Illustrated London News, 1842–1890. Journal of Open Humanities Data, 11(1), 10. https://doi.org/10.5334/johd.284

Wednesday, February 26, 2025

BOLD View: exploring DNA barcodes

For a while now I’ve been exploring ways to navigate through DNA barcodes. Over the years I’ve built various “toys” to explore barcodes, such as Displaying a million DNA barcodes on Google Maps using CouchDB, built a small scale browser using Elastic search that had some succes, and discovered that Postgres can search for DNA sequences and it’s really fast. At the same time, I’ve bemoaned the challenges of getting barcode data into GBIF, and the current state of BOLD’s data exports.

Over the last few months I’ve been getting a project to the point where it’s usable, and today I’ve released a live version called BOLD view. Why make a portal to DNA barcodes when BOLD have themselves recently released a new version of their own portal you might ask? There are two reasons. Making my own forces me to explore the barcode data in some detail, which is eye-opening in places. The second reason is that I want to be able to explore the barcode data at various levels and in different ways. For example, I want an interactive global map of barcodes.

I want to see a DNA barcode in context, including a phylogeny that includes barcodes both within and outside the BIN the barcode belongs too.

I want to make the imagery more visible.

I want to be able to navigate the taxonomy underlying the barcodes using tools such as summary trees.

I want to be able to input a DNA search and quickly search for matches.

I also want to be able to connect the barcodes to the science behind them (who created the barcodes and what questions were they addressing?).

Abve all, I just want to be able to explore the data. I don’t want donut charts and dashboards. I want to be able to see the data and the connections. There is still much to be done, in particular I want to visualise sequence alignments. We can have a global map, and a global taxonomy, where is the global alignment?

I hope to work on BOLD view further, but for now it is out the door and my spotlight will inevitably turn elsewhere.

Tuesday, October 29, 2024

Internet Archive as a single point of failure

How to cite: Page, R. (2024). Internet Archive as a single point of failure https://doi.org/10.59350/1r3m1-c5e22

Just a placeholder to mark the ongoing impact of the Internet Archive being attacked (see here, here and here for details).

The impact of this on the Biodiversity Heritage Library (BHL) has been huge, and reveals the extent to which BHL depends on the Archive. The Archive is:

BHL’s long-term archival storage of book scans
BHL’s processing pipeline for converting images to text
BHL’s store for additional metadata (e.g., page numbers)
BHL’s image server (i.e., all the images of scanned books on the BHL website are served from the Archive)

The attack on the Archive has crippled BHL (parts are slowly coming back). I think this is time for a fundamental rethink in how BHL manages its data, its processing pipeline, and how it serves images.

Friday, October 18, 2024

Exploring BOLD's DNA barcode data releases: there's a fraction too much friction

How to cite: Page, R. (2024). Exploring BOLD's DNA barcode data releases: there's a fraction too much friction https://doi.org/10.59350/6qepn-ge510

Recently I’ve been exploring data downloaded from BOLD. Part of this was motivated by work done with David Schindel for a recent book:

In this blog post I record some struggles I’ve had with the supposedly “Frictionless” data provided by BOLD. I list a serious of issues, and make some recommendations as to how these can be fixed.

Previous versions disappear from site

The web page Data Packages lists datasets that can be downloaded.

The two most recent have the DOIs:

While this makes it easy to link to the latest version of the data, it inhibits reproducibility because the data doi:10.5883/DP-Latest points to can change, and there is (currently) no unique DOI for that particular dataset. Once a version becomes the third oldest, then it seems to get a version-specific DOI.

The list of “Historical Data” releases is not exhaustive. Currently (2024-10-17) there are four older versions listed:

However I have downloaded older versions that are no longer listed on the Data Packages web page. This makes it hard for anyone wanting to trace the history of changes in BOLD data.

Recommendation

Have distinct DOIs for latest version (i.e., in addition to “10.5883/DP-Latest”), and keep list of all releases on the web site.

Even better, switch to using Zenodo to store data, they provide a nicer model of versioning, and can also provide download metrics.

Where are the images?

A major surprise is the lack of URLs for specimen images. The imagery in BOLD is very useful, yet not included in the data export! The only way to get a list of image URLs is to get the data from GBIF(!).

Recommendation

Include images URLs in the data releases.

Column names change over time and aren’t standardised

One major source of frustration is that the labels used for the columns of data can change. This is, pun intended, a major source of friction in supposedly “frictionless” data. Code that successfully parses one dataset may fail with the new release. One could argue that the code should rely solely on the information in the data package, but there are some columns (e.g., geographic coordinates, the raw sequences) that require special treatment, and the natural language descriptions of each column are not machine-readable.

Below is a table of column names from a range of data packages from 2022-09-28 to 2024-09-06. Most column names are stable, but sometimes new ones appear, and sometimes they vanish. Core data elements, such as nucleotide sequences can change, e.g. nuc versus nucraw.

column name	2022-09-28	2023-09-29	2023-10-27	2024-07-26	2024-08-02	2024-08-09	2024-09-06
associated_specimen
associated_specimens
associated_taxa
bin_created_date
bin_uri
biome
bold_recordset_code_arr
class
collection_code
collection_date
collection_date_accuracy
collection_date_end
collection_date_start
collection_event_id
collection_note
collection_notes
collection_time
collectors
coord
coord_accuracy
coord_source
country
country/ocean
country_iso
depth
depth_accuracy
ecoregion
elev
elev_accuracy
extrainfo
family
fieldid
funding_src
gb_acs
genus
geoid
habitat
identification
identification_method
identification_rank
identified_by
identifier_email
insdc_acs
inst
kingdom
life_stage
marker_code
museumid
notes
nuc
nuc_basecount
nucraw
order
phylum
primers_forward
primers_reverse
processid
processid_minted_date
province
realm
record_id
recordset_code_arr
region
reproduction
sampleid
sampling_protocol
sector
sequence_run_site
sequence_upload_date
sex
short_note
site
site_code
sovereign_inst
species
species_reference
specimen_linkout
specimenid
subfamily
subspecies
taxid
taxon_name
taxon_rank
taxonomy_notes
tissue_type
tribe
voucher_type

Recomendation

Avoid changing the names of data columns between releases. Adopt standardised terms, such as Darwin Core, wherever possible. Tell us that these are Darwin Core by using the dwc: prefix.

Use identifiers for people

People appear in several places in the data, notably as identifiers and collectors. The BOLD data uses simple text strings (i.e., names) of people, rather than external identifiers such as ORCIDs. This means we miss out on valuable information. For example, for a 2 million sequence subset of the latest release I was curious as to who identified the most specimens. For each of these names I then wanted to find out who they were. For example, are they taxonomists? If so, what is their expertise? What taxonomic papers have they published? Where are they based? If BOLD included ORCID ids it would be easier to answer these questions. Instead, I resorted to Google and produced the following table:

Top 10 identifiers of BOLD specimens:

Name	Affiliation	ORCID
Kate Perez	U of Guelph	0000-0001-5233-1539
Angela Telfer	U of Guelph	0000-0003-1846-6362
Daniel H. Janzen	U of Pennsylvania	0000-0002-7335-5107
Valerie Levesque-Beaudin	U of Guelph	0000-0002-6053-0949
Gergin A. Blagoev	U of Guelph	0000-0003-1844-0779
Renee Miskie	U of Guelph	-
BOLD ID Engine	-	-
Brandon MONG Guo jie	Academia Sinica, Taipei	0000-0002-1673-8021
Paul D.N. Hebert	U of Guelph	0000-0002-3081-6700
Brian Fisher	California Academy of Sciences	0000-0002-4653-3270

Note that most of the top ten identifiers work at the University of Guelph, the home of BOLD. This tells us something about the degree to which BOLD is dependent on its own staff to identify specimens, versus the extent to which it has engaged the wider community.

The flip side of this is that these people are curating an important database. Are they getting credit? Is this curation making its way into Bionomia, which has mechanisms to give credit to this work.

I have also briefly looked at collector names, and it is - as one might expect - something of a mess. The same person’s name is written different ways, text strings representing multiple people are incorrectly split into individual names, etc. The description in the data package is more wishful thinking than reality:

Recommendation

Add ORCID identifiers for people who have identified specimens.

Method of identification not standardised

The value of BOLD as a tool for identifying new sequences depends on the reliability of existing DNA barcodes. How are these identified? The field identification_method is full of a mix of terms. There are all the obvious traps people fall into when not being careful with data. The same term may be spelt differently and/or is capitalised differently. People add qualifiers to a term, such as the date of identification, making it much harder to ask simple questions such as how many sequences have been identified based on their morphology, versus based on their sequences.

So far I’ve found 1889 different terms for identification method, here are the top 20:

BIN Taxonomy Match
BOLD ID Engine Manual
BOLD Sequence Classifier
Morphology
morphology
morphological
BOLD ID Engine (March 2015)
Tree based Identification(April 2016)
Morphological
BIN Taxonomy Match (Mar 2023)
BIN Taxonomy Match (May 2019)
Tree based Identification (Feb 2017)
BIN Taxonomy Match (May 2017)
BIN Taxonomy Match (Oct 2022)
BIN Taxonomy Match (Aug 2023)
BIN Taxonomy Match (Mar 2017)
BOLD ID Engine
BIN Taxonomy Match (Apr 2017)
BIN Taxonomy Match (Jun 2019)
Tree Based Identification (April 2016)

Note that we have “Morphology”, “morphology”, “morphological”, and “Morphological”. How is “BOLD ID Engine Manual” different from “BOLD ID Engine”? Note the use of dates as qualifiers.

Recommendation

Enforce a standardised vocabulary, add additional fields for date and notes on identification.

Voucher type not standardised

The description of the voucher_type reads:

This is patently false. Instead of seven terms are at least 508 for this field. Here are the top 20 terms (* indicates a term from the controlled vocabulary):

Vouchered:Registered Collection*
DNA/Tissue Vouchered Only*
To Be Vouchered:Holdup/Private*
museum voucher
E-Vouchered:DNA/Tissue+Photo*
Voucher Type: Morphological
No Specimen*
Museum voucher, whole specimen in ethanol
Vouchered:Private Collection
Museum Vouchered:Type*
Museum Vouchered:Type Series*
Museum voucher, Whole specimen in ethanol
Museum voucher, whole specimen
Museum Vouchered
Museum voucher, e-voucher
Museum voucher, Whole specimen
Museum voucher, E-vouchered with additional representatives stored in ethanol in parent lot
vouchered: not registered collection
Museum voucher, E-vouchered with additional representatives stored in ethanol
in alcohol (ethanol, 96%)

Recommendation

Enforce the existing controlled vocabulary.

Institutions lack identifiers

Institutions are listed by name. As with any string, there is the potential for different spellings and formatting. Anyone interested in getting metrics for institutional engagement with BOLD, and comparing that to, say, sources of funding, would much rather have identifiers than strings.

Here are the top 20 institutions:

Centre for Biodiversity Genomics
University of Pennsylvania
Area de Conservacion Guanacaste
Mined from GenBank, NCBI
Canadian National Collection of Insects, Arachnids and Nematodes
SNSB, Zoologische Staatssammlung Muenchen
Australian National Insect Collection
University of Malaya, Museum of Zoology
Instituto Nacional de Biodiversidad, Costa Rica
Royal Ontario Museum
California Academy of Sciences
NEON Biorepository at Arizona State University
Research Collection of M. Alex Smith
Smithsonian Tropical Research Institute
University of New Brunswick, Fredericton
York University, Packer Collection
Wellcome Sanger Institute
Smithsonian Institution, National Museum of Natural History
Natural History Museum, London
University of Oulu, Zoological Museum

Recommendation

Add external identifiers for institutions such as RORs.

Summary

Some of these issues raised here are easy to fix, other will require a lot of curation. I suspect that part of the problem is that there’s no evidence that BOLD itself makes use of these data dumps. If you view data exports as somethign you are “supposed to” do, rather than something that you yourself use, then there’s no incentive to make sure the data is fit for purpose. Eating your own dog food is a great way to avoid these problems.

Tuesday, October 08, 2024

The Data Citation Corpus revisited

How to cite: Page, R. (2024). The Data Citation Corpus revisited https://doi.org/10.59350/wvwva-v7125

TL;DR

These are some brief notes on the latest version (v. 2) of the Data Citation Corpus, relased shortly before the Make Data Count Summit 2024, which also included a discussion on the practical uses of the corpus.

I downloaded version 2 from Zenodo doi:10.5281/zenodo.13376773. The data is in JSON format, which I then loaded into CouchDB to play with. Loading the data was relatively quick using CouchDB’s “bulk upload” feature, although building indexes to explore the data takes a little while.

What follows is a series of charts constructed using Vega-Lite.

The top 20 repositories

Top 20 repositories in Data Citation Corpus

The chart above shows the top 20 repositories by number of citations. The biggest repository by some distance is the European Nucleotide Archive, so most of the data being cited are DNA sequences. Those working on biodiversity might be plessed to see that GBIF is 16th out of 20.

Note that Figshare appears twice, as “Figshare” and “figshare”, so there are problems with data cleaning. It’s actually worse than this, because the repository “Taylor & Francis” is an branded instance of Figshare: https://tandf.figshare.com. So if we want to measure the impact of the Figshare repository we will need to cluster the different spellings, as well as check the DOIs for each repository (T&F data DOIs still carry Figshare branding, e.g.doi:10.6084/m9.figshare.14103264.

The vast majority of data in the “Taylor & Francis” is published by Informa UK Ltd which is the parent company of Taylor & Francis. If you visit an article in an Informa journal, such as Enhanced Selectivity of Ultraviolet-Visible Absorption Spectroscopy with Trilinear Decomposition on Spectral pH Measurements for the Interference-Free Determination of Rutin and Isorhamnetin in Chinese Herbal Medicine the supplementary information is stored in Figshare. These links between the publication and the supplementary information are treated as “citations” in the corpus. Is this what we mean by citation? If so, we are not measuring the use of data, but rather its publication.

Top 20 publishers

Top 20 publishers in Data Citation Corpus

The top 20 publishers of articles that cite data in the corpus are shown above. It would be interesting to know how much this reflects publication policies of these publishers (e.g., open versus closed access, indexing in Pubmed, availability of XML, etc.) versus actual citation of data. Biodiversity people might be pleased to see Pensoft appearing in the top 20.

GBIF

You can also explore the data by individual repository. For example, the top 20 publishers of articles citing data in GBIF shows Pensoft at number one. This reflects the subject matter of Pensoft journals, and Pensoft’s focus on best practices for publishing data. From GBIF’s perspective, perhaps that organisation would want to extend their reach beyond core biodiversity journals.

Top 20 publishers citing data from GBIF

Citation

The vast majority of data in the Data Citation Corpus is cited only once.

Frequency of citation

Given that much of these “citations” may be by the publication that makes the data available, it’s not clear to me that the corpus is actually measuring citation (i.e., reuse of the data). Instead it may just be measuring publication (e.g., the link betwene a paper and its supplementary data). To answer this we’d need to drill down into the data more.

Note that some data items have large numbers of citations, the highest is “LY294002” with 9983 citations, with the next being “A549” with 5883 citations. LY294002 is a chemical compound that acts as an inhibitor, and A549 is cell type. The citation corpus regards both as accession numbers for sequences(!). Hence it’s likely that the most cited data records are not data at all, but false matches to other entities, such as chemicals and cells. These false matches are still a major problem for the corpus.

Summary

I think there is so much potential here, but there are significant data quality issues. Anyone basing metrics upon this corpus would need to proceed very carefully.