Tuesday, October 29, 2024

Internet Archive as a single point of failure

How to cite: Page, R. (2024). Internet Archive as a single point of failure https://doi.org/10.59350/1r3m1-c5e22

Just a placeholder to mark the ongoing impact of the Internet Archive being attacked (see here, here and here for details).

The impact of this on the Biodiversity Heritage Library (BHL) has been huge, and reveals the extent to which BHL depends on the Archive. The Archive is:

  • BHL’s long-term archival storage of book scans
  • BHL’s processing pipeline for converting images to text
  • BHL’s store for additional metadata (e.g., page numbers)
  • BHL’s image server (i.e., all the images of scanned books on the BHL website are served from the Archive)

The attack on the Archive has crippled BHL (parts are slowly coming back). I think it is time for a fundamental rethink of how BHL manages its data, its processing pipeline, and how it serves images.

Written with StackEdit.

Friday, October 18, 2024

Exploring BOLD's DNA barcode data releases: there's a fraction too much friction

How to cite: Page, R. (2024). Exploring BOLD's DNA barcode data releases: there's a fraction too much friction https://doi.org/10.59350/6qepn-ge510

Recently I’ve been exploring data downloaded from BOLD. Part of this was motivated by work done with David Schindel for a recent book:

Schindel, D.E., Page, R.M.P. (2024). Creating Virtuous Cycles for DNA Barcoding: A Case Study in Science Innovation, Entrepreneurship, and Diplomacy. In: DeSalle, R. (eds) DNA Barcoding. Methods in Molecular Biology, vol 2744. Humana, New York, NY. doi:10.1007/978-1-0716-3581-0_1

In this blog post I record some struggles I’ve had with the supposedly “Frictionless” data provided by BOLD. I list a series of issues, and make some recommendations as to how these can be fixed.

Previous versions disappear from site

The web page Data Packages lists datasets that can be downloaded.

The two most recent have the DOIs:

While this makes it easy to link to the latest version of the data, it inhibits reproducibility because the data that doi:10.5883/DP-Latest points to can change, and there is (currently) no unique DOI for that particular dataset. Once a version becomes the third most recent, it seems to get a version-specific DOI.

The list of “Historical Data” releases is not exhaustive. Currently (2024-10-17) there are four older versions listed:

However, I have downloaded older versions that are no longer listed on the Data Packages web page. This makes it hard for anyone wanting to trace the history of changes in BOLD data.

Recommendation

Have a distinct DOI for the latest version (i.e., in addition to “10.5883/DP-Latest”), and keep a list of all releases on the web site.

Even better, switch to using Zenodo to store the data: it provides a nicer model of versioning, and can also provide download metrics.
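
Until each release has its own DOI, one workaround is to record a checksum alongside the DOI whenever a package is downloaded, so that two downloads of “10.5883/DP-Latest” can at least be told apart. The following is a minimal sketch of that idea, not anything BOLD provides; the function names and the shape of the provenance record are my own invention.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_download(path, doi, retrieved=None):
    """Build a small provenance record for a downloaded data package.

    The record layout is hypothetical: it simply pairs the (mutable) DOI
    with facts about the file that was actually retrieved.
    """
    path = Path(path)
    return {
        "doi": doi,
        "file": path.name,
        "bytes": path.stat().st_size,
        "sha256": sha256_of_file(path),
        "retrieved": (retrieved or date.today()).isoformat(),
    }


# Usage (the file name is a placeholder for whatever you downloaded):
# record = describe_download("package.tar.gz", "10.5883/DP-Latest")
```

Storing this record next to any analysis makes it possible to say which snapshot was used, even after the DOI has moved on to a newer release.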

Where are the images?

A major surprise is the lack of URLs for specimen images. The imagery in BOLD is very useful, yet not included in the data export! The only way to get a list of image URLs is to get the data from GBIF(!).

Recommendation

Include image URLs in the data releases.

Column names change over time and aren’t standardised

One major source of frustration is that the labels used for the columns of data can change. This is, pun intended, a major source of friction in supposedly “frictionless” data. Code that successfully parses one dataset may fail with the new release. One could argue that the code should rely solely on the information in the data package, but there are some columns (e.g., geographic coordinates, the raw sequences) that require special treatment, and the natural language descriptions of each column are not machine-readable.

Below is a table of column names from a range of data packages from 2022-09-28 to 2024-09-06. Most column names are stable, but sometimes new ones appear, and sometimes they vanish. Even the names of core data elements, such as nucleotide sequences, can change (e.g., nuc versus nucraw).

column name 2022-09-28 2023-09-29 2023-10-27 2024-07-26 2024-08-02 2024-08-09 2024-09-06
associated_specimen
associated_specimens
associated_taxa
bin_created_date
bin_uri
biome
bold_recordset_code_arr
class
collection_code
collection_date
collection_date_accuracy
collection_date_end
collection_date_start
collection_event_id
collection_note
collection_notes
collection_time
collectors
coord
coord_accuracy
coord_source
country
country/ocean
country_iso
depth
depth_accuracy
ecoregion
elev
elev_accuracy
extrainfo
family
fieldid
funding_src
gb_acs
genus
geoid
habitat
identification
identification_method
identification_rank
identified_by
identifier_email
insdc_acs
inst
kingdom
life_stage
marker_code
museumid
notes
nuc
nuc_basecount
nucraw
order
phylum
primers_forward
primers_reverse
processid
processid_minted_date
province
realm
record_id
recordset_code_arr
region
reproduction
sampleid
sampling_protocol
sector
sequence_run_site
sequence_upload_date
sex
short_note
site
site_code
sovereign_inst
species
species_reference
specimen_linkout
specimenid
subfamily
subspecies
taxid
taxon_name
taxon_rank
taxonomy_notes
tissue_type
tribe
voucher_type

Recommendation

Avoid changing the names of data columns between releases. Adopt standardised terms, such as Darwin Core, wherever possible. Tell us that these are Darwin Core by using the dwc: prefix.
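
In the meantime, code that reads these releases has to defend itself. Here is a minimal sketch of one approach: map known variants onto a single canonical name before parsing, and fail loudly if a column the code depends on is missing. It assumes the release is a tab-separated file with a header row. Only the nuc/nucraw pair is something I have actually confirmed as a change to a core field; the other aliases are guesses based on similar-looking names in the table above, and the choice of required columns is arbitrary.

```python
import csv
import io

# Hypothetical alias table mapping variant column names to one canonical name.
# Only nucraw -> nuc is discussed in the post; the rest are assumptions.
COLUMN_ALIASES = {
    "nucraw": "nuc",
    "collection_notes": "collection_note",
    "associated_specimens": "associated_specimen",
}

# Columns this (imaginary) pipeline cannot work without.
REQUIRED = {"processid", "nuc"}


def canonical_header(header):
    """Replace any known alias in a header row with its canonical name."""
    return [COLUMN_ALIASES.get(name, name) for name in header]


def read_records(handle):
    """Yield one dict per row, keyed by canonical column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = canonical_header(next(reader))
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f"release lacks required columns: {sorted(missing)}")
    for row in reader:
        yield dict(zip(header, row))


# A two-line stand-in for a release that uses the "nucraw" label.
example = io.StringIO("processid\tnucraw\nABC123-20\tACGT\n")
records = list(read_records(example))
```

This only papers over the problem: every new release still needs a human to check whether a new label is a rename or a genuinely new field.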

Use identifiers for people

People appear in several places in the data, notably as identifiers and collectors. The BOLD data uses simple text strings (i.e., names) of people, rather than external identifiers such as ORCIDs. This means we miss out on valuable information. For example, for a 2 million sequence subset of the latest release I was curious as to who identified the most specimens. For each of these names I then wanted to find out who they were. For example, are they taxonomists? If so, what is their expertise? What taxonomic papers have they published? Where are they based? If BOLD included ORCID ids it would be easier to answer these questions. Instead, I resorted to Google and produced the following table:

Top 10 identifiers of BOLD specimens:

Name | Affiliation | ORCID
Kate Perez | U of Guelph | 0000-0001-5233-1539
Angela Telfer | U of Guelph | 0000-0003-1846-6362
Daniel H. Janzen | U of Pennsylvania | 0000-0002-7335-5107
Valerie Levesque-Beaudin | U of Guelph | 0000-0002-6053-0949
Gergin A. Blagoev | U of Guelph | 0000-0003-1844-0779
Renee Miskie | U of Guelph | -
BOLD ID Engine | - | -
Brandon MONG Guo jie | Academia Sinica, Taipei | 0000-0002-1673-8021
Paul D.N. Hebert | U of Guelph | 0000-0002-3081-6700
Brian Fisher | California Academy of Sciences | 0000-0002-4653-3270

Note that most of the top ten identifiers work at the University of Guelph, the home of BOLD. This tells us something about the degree to which BOLD is dependent on its own staff to identify specimens, versus the extent to which it has engaged the wider community.
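
For anyone wanting to repeat the tally behind the table above, it amounts to counting the strings in the identified_by column. The sketch below is a reconstruction of the idea rather than the code I ran, and it assumes a tab-separated file with a header row. Because it counts raw strings, two spellings of the same person's name are counted separately, which is precisely the problem identifiers would solve.

```python
import csv
import io
from collections import Counter


def top_identifiers(handle, n=10):
    """Return the n most frequent non-empty values of 'identified_by'."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Short rows yield None for missing fields, so guard against that.
        name = (row.get("identified_by") or "").strip()
        if name:
            counts[name] += 1
    return counts.most_common(n)


# Tiny made-up sample; the real input would be the release TSV file.
sample = io.StringIO(
    "processid\tidentified_by\n"
    "A\tKate Perez\n"
    "B\tKate Perez\n"
    "C\tAngela Telfer\n"
    "D\t\n"
)
ranking = top_identifiers(sample)
```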

The flip side of this is that these people are curating an important database. Are they getting credit? Is this curation making its way into Bionomia, which has mechanisms to give credit for this work?

I have also briefly looked at collector names, and it is - as one might expect - something of a mess. The same person’s name is written different ways, text strings representing multiple people are incorrectly split into individual names, etc. The description in the data package is more wishful thinking than reality:

Comma separated list of full or abbreviated names of the individuals or teams responsible for collecting the sample in the field.

Recommendation

Add ORCID identifiers for people who have identified specimens.

Method of identification not standardised

The value of BOLD as a tool for identifying new sequences depends on the reliability of existing DNA barcodes. How are these identified? The field identification_method is full of a mix of terms. There are all the obvious traps people fall into when not being careful with data. The same term may be spelt differently and/or is capitalised differently. People add qualifiers to a term, such as the date of identification, making it much harder to ask simple questions such as how many sequences have been identified based on their morphology, versus based on their sequences.

So far I’ve found 1889 different terms for identification method; here are the top 20:

  • BIN Taxonomy Match
  • BOLD ID Engine Manual
  • BOLD Sequence Classifier
  • Morphology
  • morphology
  • morphological
  • BOLD ID Engine (March 2015)
  • Tree based Identification(April 2016)
  • Morphological
  • BIN Taxonomy Match (Mar 2023)
  • BIN Taxonomy Match (May 2019)
  • Tree based Identification (Feb 2017)
  • BIN Taxonomy Match (May 2017)
  • BIN Taxonomy Match (Oct 2022)
  • BIN Taxonomy Match (Aug 2023)
  • BIN Taxonomy Match (Mar 2017)
  • BOLD ID Engine
  • BIN Taxonomy Match (Apr 2017)
  • BIN Taxonomy Match (Jun 2019)
  • Tree Based Identification (April 2016)

Note that we have “Morphology”, “morphology”, “morphological”, and “Morphological”. How is “BOLD ID Engine Manual” different from “BOLD ID Engine”? Note the use of dates as qualifiers.

Recommendation

Enforce a standardised vocabulary, add additional fields for date and notes on identification.
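
To give a sense of how much of the variation is superficial, here is a rough sketch that splits a raw identification_method value into a normalised term and an optional qualifier (the trailing text in parentheses, usually a date). The synonym table is my own assumption about which terms mean the same thing; a real controlled vocabulary would need to be agreed with BOLD.

```python
import re

# A trailing "(...)" qualifier, with or without a space before it.
QUALIFIER = re.compile(r"\s*\(([^()]*)\)\s*$")

# Assumed synonyms; extend only after checking what the terms really mean.
SYNONYMS = {"morphological": "morphology"}


def split_method(raw):
    """Return (normalised_term, qualifier_or_None) for one raw value."""
    text = raw.strip()
    qualifier = None
    match = QUALIFIER.search(text)
    if match:
        qualifier = match.group(1).strip()
        text = text[: match.start()]
    # Case-fold and collapse runs of whitespace so spellings compare equal.
    key = " ".join(text.casefold().split())
    return SYNONYMS.get(key, key), qualifier


examples = [
    "Morphology",
    "morphological",
    "BIN Taxonomy Match (Mar 2023)",
    "Tree based Identification(April 2016)",
    "Tree Based Identification (April 2016)",
]
normalised = [split_method(value) for value in examples]
```

With this treatment the two “Tree based Identification” variants collapse into one term with the qualifier “April 2016”, and the four spellings of morphology become one. It does not answer whether “BOLD ID Engine Manual” and “BOLD ID Engine” are the same method; that needs someone who knows.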

Voucher type not standardised

The description of the voucher_type reads:

“Status of the specimen in an accessioning process.This field uses a controlled vocabulary: ‘Museum Vouchered:Type’, ‘Museum Vouchered:Type Series’, ‘Vouchered:Registered Collection’, ‘To Be Vouchered:Holdup/Private’, ‘E-Vouchered:DNA/Tissue+Photo’, ‘Dna/Tissue Vouchered Only’, ‘No Specimen’.”

This is patently false. Instead of seven terms there are at least 508 for this field. Here are the top 20 terms (* indicates a term from the controlled vocabulary):

  • Vouchered:Registered Collection*
  • DNA/Tissue Vouchered Only*
  • To Be Vouchered:Holdup/Private*
  • museum voucher
  • E-Vouchered:DNA/Tissue+Photo*
  • Voucher Type: Morphological
  • No Specimen*
  • Museum voucher, whole specimen in ethanol
  • Vouchered:Private Collection
  • Museum Vouchered:Type*
  • Museum Vouchered:Type Series*
  • Museum voucher, Whole specimen in ethanol
  • Museum voucher, whole specimen
  • Museum Vouchered
  • Museum voucher, e-voucher
  • Museum voucher, Whole specimen
  • Museum voucher, E-vouchered with additional representatives stored in ethanol in parent lot
  • vouchered: not registered collection
  • Museum voucher, E-vouchered with additional representatives stored in ethanol
  • in alcohol (ethanol, 96%)

Recommendation

Enforce the existing controlled vocabulary.
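
Checking a field against a seven-term vocabulary is not hard. As a sketch of what such a check could look like (the function name is mine, and I compare case-insensitively because the description says “Dna/Tissue” while the data says “DNA/Tissue”):

```python
from collections import Counter

# The seven terms quoted in the data package description of voucher_type.
CONTROLLED = (
    "Museum Vouchered:Type",
    "Museum Vouchered:Type Series",
    "Vouchered:Registered Collection",
    "To Be Vouchered:Holdup/Private",
    "E-Vouchered:DNA/Tissue+Photo",
    "Dna/Tissue Vouchered Only",
    "No Specimen",
)

# Case-folded lookup so "DNA/Tissue ..." matches "Dna/Tissue ...".
_LOOKUP = {term.casefold(): term for term in CONTROLLED}


def audit_voucher_types(values):
    """Split values into counts of valid terms and counts of everything else."""
    valid = Counter()
    invalid = Counter()
    for value in values:
        term = _LOOKUP.get(value.strip().casefold())
        if term is None:
            invalid[value] += 1
        else:
            valid[term] += 1
    return valid, invalid


valid, invalid = audit_voucher_types(
    [
        "DNA/Tissue Vouchered Only",
        "museum voucher",
        "No Specimen",
        "Vouchered:Private Collection",
    ]
)
```

Run at submission time, a check like this would reject “museum voucher” and “Vouchered:Private Collection” before they ever reached a data release.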

Institutions lack identifiers

Institutions are listed by name. As with any string, there is the potential for different spellings and formatting. Anyone interested in getting metrics for institutional engagement with BOLD, and comparing that to, say, sources of funding, would much rather have identifiers than strings.

Here are the top 20 institutions:

  • Centre for Biodiversity Genomics
  • University of Pennsylvania
  • Area de Conservacion Guanacaste
  • Mined from GenBank, NCBI
  • Canadian National Collection of Insects, Arachnids and Nematodes
  • SNSB, Zoologische Staatssammlung Muenchen
  • Australian National Insect Collection
  • University of Malaya, Museum of Zoology
  • Instituto Nacional de Biodiversidad, Costa Rica
  • Royal Ontario Museum
  • California Academy of Sciences
  • NEON Biorepository at Arizona State University
  • Research Collection of M. Alex Smith
  • Smithsonian Tropical Research Institute
  • University of New Brunswick, Fredericton
  • York University, Packer Collection
  • Wellcome Sanger Institute
  • Smithsonian Institution, National Museum of Natural History
  • Natural History Museum, London
  • University of Oulu, Zoological Museum

Recommendation

Add external identifiers for institutions such as RORs.

Summary

Some of the issues raised here are easy to fix, others will require a lot of curation. I suspect that part of the problem is that there’s no evidence that BOLD itself makes use of these data dumps. If you view data exports as something you are “supposed to” do, rather than something that you yourself use, then there’s no incentive to make sure the data is fit for purpose. Eating your own dog food is a great way to avoid these problems.


Tuesday, October 08, 2024

The Data Citation Corpus revisited

How to cite: Page, R. (2024). The Data Citation Corpus revisited https://doi.org/10.59350/wvwva-v7125

TL;DR

The Data Citation Corpus is still riddled with errors, and it is unclear to what extent it measures citation (reuse of data) versus publication (are most citations simply links between the data and the original publication?).

These are some brief notes on the latest version (v. 2) of the Data Citation Corpus, released shortly before the Make Data Count Summit 2024, which also included a discussion on the practical uses of the corpus.

I downloaded version 2 from Zenodo doi:10.5281/zenodo.13376773. The data is in JSON format, which I then loaded into CouchDB to play with. Loading the data was relatively quick using CouchDB’s “bulk upload” feature, although building indexes to explore the data takes a little while.
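
For those who haven’t used it, bulk loading into CouchDB means sending batches of documents to the database’s `_bulk_docs` endpoint as a JSON object with a `docs` array. The sketch below shows the general shape using only the Python standard library; it is not the script I used, and the server URL, database name, and batch size are placeholders. Authentication and error handling are left out.

```python
import json
import urllib.request
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def bulk_docs_request(base_url, database, docs):
    """Build (but do not send) a POST request for CouchDB's _bulk_docs."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{database}/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def upload(base_url, database, records, batch_size=1000):
    """Send records to CouchDB one batch at a time."""
    for chunk in batches(records, batch_size):
        request = bulk_docs_request(base_url, database, chunk)
        with urllib.request.urlopen(request) as response:
            response.read()  # per-document statuses; ignored in this sketch


# Usage, assuming a local server and a database that already exists:
# upload("http://localhost:5984", "corpus", records)
```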

What follows is a series of charts constructed using Vega-Lite.

The top 20 repositories

Top 20 repositories in Data Citation Corpus

The chart above shows the top 20 repositories by number of citations. The biggest repository by some distance is the European Nucleotide Archive, so most of the data being cited are DNA sequences. Those working on biodiversity might be pleased to see that GBIF is 16th out of 20.

Note that Figshare appears twice, as “Figshare” and “figshare”, so there are problems with data cleaning. It’s actually worse than this, because the repository “Taylor & Francis” is a branded instance of Figshare: https://tandf.figshare.com. So if we want to measure the impact of the Figshare repository we will need to cluster the different spellings, as well as check the DOIs for each repository (T&F data DOIs still carry Figshare branding, e.g., doi:10.6084/m9.figshare.14103264).
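
As a sketch of what that clustering might look like: treat any record whose DOI starts with the prefix seen in the example above (10.6084) as Figshare, whatever the repository is called, and otherwise fall back to a case-folded repository name. I am assuming that each record can be reduced to a (repository name, dataset identifier) pair. The prefix test will miss Figshare instances that mint DOIs under a different prefix, so this is a lower bound.

```python
from collections import defaultdict

# DOI prefix taken from the example DOI in the post.
FIGSHARE_DOI_PREFIX = "10.6084/"


def repository_key(name, dataset_id):
    """Return a clustering key for a repository name and dataset identifier."""
    doi = dataset_id.strip().casefold()
    # Strip common ways of writing a DOI as a link or CURIE.
    for prefix in ("https://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    if doi.startswith(FIGSHARE_DOI_PREFIX):
        return "figshare"
    return " ".join(name.casefold().split())


def cluster(records):
    """Group the repository names found in (name, dataset_id) pairs by key."""
    clusters = defaultdict(set)
    for name, dataset_id in records:
        clusters[repository_key(name, dataset_id)].add(name)
    return dict(clusters)


# Made-up records apart from the Taylor & Francis DOI quoted in the post.
groups = cluster(
    [
        ("Figshare", "10.6084/m9.figshare.1"),
        ("figshare", "doi:10.6084/m9.figshare.2"),
        ("Taylor & Francis", "https://doi.org/10.6084/m9.figshare.14103264"),
        ("GBIF", "10.15468/dl.example"),
    ]
)
```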

The vast majority of data in the “Taylor & Francis” repository is published by Informa UK Ltd, which is the parent company of Taylor & Francis. If you visit an article in an Informa journal, such as Enhanced Selectivity of Ultraviolet-Visible Absorption Spectroscopy with Trilinear Decomposition on Spectral pH Measurements for the Interference-Free Determination of Rutin and Isorhamnetin in Chinese Herbal Medicine, the supplementary information is stored in Figshare. These links between the publication and the supplementary information are treated as “citations” in the corpus. Is this what we mean by citation? If so, we are not measuring the use of data, but rather its publication.

Top 20 publishers

Top 20 publishers in Data Citation Corpus

The top 20 publishers of articles that cite data in the corpus are shown above. It would be interesting to know how much this reflects publication policies of these publishers (e.g., open versus closed access, indexing in Pubmed, availability of XML, etc.) versus actual citation of data. Biodiversity people might be pleased to see Pensoft appearing in the top 20.

GBIF

You can also explore the data by individual repository. For example, the top 20 publishers of articles citing data in GBIF shows Pensoft at number one. This reflects the subject matter of Pensoft journals, and Pensoft’s focus on best practices for publishing data. From GBIF’s perspective, perhaps that organisation would want to extend their reach beyond core biodiversity journals.

Top 20 publishers citing data from GBIF

Citation

The vast majority of data in the Data Citation Corpus is cited only once.

Frequency of citation

Given that many of these “citations” may be by the publication that makes the data available, it’s not clear to me that the corpus is actually measuring citation (i.e., reuse of the data). Instead it may just be measuring publication (e.g., the link between a paper and its supplementary data). To answer this we’d need to drill down into the data more.

Note that some data items have large numbers of citations: the highest is “LY294002” with 9983 citations, with the next being “A549” with 5883 citations. LY294002 is a chemical compound that acts as an inhibitor, and A549 is a cell type. The citation corpus regards both as accession numbers for sequences(!). Hence it’s likely that the most cited data records are not data at all, but false matches to other entities, such as chemicals and cells. These false matches are still a major problem for the corpus.
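
A cheap way to surface these false matches is to look hard at the head of the citation distribution, since that is where they congregate. The sketch below counts distinct citing publications per data item, summarises how many items are cited once, twice, and so on, and flags top-cited items that appear on a hand-made list of known non-data entities. The input format (pairs of data item and publication) and the deny list are my assumptions; a real check would look identifiers up in chemical and cell line databases.

```python
from collections import Counter

# Hypothetical deny list seeded with the two examples from the post.
KNOWN_NON_DATA = {
    "LY294002": "chemical compound",
    "A549": "cell type",
}


def citation_counts(pairs):
    """Count distinct citing publications for each data item."""
    distinct = set(pairs)  # ignore repeated (item, publication) pairs
    return Counter(item for item, _publication in distinct)


def frequency_of_counts(counts):
    """Map 'number of citations' to 'number of data items with that many'."""
    return Counter(counts.values())


def suspicious(counts, n=20):
    """Return top-n cited items that are on the deny list, with the reason."""
    return [
        (item, total, KNOWN_NON_DATA[item])
        for item, total in counts.most_common(n)
        if item in KNOWN_NON_DATA
    ]


# Made-up pairs for illustration only.
pairs = [
    ("A549", "pub1"),
    ("A549", "pub2"),
    ("A549", "pub2"),
    ("AB123456", "pub3"),
]
counts = citation_counts(pairs)
flagged = suspicious(counts)
```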

Summary

I think there is so much potential here, but there are significant data quality issues. Anyone basing metrics upon this corpus would need to proceed very carefully.
