Friday, October 18, 2024

Exploring BOLD's DNA barcode data releases: there's a fraction too much friction

Recently I’ve been exploring data downloaded from BOLD. Part of this was motivated by work done with David Schindel for a recent book:

Schindel, D.E., Page, R.M.P. (2024). Creating Virtuous Cycles for DNA Barcoding: A Case Study in Science Innovation, Entrepreneurship, and Diplomacy. In: DeSalle, R. (eds) DNA Barcoding. Methods in Molecular Biology, vol 2744. Humana, New York, NY. doi:10.1007/978-1-0716-3581-0_1

In this blog post I record some struggles I’ve had with the supposedly “Frictionless” data provided by BOLD. I list a series of issues, and make some recommendations as to how these can be fixed.

Previous versions disappear from site

The web page Data Packages lists datasets that can be downloaded.

The two most recent have the DOIs:

While this makes it easy to link to the latest version of the data, it inhibits reproducibility because the data doi:10.5883/DP-Latest points to can change, and there is (currently) no unique DOI for that particular dataset. Once a version becomes the third oldest, then it seems to get a version-specific DOI.

The list of “Historical Data” releases is not exhaustive. Currently (2024-10-17) there are four older versions listed:

However I have downloaded older versions that are no longer listed on the Data Packages web page. This makes it hard for anyone wanting to trace the history of changes in BOLD data.

Recommendation

Have a distinct DOI for the latest version (i.e., in addition to “10.5883/DP-Latest”), and keep a list of all releases on the web site.

Even better, switch to using Zenodo to store the data: it provides a nicer model of versioning, and can also provide download metrics.

Where are the images?

A major surprise is the lack of URLs for specimen images. The imagery in BOLD is very useful, yet not included in the data export! The only way to get a list of image URLs is to get the data from GBIF(!).

Recommendation

Include image URLs in the data releases.

Column names change over time and aren’t standardised

One major source of frustration is that the labels used for the columns of data can change. This is, pun intended, a major source of friction in supposedly “frictionless” data. Code that successfully parses one dataset may fail with the new release. One could argue that the code should rely solely on the information in the data package, but there are some columns (e.g., geographic coordinates, the raw sequences) that require special treatment, and the natural language descriptions of each column are not machine-readable.

Below is a table of column names from a range of data packages from 2022-09-28 to 2024-09-06. Most column names are stable, but sometimes new ones appear, and sometimes they vanish. Even the names of core data elements, such as nucleotide sequences, can change, e.g. nuc versus nucraw.

column name 2022-09-28 2023-09-29 2023-10-27 2024-07-26 2024-08-02 2024-08-09 2024-09-06
associated_specimen
associated_specimens
associated_taxa
bin_created_date
bin_uri
biome
bold_recordset_code_arr
class
collection_code
collection_date
collection_date_accuracy
collection_date_end
collection_date_start
collection_event_id
collection_note
collection_notes
collection_time
collectors
coord
coord_accuracy
coord_source
country
country/ocean
country_iso
depth
depth_accuracy
ecoregion
elev
elev_accuracy
extrainfo
family
fieldid
funding_src
gb_acs
genus
geoid
habitat
identification
identification_method
identification_rank
identified_by
identifier_email
insdc_acs
inst
kingdom
life_stage
marker_code
museumid
notes
nuc
nuc_basecount
nucraw
order
phylum
primers_forward
primers_reverse
processid
processid_minted_date
province
realm
record_id
recordset_code_arr
region
reproduction
sampleid
sampling_protocol
sector
sequence_run_site
sequence_upload_date
sex
short_note
site
site_code
sovereign_inst
species
species_reference
specimen_linkout
specimenid
subfamily
subspecies
taxid
taxon_name
taxon_rank
taxonomy_notes
tissue_type
tribe
voucher_type
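
One way to soften this on the consumer side is to map known variants onto a single name before doing anything else. The sketch below assumes the release is a tab-separated file with a header row. The pairings in `COLUMN_ALIASES` are guesses based purely on how similar the names in the table above look (only nuc versus nucraw is mentioned in the text), and the two tiny "releases" are made up for illustration.

```python
import csv
import io

# Assumed pairings, inferred only from the similarity of the column names;
# BOLD does not document these as equivalent.
COLUMN_ALIASES = {
    "nucraw": "nuc",
    "collection_note": "collection_notes",
    "associated_specimen": "associated_specimens",
    "gb_acs": "insdc_acs",
    "bold_recordset_code_arr": "recordset_code_arr",
}


def canonical_name(column):
    """Return the preferred name for a column, or the name itself if unknown."""
    return COLUMN_ALIASES.get(column, column)


def read_records(handle):
    """Yield each row of a tab-separated release as a dict with canonical keys."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {canonical_name(key): value for key, value in row.items()}


# Two made-up releases that differ only in the name of the sequence column.
release_a = io.StringIO("processid\tnucraw\nABC001\tACGT\n")
release_b = io.StringIO("processid\tnuc\nABC001\tACGT\n")
rows_a = list(read_records(release_a))
rows_b = list(read_records(release_b))
print(rows_a == rows_b)
```

If a release ever contained both members of a pair, the later column would silently overwrite the earlier one, so a real parser would want to check for that.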

Recommendation

Avoid changing the names of data columns between releases. Adopt standardised terms, such as Darwin Core, wherever possible. Tell us that these are Darwin Core by using the dwc: prefix.

Use identifiers for people

People appear in several places in the data, notably as identifiers and collectors. The BOLD data uses simple text strings (i.e., names) of people, rather than external identifiers such as ORCIDs. This means we miss out on valuable information. For example, for a 2 million sequence subset of the latest release I was curious as to who identified the most specimens. For each of these names I then wanted to find out who they were. For example, are they taxonomists? If so, what is their expertise? What taxonomic papers have they published? Where are they based? If BOLD included ORCID ids it would be easier to answer these questions. Instead, I resorted to Google and produced the following table:

Top 10 identifiers of BOLD specimens:

Name | Affiliation | ORCID
Kate Perez | U of Guelph | 0000-0001-5233-1539
Angela Telfer | U of Guelph | 0000-0003-1846-6362
Daniel H. Janzen | U of Pennsylvania | 0000-0002-7335-5107
Valerie Levesque-Beaudin | U of Guelph | 0000-0002-6053-0949
Gergin A. Blagoev | U of Guelph | 0000-0003-1844-0779
Renee Miskie | U of Guelph | -
BOLD ID Engine | - | -
Brandon MONG Guo jie | Academia Sinica, Taipei | 0000-0002-1673-8021
Paul D.N. Hebert | U of Guelph | 0000-0002-3081-6700
Brian Fisher | California Academy of Sciences | 0000-0002-4653-3270
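
For anyone wanting to repeat this kind of tally, a minimal sketch is given here. It assumes a tab-separated export with a header row and counts the raw strings in the identified_by column (one of the column names listed earlier). The sample rows are invented, and this is not necessarily how the ranking above was produced.

```python
import csv
import io
from collections import Counter


def top_identifiers(handle, n=10, column="identified_by"):
    """Count the non-empty values of one column in a tab-separated file."""
    counts = Counter()
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        name = (row.get(column) or "").strip()
        if name:
            counts[name] += 1
    return counts.most_common(n)


# Invented sample rows, only to show the shape of the input.
sample = io.StringIO(
    "processid\tidentified_by\n"
    "X1\tKate Perez\n"
    "X2\tAngela Telfer\n"
    "X3\tKate Perez\n"
    "X4\t\n"
)
ranking = top_identifiers(sample)
print(ranking)
```

Because the counts are over name strings, two spellings of the same person are counted separately, which is exactly the problem identifiers such as ORCIDs would remove.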

Note that most of the top ten identifiers work at the University of Guelph, the home of BOLD. This tells us something about the degree to which BOLD is dependent on its own staff to identify specimens, versus the extent to which it has engaged the wider community.

The flip side of this is that these people are curating an important database. Are they getting credit? Is this curation making its way into Bionomia, which has mechanisms to give credit to this work?

I have also briefly looked at collector names, and it is - as one might expect - something of a mess. The same person’s name is written different ways, text strings representing multiple people are incorrectly split into individual names, etc. The description in the data package is more wishful thinking than reality:

Comma separated list of full or abbreviated names of the individuals or teams responsible for collecting the sample in the field.

Recommendation

Add ORCID identifiers for people who have identified specimens.

Method of identification not standardised

The value of BOLD as a tool for identifying new sequences depends on the reliability of existing DNA barcodes. How are these identified? The field identification_method is full of a mix of terms. There are all the obvious traps people fall into when not being careful with data. The same term may be spelt differently and/or is capitalised differently. People add qualifiers to a term, such as the date of identification, making it much harder to ask simple questions such as how many sequences have been identified based on their morphology, versus based on their sequences.

So far I’ve found 1889 different terms for identification method, here are the top 20:

  • BIN Taxonomy Match
  • BOLD ID Engine Manual
  • BOLD Sequence Classifier
  • Morphology
  • morphology
  • morphological
  • BOLD ID Engine (March 2015)
  • Tree based Identification(April 2016)
  • Morphological
  • BIN Taxonomy Match (Mar 2023)
  • BIN Taxonomy Match (May 2019)
  • Tree based Identification (Feb 2017)
  • BIN Taxonomy Match (May 2017)
  • BIN Taxonomy Match (Oct 2022)
  • BIN Taxonomy Match (Aug 2023)
  • BIN Taxonomy Match (Mar 2017)
  • BOLD ID Engine
  • BIN Taxonomy Match (Apr 2017)
  • BIN Taxonomy Match (Jun 2019)
  • Tree Based Identification (April 2016)

Note that we have “Morphology”, “morphology”, “morphological”, and “Morphological”. How is “BOLD ID Engine Manual” different from “BOLD ID Engine”? Note the use of dates as qualifiers.
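
A rough sketch of the clean-up a data user currently has to do: fold case, collapse whitespace, and peel off a trailing parenthesised qualifier so that the term and its date can be handled separately. The `SYNONYMS` mapping is a hypothetical example, and a real vocabulary would need decisions this sketch does not make (for instance, whether “BOLD ID Engine Manual” is a distinct method).

```python
import re

# A trailing parenthesised qualifier, e.g. "(Mar 2023)" or "(April 2016)".
QUALIFIER = re.compile(r"\s*\(([^()]*)\)\s*$")

# Hypothetical mapping of variant spellings to one preferred label.
SYNONYMS = {"morphological": "morphology"}


def normalise_method(raw):
    """Split a raw value into (normalised term, qualifier or None)."""
    text = raw.strip()
    qualifier = None
    match = QUALIFIER.search(text)
    if match:
        qualifier = match.group(1).strip()
        text = text[: match.start()]
    # Lower-case and collapse runs of whitespace.
    term = " ".join(text.lower().split())
    return SYNONYMS.get(term, term), qualifier


for value in ["Morphology", "morphological", "Tree based Identification(April 2016)"]:
    print(normalise_method(value))
```

With this, “Tree based Identification(April 2016)” and “Tree Based Identification (April 2016)” collapse to the same term with the same qualifier.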

Recommendation

Enforce a standardised vocabulary, add additional fields for date and notes on identification.

Voucher type not standardised

The description of the voucher_type reads:

“Status of the specimen in an accessioning process.This field uses a controlled vocabulary: ‘Museum Vouchered:Type’, ‘Museum Vouchered:Type Series’, ‘Vouchered:Registered Collection’, ‘To Be Vouchered:Holdup/Private’, ‘E-Vouchered:DNA/Tissue+Photo’, ‘Dna/Tissue Vouchered Only’, ‘No Specimen’.”

This is patently false. Instead of seven terms there are at least 508 for this field. Here are the top 20 terms (* indicates a term from the controlled vocabulary):

  • Vouchered:Registered Collection*
  • DNA/Tissue Vouchered Only*
  • To Be Vouchered:Holdup/Private*
  • museum voucher
  • E-Vouchered:DNA/Tissue+Photo*
  • Voucher Type: Morphological
  • No Specimen*
  • Museum voucher, whole specimen in ethanol
  • Vouchered:Private Collection
  • Museum Vouchered:Type*
  • Museum Vouchered:Type Series*
  • Museum voucher, Whole specimen in ethanol
  • Museum voucher, whole specimen
  • Museum Vouchered
  • Museum voucher, e-voucher
  • Museum voucher, Whole specimen
  • Museum voucher, E-vouchered with additional representatives stored in ethanol in parent lot
  • vouchered: not registered collection
  • Museum voucher, E-vouchered with additional representatives stored in ethanol
  • in alcohol (ethanol, 96%)

Recommendation

Enforce the existing controlled vocabulary.

Institutions lack identifiers

Institutions are listed by name. As with any string, there is the potential for different spellings and formatting. Anyone interested in getting metrics for institutional engagement with BOLD, and comparing that to, say, sources of funding, would much rather have identifiers than strings.

Here are the top 20 institutions:

  • Centre for Biodiversity Genomics
  • University of Pennsylvania
  • Area de Conservacion Guanacaste
  • Mined from GenBank, NCBI
  • Canadian National Collection of Insects, Arachnids and Nematodes
  • SNSB, Zoologische Staatssammlung Muenchen
  • Australian National Insect Collection
  • University of Malaya, Museum of Zoology
  • Instituto Nacional de Biodiversidad, Costa Rica
  • Royal Ontario Museum
  • California Academy of Sciences
  • NEON Biorepository at Arizona State University
  • Research Collection of M. Alex Smith
  • Smithsonian Tropical Research Institute
  • University of New Brunswick, Fredericton
  • York University, Packer Collection
  • Wellcome Sanger Institute
  • Smithsonian Institution, National Museum of Natural History
  • Natural History Museum, London
  • University of Oulu, Zoological Museum

Recommendation

Add external identifiers for institutions such as RORs.

Summary

Some of the issues raised here are easy to fix, others will require a lot of curation. I suspect that part of the problem is that there’s no evidence that BOLD itself makes use of these data dumps. If you view data exports as something you are “supposed to” do, rather than something that you yourself use, then there’s no incentive to make sure the data is fit for purpose. Eating your own dog food is a great way to avoid these problems.

Written with StackEdit.

Tuesday, October 08, 2024

The Data Citation Corpus revisited

TL;DR

The Data Citation Corpus is still riddled with errors, and it is unclear to what extent it measures citation (reuse of data) versus publication (are most citations between the data and the original publication?).

These are some brief notes on the latest version (v. 2) of the Data Citation Corpus, released shortly before the Make Data Count Summit 2024, which also included a discussion on the practical uses of the corpus.

I downloaded version 2 from Zenodo doi:10.5281/zenodo.13376773. The data is in JSON format, which I then loaded into CouchDB to play with. Loading the data was relatively quick using CouchDB’s “bulk upload” feature, although building indexes to explore the data takes a little while.
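
For readers who want to try the same thing, here is a minimal sketch of a batched load. It assumes that “bulk upload” means CouchDB’s `_bulk_docs` endpoint, that the records are already in memory as a list of dicts, and that the database exists. The database URL and batch size are placeholders, and authentication is left out.

```python
import json
import urllib.request


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_request(db_url, docs):
    """Build (but do not send) a POST to the database's _bulk_docs endpoint."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    return urllib.request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def upload(db_url, records, size=1000):
    """Send records in batches; needs a running CouchDB and an existing database."""
    for chunk in batches(records, size):
        with urllib.request.urlopen(bulk_request(db_url, chunk)) as response:
            response.read()


# Building a request does not touch the network; the URL is a placeholder.
request = bulk_request("http://localhost:5984/corpus", [{"example": 1}])
print(request.get_method(), request.full_url)
```

Calling `upload("http://localhost:5984/corpus", records)` would then send the whole list a thousand documents at a time.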

What follows is a series of charts constructed using Vega-Lite.

The top 20 repositories

Top 20 repositories in Data Citation Corpus

The chart above shows the top 20 repositories by number of citations. The biggest repository by some distance is the European Nucleotide Archive, so most of the data being cited are DNA sequences. Those working on biodiversity might be pleased to see that GBIF is 16th out of 20.

Note that Figshare appears twice, as “Figshare” and “figshare”, so there are problems with data cleaning. It’s actually worse than this, because the repository “Taylor & Francis” is a branded instance of Figshare: https://tandf.figshare.com. So if we want to measure the impact of the Figshare repository we will need to cluster the different spellings, as well as check the DOIs for each repository (T&F data DOIs still carry Figshare branding, e.g., doi:10.6084/m9.figshare.14103264).
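
A sketch of what that clustering might look like: fold the case of the repository name, and fall back on the DOI prefix seen in the example above. The set of names and the function itself are illustrative, and a DOI prefix is only a heuristic.

```python
FIGSHARE_DOI_PREFIX = "10.6084/m9.figshare."

# Names treated as Figshare. "taylor & francis" is included because this post
# identifies that repository as a branded Figshare instance.
FIGSHARE_NAMES = {"figshare", "taylor & francis"}


def is_figshare(repository_name, doi=""):
    """Guess whether a record belongs to Figshare from its name or its DOI."""
    name = " ".join(repository_name.casefold().split())
    cleaned = doi.strip().lower()
    # Strip common ways of writing a DOI before comparing prefixes.
    for prefix in ("https://doi.org/", "doi:"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
    return name in FIGSHARE_NAMES or cleaned.startswith(FIGSHARE_DOI_PREFIX)


for name in ["Figshare", "figshare", "Taylor & Francis"]:
    print(name, is_figshare(name))
```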

The vast majority of data in the “Taylor & Francis” repository is published by Informa UK Ltd, which is the parent company of Taylor & Francis. If you visit an article in an Informa journal, such as “Enhanced Selectivity of Ultraviolet-Visible Absorption Spectroscopy with Trilinear Decomposition on Spectral pH Measurements for the Interference-Free Determination of Rutin and Isorhamnetin in Chinese Herbal Medicine”, the supplementary information is stored in Figshare. These links between the publication and the supplementary information are treated as “citations” in the corpus. Is this what we mean by citation? If so, we are not measuring the use of data, but rather its publication.

Top 20 publishers

Top 20 publishers in Data Citation Corpus

The top 20 publishers of articles that cite data in the corpus are shown above. It would be interesting to know how much this reflects publication policies of these publishers (e.g., open versus closed access, indexing in Pubmed, availability of XML, etc.) versus actual citation of data. Biodiversity people might be pleased to see Pensoft appearing in the top 20.

GBIF

You can also explore the data by individual repository. For example, the top 20 publishers of articles citing data in GBIF shows Pensoft at number one. This reflects the subject matter of Pensoft journals, and Pensoft’s focus on best practices for publishing data. From GBIF’s perspective, perhaps that organisation would want to extend their reach beyond core biodiversity journals.

Top 20 publishers citing data from GBIF

Citation

The vast majority of data in the Data Citation Corpus is cited only once.

Frequency of citation

Given that many of these “citations” may be by the publication that makes the data available, it’s not clear to me that the corpus is actually measuring citation (i.e., reuse of the data). Instead it may just be measuring publication (e.g., the link between a paper and its supplementary data). To answer this we’d need to drill down into the data more.

Note that some data items have large numbers of citations: the highest is “LY294002” with 9983 citations, with the next being “A549” with 5883 citations. LY294002 is a chemical compound that acts as an inhibitor, and A549 is a cell line. The citation corpus regards both as accession numbers for sequences(!). Hence it’s likely that the most cited data records are not data at all, but false matches to other entities, such as chemicals and cells. These false matches are still a major problem for the corpus.

Summary

I think there is so much potential here, but there are significant data quality issues. Anyone basing metrics upon this corpus would need to proceed very carefully.


Tuesday, August 13, 2024

Why do museum and gallery displays ignore the web?

This post is inspired by the Pharaoh exhibition at the NGV in Melbourne, Australia. This is a beautifully displayed exhibition of objects from the British Museum, London. It has all the trappings of a modern exhibition: beautiful lighting, a custom sound track, and lots of social media coverage. But I found it immensely frustrating to visit.

The reason for my frustration is the missed opportunity to provide visitors with the means to learn more from each object than a few cursory sentences on a display card. Take, for example, the “Lintel of King Amenemhat III”, for which we learn:

This lintel was originally placed in a temple erected by Amenemhat III. The carving reflects the harmonious symmetry followed inside an Egyptian temple. At the centre is a cartouche enclosing the king’s birth name. This is surrounded by inscriptions that radiate from the centre to the sides of the lintel. Names of the king face references to Sobek, the god of the temple, who is depicted as a crocodile seated on a shrine.

That is all that we are told. Yet on this display card is a cryptic code “EA1072”, which to most visitors is likely to be no less obscure than the hieroglyphs on the object itself. EA1072 is the number of this object in the British Museum collection. Each code can be converted into a URL by appending it to https://www.britishmuseum.org/collection/object/Y_, i.e., https://www.britishmuseum.org/collection/object/Y_EA1072. If we click on that URL we get a wealth of additional information, including a more detailed description, a bibliography, even an explanatory YouTube video.
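
The conversion is a one-liner. In the sketch below the prefix is the one given above; whether every British Museum registration number follows this pattern is an assumption, since only EA1072 is shown here.

```python
from urllib.parse import quote

# Prefix taken from the example above.
BM_OBJECT_PREFIX = "https://www.britishmuseum.org/collection/object/Y_"


def british_museum_url(code):
    """Turn a display-card code such as 'EA1072' into a collection URL."""
    return BM_OBJECT_PREFIX + quote(code.strip(), safe="")


print(british_museum_url("EA1072"))
```

The resulting string is what a QR code on the display card would need to encode.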

So, for each object it would have been trivial for the NGV to include a QR code that would take the visitor to the British Museum’s web site to discover more information about that object, to put the object in context, and learn more about both the object and those who discovered and interpreted it. I’m guessing that most, if not all, visitors had a mobile phone that could read the code and access the internet.

If taking the visitor to the British Museum rather than the NGV’s web site is a problem, why not do the smart thing and reuse the BM’s codes as “slugs” on the end of an NGV URL (much as the BBC did with Wikipedia, see EOL, the BBC, and Wikipedia)? Even better, get the underlying data (does the BM have an API, or a machine-readable version of their web pages) and provide the same information in multiple languages. Melbourne is a modern multicultural city, here was a chance to engage with visitors in languages other than English.

This exhibition seems like an ideal case for the use of persistent identifiers for museums and other collections, something projects such as Towards a National Collection - HeritagePIDs were working towards (see also Persistent Identifiers: A demo and a rant). If we have persistent identifiers, especially if they resolve to machine-readable data, it becomes easy to convert static text into entry points to a much larger digital world of knowledge. Instead we seem happy to give simple snippets of information in one language, and hope the viewer’s interest hasn’t faded away by the time they exit via the gift shop.


Tuesday, July 02, 2024

A future for the Biodiversity Heritage Library

Following the 2024 BHL meeting, the departure of Martin Kalfatovic, and the uncertainty the departure of such a pivotal person brings, perhaps it’s time to think about the future of BHL. Below I sketch some thoughts, which are hazy at best. I should say at the outset that I think BHL is an extraordinary project. My goal is to think about ways to enhance its utility and impact.

Three facets

I think BHL, in common with other projects such as GBIF, has three main facets: providers, users, and developers. These communities have different needs, and what works for one community need not work for the others.

Providers

Any project that mobilises data depends on people and organisations that have that data being willing to share it. That community needs a rationale for sharing, tools to share, and a means to demonstrate the value of sharing. The few BHL meetings I’ve been to have been dominated by libraries (it is a library project, after all). BHL meetings typically feature a tour of physical libraries where we gaze at ancient books, many of which are now accessible via the BHL website. There is value in being a member of a club that shares similar goals (making biodiversity literature accessible to a wider audience). From my perspective, a lot of BHL effort and infrastructure is focussed on libraries and library-related tasks. This is natural given its origins, but this means other aspects have been neglected.

Users (readers and more)

BHL users are likely diverse, and range from people like me who want the “hard core” technical literature (e.g., species descriptions) to people who revel in the wealth of imagery available in BHL (AKA “the pretty”) (see the BHL Flickr pages).

The current BHL portal provides a way for people to browse the scanned content, but feels designed primarily for librarians. It is organised by title and scanned volumes, hence it is driven by bibliographic metadata. For a long time, it didn’t support the notion of an “article”, which is why I ended up building BioStor to extract and display individual articles (the unit most academics work with). BHL is now actively adding articles and minting DOIs for articles, which helps embed its content in the wider scholarly landscape. To date these new DOIs have been cited 56,000 times.

But the current BHL interface is not ideal for viewing articles. We need something simpler and cleaner, and more like the experience offered by modern journal websites.

Developers and data wranglers

I’m lumping developers and data wranglers together: even though these people may have different goals, they share the desire to get past the web interface to the underlying data. BHL has some great APIs that I and others make extensive use of. But this is different from providing a clean interface to the data. BHL has a wealth of information linked to taxonomic names, people, places, and more. Taxonomic indexing by Global Names has made BHL content much more findable, but there is huge scope for indexing on other features. For example, BioStor extracts latitude and longitude pairs from BHL text. These are shown on the map below, indicating the scope for geographic search in BHL.

What’s next?

I think there’s a case to be made to provide three separate interfaces to BHL.

The first would be for the providers (e.g., libraries), which includes all the behind the scenes infrastructure to do with cataloging, etc., and would also include the current portal. The existing BHL interface is important both to show the complete corpus, and also as a place for serendipitous discovery.

The second interface would be for readers. The obvious candidate here is Open Journal Systems (OJS) which powers many journal sites, including Zootaxa, by far the largest taxonomic journal. Indeed I would argue that BHL should adopt OJS and offer it as a service to existing biodiversity journals that may be struggling to manage their existing publishing. Taxonomic publishing has a very long tail of small journals, as the figure below shows (taken from DNA barcoding and taxonomy: dark taxa and dark texts).

This long tail is often hosted on all manner of custom web sites including WordPress blogs, none of which are ideal. There is an opportunity here for BHL to offer hosting as, for example, an affordable service, using the same OJS infrastructure it would use to display BHL articles.

The final interface would be a data portal. The goal here is to enable people to retrieve data in ways that they find useful, for example by taxon, geographic location, etc. In an ideal world this might be a knowledge graph, but the gap between what knowledge graphs promise and what they deliver is still significant. As a first pass, probably the way forward is to define a series of simple data objects in JSON, load these into Elasticsearch and provide an API on top. This is essentially what GBIF does, where the data is in Darwin Core and the queries are searches over that data. This same infrastructure could also power searches over the articles in OJS, so that users could easily find the content they want.
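
To make “simple data objects in JSON” a little more concrete, here is one possible shape for such an object and the newline-delimited payload that Elasticsearch’s bulk API expects (an action line followed by the document). Everything specific here is hypothetical: the `PageRecord` fields, the index name, and the sample values are placeholders, not a proposal for BHL’s actual schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PageRecord:
    """Hypothetical minimal record for one scanned page."""
    page_id: str
    title: str
    taxon_names: list = field(default_factory=list)
    # Points as {"lat": ..., "lon": ...}, an object form usable for a geo_point field.
    locations: list = field(default_factory=list)


def to_bulk_ndjson(records, index="bhl-pages"):
    """Serialise records as action/document line pairs for the bulk API."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index, "_id": record.page_id}}))
        lines.append(json.dumps(asdict(record)))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Made-up example values.
example = PageRecord(
    page_id="page-1",
    title="Example volume",
    taxon_names=["Helictopleurus halffteri"],
    locations=[{"lat": -18.9, "lon": 47.5}],
)
payload = to_bulk_ndjson([example])
print(payload)
```

Searches by taxon or by location then become ordinary queries over the `taxon_names` and `locations` fields.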

This is all pretty arm-wavy at this point, but I think BHL needs to be more outwards facing than it currently is, and needs to think how best to serve the biodiversity community (many of whom are already huge fans of BHL), as well as think of ways to enhance its long term sustainability.


Wednesday, June 19, 2024

Visualising big trees: a talk at the Systematics Association 2024

This blog post has some notes in support of a talk given to the Systematics Association meeting in Reading on June 20th, 2024.

Slides

I will post a link to the slides here once I have given the talk.

Page, Roderic (2024). Visualising big trees. figshare. Presentation. https://doi.org/10.6084/m9.figshare.26068693.v1

Example web sites

Demos

Kew phylogeny

NCBI

Catalogue of Life

Background reading


Tuesday, June 18, 2024

Nanopubs, a way to create even more silos

Pensoft have recently introduced “nanopubs”, small structured publications that can be thought of as containing the minimum possible statement that could be published.

Nanopublications are the smallest units of publishable information: a scientifically meaningful assertion about anything that can be uniquely identified and attributed to its author and serve to communicate a single statement, its original source (provenance) and citation record (publication info). Nanopublications are fully expressed in a way that is both human-readable and machine-interpretable. For more, see https://nanopub.net, Pensoft blog, this video and on our website. Nanopublications

Nanopubs are promoted as FAIR, that is findable, accessible, interoperable, and reusable. I like the idea of nanopubs, but the examples I have seen so far are problematic. As an aside, there are reasons not to be optimistic about nanopubs (or text-mining in general), see The Business of Extracting Knowledge from Academic Publications.

I’m going to focus on one nanopub RAXCvEZfCc, which comes from the paper Towards computable taxonomic knowledge: Leveraging nanopublications for sharing new synonyms in the Madagascan genus Helictopleurus (Coleoptera, Scarabaeinae). This nanopub says that Helictopleurus dorbignyi Montreuil, 2005 is a subjective synonym of Helictopleurus halffteri Balthasar, 1964.

In other words,

This seems a fairly simple thing to say, indeed we could say it with a single triple, but the corresponding nanopub requires 33 RDF triples to say this.

<https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasAssertion> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasProvenance> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasPublicationInfo> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.nanopub.org/nschema#Nanopublication> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://w3id.org/biolink/vocab/OrganismTaxonToOrganismTaxonAssociation> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <http://www.w3.org/2000/01/rdf-schema#comment> "Subjective synonymy based on morphological comparison of the type specimens of the two species names" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/object> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#objtaxon> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . 
<https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/predicate> <http://purl.obolibrary.org/obo/NOMEN_0000285> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/subject> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#subjtaxon> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#objtaxon> <https://w3id.org/kpxl/biodiv/terms/hasTaxonName> <https://www.checklistbank.org/dataset/9880/taxon/3K9T4> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#subjtaxon> <https://w3id.org/kpxl/biodiv/terms/hasTaxonName> <https://www.checklistbank.org/dataset/9880/taxon/3K9ST> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://rs.tdwg.org/dwc/terms/basisOfRecord> <http://rs.tdwg.org/dwc/terms/PreservedSpecimen> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://www.w3.org/ns/prov#wasAttributedTo> <https://orcid.org/0000-0002-1938-6105> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://www.w3.org/ns/prov#wasDerivedFrom> <https://arpha.pensoft.net/preview.php?document_id=22521> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . 
<https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasAlgorithm> "RSA" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasPublicKey> "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCnFtZQdjMpPH4duOBwDybRdPo93QCanFGN8cnpyHqZRQ+FINXypUYCNRSx3VBaWZoLVB/CYCoMY0or/oxBQwl5N7Y/8Ebj+G9ZSNsSkM9uo2DL91f26Y1y2UDE7bnajG909kXQnJS1G59cqIaKyLInjMFD5vWnptysj/ljBv3NTwIDAQAB" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasSignature> "YzTUmwGRmqHiJVyU1A6rPI1bHbAJPS+Zw6hnDPWzZ9a/7TP+yM/HAf5E9BTS3HNKaCgLAHSnsRg5Q0lPauYQyJd9tbLzR6VU/WJv399Z7/qrn4EhgCULkIhrCAkuWzRtSyHMEbuzyu51ZSQCCPgMZ3HwpVtRa+gVDgqu3nsi5x4=" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasSignatureTarget> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/created> "2023-12-24T06:24:14.480Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/creator> <https://orcid.org/0000-0002-1938-6105> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/license> <https://creativecommons.org/licenses/by/4.0/> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . 
<https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/hasNanopubType> <http://purl.obolibrary.org/obo/NOMEN_0000017> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/hasNanopubType> <https://w3id.org/kpxl/biodiv/terms/BiodivNanopub> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/introduces> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://w3id.org/kpxl/biodiv/terms/BiodivNanopub> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/2000/01/rdf-schema#label> "Helictopleurus dorbignyi Montreuil, 2005 (species) - ICZN subjective synonym - Helictopleurus halffteri Balthasar, 1964 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromProvenanceTemplate> <http://purl.org/np/RAYfEAP8KAu9qhBkCtyq_hshOvTAJOcdfIvGhiGwUqB-M> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAA2MfqdBCzmz9yVWjKLXNbyfBNcwsMmOqcNUxkk1maIM> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . 
<https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAR40PzxS9rmUC2lH2ct7IlYhyEib-3GXY5DkuR8wgHRw> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAh1gm83JiG5M6kDxXhaYT1l49nCzyrckMvTzcPn-iv90> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromTemplate> <http://purl.org/np/RAf9CyiP5zzCWN-J0Ts5k7IrZY52CagaIwM-zRSBmhrC8> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://www.checklistbank.org/dataset/9880/taxon/3K9ST> <https://w3id.org/np/o/ntemplate/hasLabelFromApi> "Helictopleurus dorbignyi Montreuil, 2005 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://www.checklistbank.org/dataset/9880/taxon/3K9T4> <https://w3id.org/np/o/ntemplate/hasLabelFromApi> "Helictopleurus halffteri Balthasar, 1964 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> .

In part this is because it includes cryptographic signing, presumably to ensure that the statement is what you think it is. There is also a plethora of information about how the nanopublication was derived. Presumably, this is to satisfy reproducibility concerns. But none of this matters if you are producing data that people can’t easily use.

The core statement looks like this:

This graph is saying that there is a triple

By itself this isn’t terribly useful because neither of the two taxa is a “thing” that has an identifier; they are blank nodes. So, what is the statement about? If we follow the biodiv:hasTaxonName links, we see that there are names associated with these taxa (Helictopleurus dorbignyi and Helictopleurus halffteri), and these are linked to records in a database in ChecklistBank. This seems complicated, but I assume it is equivalent to saying “in this publication we regard taxa with the names Helictopleurus dorbignyi and Helictopleurus halffteri to be the same thing”.
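To see how little of the nanopub is the core statement, one can filter the N-Quads dump down to the quads that sit in the #assertion graph. The following is a minimal sketch using only the Python standard library; the regular expression is a simplification of the N-Quads grammar (not a full parser), and the function name quads_in_graph is my own invention. The sample string contains two quads copied from the dump above.

```python
import re

# One N-Quads term: an IRI in angle brackets, a blank node label, or a quoted
# literal (optionally followed by a ^^<datatype> or @lang suffix).
# This is a simplification of the real grammar, good enough for this dump.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z-]+)?)'
QUAD = re.compile(rf'{TERM}\s+{TERM}\s+{TERM}\s+{TERM}\s+\.')


def quads_in_graph(nquads, graph_suffix):
    """Yield (subject, predicate, object) for quads whose graph IRI ends with graph_suffix."""
    for s, p, o, g in QUAD.findall(nquads):
        if g.rstrip('>').endswith(graph_suffix):
            yield s, p, o


NP = "https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig"

# Two quads taken from the nanopub: one from the assertion, one from the pubinfo.
sample = (
    f'<{NP}#association> <https://w3id.org/biolink/vocab/predicate> '
    f'<http://purl.obolibrary.org/obo/NOMEN_0000285> <{NP}#assertion> . '
    f'<{NP}#sig> <http://purl.org/nanopub/x/hasAlgorithm> "RSA" <{NP}#pubinfo> . '
)

core = list(quads_in_graph(sample, "#assertion"))
for s, p, o in core:
    print(s, p, o)
```

Running the same filter with "#pubinfo" or "#provenance" as the suffix gives a rough feel for how much of the nanopub is bookkeeping rather than biology.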

Interoperability

I feel that I have been banging this drum for years now, but you cannot have interoperability unless you use the same identifiers for the same things. That means persistent identifiers, identifiers that you have some confidence will be around in 10, 20, or 50 years (at least).

Leaving aside the persistence of the nanopubs themselves, I find it alarming that the link to the source of the statement that these two names are synonyms is not the DOI for the paper, 10.3897/BDJ.12.e120304, but a link to the publishing platform ARPHA: https://arpha.pensoft.net/preview.php?document_id=22521. This link takes me to a login page, not the actual publication, so I can’t retrieve the source of the statement made in the nanopublication using the nanopublication itself.

The taxon names have as their identifiers https://www.checklistbank.org/dataset/9880/taxon/3K9T4 and https://www.checklistbank.org/dataset/9880/taxon/3K9ST. These identifiers are also local to a particular dataset. Why not use identifiers such as the Catalogue of Life entries for these names (e.g., https://www.catalogueoflife.org/data/taxon/3K9T4, which supports RDF via embedded JSON-LD) or even LSIDs? We have urn:lsid:organismnames.com:name:2521540 for Helictopleurus halffteri and urn:lsid:organismnames.com:name:1770738 for Helictopleurus dorbignyi.

Interestingly, the one well-known external identifier linked to is the ORCID for the author of the nanopub (0000-0002-1938-6105). I can’t help thinking that this suggests that authorship of the nanopublication is more important than the fact it publishes.

One can imagine that nanopublications will be registered with authors’ ORCID profiles, which helps flesh out their online CV. This is nice, but where is the equivalent for linking the publication to the nanopub via its DOI, or the taxon names to the nanopub? How do we know whether these nanopubs contradict other nanopubs, or support them, or add new information? For example, there seems to be no way to go from the DOI for the paper to the nanopub.

Vocabulary

Another aspect of interoperability is using the same terms to describe relationships. I’m struck by how many different vocabularies the nanopub requires. Some of these are specific to the administrivia of the nanopub, but others are biological.

For example, http://purl.obolibrary.org/obo/NOMEN_0000285 is used to define the relation between the two taxa. I confess it’s unclear to me why NOMEN_0000285 isn’t used to directly link the two ChecklistBank records, rather than the indirection via #subjtaxon and #objtaxon, given that it is a relationship between names (isn’t it?).

Other ontologies include Biolink-Model and biodiv, which I can’t seem to find a description of (the URL resolves to queries on the nanodash site). It amazes me how readily people create new ontologies, especially as in the wider world there is a trend towards one vocabulary to rule them all (schema.org).

Summary

I find it disheartening that the bulk of the information in a nanopub is administrivia about that nanopub. I understand the desire to establish provenance and to cryptographically sign the information, but all this is of limited use if the actual scientific information is poorly expressed.

If nanopubs are to be useful I think they need to:

  • Use persistent identifiers for every entity being referred to, ideally using existing, well-known identifiers. If you are referring to a publication that has a DOI, use that DOI. If you are referring to a taxon or a taxon name, use an appropriate identifier (e.g., an LSID for the name, a URL to a classification).

  • Use simple, existing vocabularies wherever possible. Can you model the data using schema.org (and extensions such as Bioschemas)? If not, are you sure you can’t?
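As a rough illustration of the first recommendation, here is a sketch of what a leaner statement might look like if the names were linked directly using the LSIDs and the source were given as a DOI. This is not how nanopubs are actually modelled. It reuses NOMEN_0000285 and prov:wasDerivedFrom from the nanopub above; the named graph IRI (https://example.org/statement/1), the function name lean_statement, and the direction of the relation (dorbignyi as subject, following the nanopub’s label) are my assumptions.

```python
# Predicates already used in the nanopub above.
NOMEN_SYNONYM = "http://purl.obolibrary.org/obo/NOMEN_0000285"
PROV_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def lean_statement(subject, obj, doi, graph):
    """Return two N-Quads lines: the synonymy itself, and the source of that statement."""
    source = f"https://doi.org/{doi}"
    return [
        # the statement, placed in its own named graph
        f"<{subject}> <{NOMEN_SYNONYM}> <{obj}> <{graph}> .",
        # where the statement came from (default graph)
        f"<{graph}> <{PROV_DERIVED_FROM}> <{source}> .",
    ]


quads = lean_statement(
    subject="urn:lsid:organismnames.com:name:1770738",  # Helictopleurus dorbignyi
    obj="urn:lsid:organismnames.com:name:2521540",      # Helictopleurus halffteri
    doi="10.3897/BDJ.12.e120304",
    graph="https://example.org/statement/1",            # hypothetical graph IRI
)
print("\n".join(quads))
```

Two lines, every identifier either an LSID or a DOI, and anyone who has the DOI or either name can find the statement.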

Unless more care is taken, nanopubs will go the way of much of the RDF world, creating new, even more verbose, even more arcane silos of data. This is partly a consequence of the primary incentive, which is to publish minimal units of information. Given that we now have persistent identifiers for people (ORCIDs) and those identifiers are linked to an infrastructure that can automatically register publications linked to ORCIDs, can we expect to see a flood of nanopubs? What value will these have if we can’t make ready use of the “facts” they assert? How will people build tools on top of nanopubs if the only thing that reliably links to the external world is the ORCID of the person who created it?

Written with StackEdit.

Friday, April 19, 2024

Notes on transforming BHL images

How to cite: Page, R. (2024). Notes on transforming BHL images https://doi.org/10.59350/2gpbb-98a53

I’ve been down this road before, e.g. BHL, DjVu, and reading the f*cking manual and Demo of full-text indexing of BHL using CouchDB hosted by Cloudant, but I’m revisiting converting BHL page scans to black and white images, partly to clean them up, to make them closer to what a modern reader might expect, and partly to reduce the size of the image. The latter means faster loading times and smaller PDFs for articles.

The links above explored using foreground image layers from DjVu (less useful now that DjVu is almost dead as a format), and using CSS in web browsers to convert a colour image to gray scale. I’ve also experimented with the approach taken by Google Books (see https://github.com/rdmpage/google-book-images), which uses jbig2enc to compress images and reduce the number of colours.

In my latest experiments, I use jbig2enc to transform BHL page images into black and white images where each pixel is either black or white (i.e., image depth = 1), then use ImageMagick to resize the image to the Google Books width of 685 pixels and a depth of 2. Typically this gives an image around 25 KB to 30 KB in size. It looks clean and readable.

This approach breaks down for photographs and especially colour plates. For example, this image looks horrible:

When compressing images that have photos or illustrations jbig2enc can extract the part of the image that includes the illustration, for example:

This isn’t perfect, but it raises the possibility that we can convert text and line drawings to black and white, and then add back photographs and plates (whether black and white, or colour). After some experimentation using tools such as ImageMagick composite I have a simple workflow:

  • compress page image using jbig2enc
  • take the extracted illustration and set all white pixels to be transparent
  • convert the black and white image output by jbig2enc to colour (required for the next step)
  • create a composite image by overlaying the extracted illustration (now on a transparent background) on top of the black-and-white page image
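The last three steps of this workflow can be sketched as a handful of ImageMagick calls. This is a minimal sketch rather than the code in the GitHub repository: it assumes jbig2enc has already produced the black-and-white page and the extracted illustration as separate files, and the file names, the 10% fuzz value, and the function names are all placeholders of my own.

```python
import subprocess


def overlay_commands(bw_page, illustration, output, fuzz="10%"):
    """Build ImageMagick command lines for the last three workflow steps."""
    overlay = output + ".overlay.png"    # illustration on a transparent background
    colour_page = output + ".page.png"   # page image promoted to colour
    return [
        # make white (and, via -fuzz, near-white) pixels of the illustration transparent
        ["convert", illustration, "-fuzz", fuzz, "-transparent", "white", overlay],
        # write the 1-bit page as a 24-bit RGB PNG so it can carry a colour overlay
        ["convert", bw_page, "PNG24:" + colour_page],
        # lay the illustration over the page
        ["composite", overlay, colour_page, output],
    ]


def run(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = overlay_commands("page-bw.png", "page-illustration.png", "page-composite.png")
for cmd in commands:
    print(" ".join(cmd))
```

Calling run(commands) executes the steps, provided ImageMagick is installed and the two input files exist. The fuzz value controls how much of the off-white background around the illustration is discarded, and will likely need tuning per scan.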

The result looks passable:

In this case, we still have a lot of the sepia-toned background, the illustration hasn’t been cleanly separated, but we do at least get some colour.

Still work to do, but it looks promising and suggests a way to make dramatically smaller PDFs of BHL content. There are crude code and example files in GitHub.

Update

Some Googling turned up Removing orange tint-mask from color-negatives, which gives us the following command:

convert 16281585.jpg -negate -channel all -normalize -negate -channel all 16281585-rgb.jpg

Applying this to our image results in:

This looks a lot better. Results will vary depending on the evenness of the page scan (i.e., is there a shadow on the image), but I think this gives us a way to display the plates with a higher degree of contrast.
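As far as I understand it, the trick works because -normalize with -channel all stretches the red, green, and blue channels independently, so the brownish paper colour gets pushed towards white in each channel. The toy example below illustrates just that idea in plain Python. It is a simplification: it ignores the small percentage of pixels that -normalize clips at either end, and the two -negate steps, and the pixel values are made up.

```python
def stretch(values):
    """Linearly rescale one channel so its darkest value maps to 0 and its lightest to 255."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # flat channel: nothing to stretch
    return [round((v - lo) * 255 / (hi - lo)) for v in values]


def normalize_channels(pixels):
    """Stretch the R, G and B channels of a list of (r, g, b) pixels independently."""
    r, g, b = (stretch(channel) for channel in zip(*pixels))
    return list(zip(r, g, b))


# A sepia "paper" pixel and a dark "ink" pixel (made-up values).
page = [(230, 210, 170), (60, 50, 40)]
print(normalize_channels(page))  # [(255, 255, 255), (0, 0, 0)]
```

Because each channel is stretched on its own, the sepia paper ends up neutral white rather than a brighter shade of sepia, which is also why an unevenly lit scan (where “paper” is not one colour) gives less predictable results.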

Reading

Adam Langley, Dan S. Bloomberg, “Google Books: making the public domain universally accessible”, Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000H (2007/01/29); doi:10.1117/12.710609
