Tuesday, October 08, 2024

The Data Citation Corpus revisited

TL;DR

The Data Citation Corpus is still riddled with errors, and it is unclear to what extent it measures citation (resuse of data) versus publication (are most citations between the data and the original publication)?

These are some brief notes on the latest version (v. 2) of the Data Citation Corpus, relased shortly before the Make Data Count Summit 2024, which also included a discussion on the practical uses of the corpus.

I downloaded version 2 from Zenodo doi:10.5281/zenodo.13376773. The data is in JSON format, which I then loaded into CouchDB to play with. Loading the data was relatively quick using CouchDB’s “bulk upload” feature, although building indexes to explore the data takes a little while.

What follows is a series of charts constructed using Vega-Lite.

The top 20 repositories

Top 20 repositories in Data Citation Corpus

The chart above shows the top 20 repositories by number of citations. The biggest repository by some distance is the European Nucleotide Archive, so most of the data being cited are DNA sequences. Those working on biodiversity might be plessed to see that GBIF is 16th out of 20.

Note that Figshare appears twice, as “Figshare” and “figshare”, so there are problems with data cleaning. It’s actually worse than this, because the repository “Taylor & Francis” is an branded instance of Figshare: https://tandf.figshare.com. So if we want to measure the impact of the Figshare repository we will need to cluster the different spellings, as well as check the DOIs for each repository (T&F data DOIs still carry Figshare branding, e.g.doi:10.6084/m9.figshare.14103264.

The vast majority of data in the “Taylor & Francis” is published by Informa UK Ltd which is the parent company of Taylor & Francis. If you visit an article in an Informa journal, such as Enhanced Selectivity of Ultraviolet-Visible Absorption Spectroscopy with Trilinear Decomposition on Spectral pH Measurements for the Interference-Free Determination of Rutin and Isorhamnetin in Chinese Herbal Medicine the supplementary information is stored in Figshare. These links between the publication and the supplementary information are treated as “citations” in the corpus. Is this what we mean by citation? If so, we are not measuring the use of data, but rather its publication.

Top 20 publishers

Top 20 publishers in Data Citation Corpus

The top 20 publishers of articles that cite data in the corpus are shown above. It would be interesting to know how much this reflects publication policies of these publishers (e.g., open versus closed access, indexing in Pubmed, availability of XML, etc.) versus actual citation of data. Biodiversity people might be pleased to see Pensoft appearing in the top 20.

GBIF

You can also explore the data by individual repository. For example, the top 20 publishers of articles citing data in GBIF shows Pensoft at number one. This reflects the subject matter of Pensoft journals, and Pensoft’s focus on best practices for publishing data. From GBIF’s perspective, perhaps that organisation would want to extend their reach beyond core biodiversity journals.

Top 20 publishers citing data from GBIF

Citation

The vast majority of data in the Data Citation Corpus is cited only once.

Frequency of citation

Given that much of these “citations” may be by the publication that makes the data available, it’s not clear to me that the corpus is actually measuring citation (i.e., reuse of the data). Instead it may just be measuring publication (e.g., the link betwene a paper and its supplementary data). To answer this we’d need to drill down into the data more.

Note that some data items have large numbers of citations, the highest is “LY294002” with 9983 citations, with the next being “A549” with 5883 citations. LY294002 is a chemical compound that acts as an inhibitor, and A549 is cell type. The citation corpus regards both as accession numbers for sequences(!). Hence it’s likely that the most cited data records are not data at all, but false matches to other entities, such as chemicals and cells. These false matches are still a major problem for the corpus.

Summary

I think there is so much potential here, but there are significant data quality issues. Anyone basing metrics upon this corpus would need to proceed very carefully.

Written with StackEdit.

Tuesday, August 13, 2024

Why do museum and gallery displays ignore the web?

This post is inspired by the Pharaoh exhibition at the NGV in Melbourne, Australia. This is a beautifully displayed exhibition of objects from the British Museum, London. It has all the trappings of a modern exhibition, beautiful lighting, a custom sound track, and lots of social media coverage. But I found it immensely frustrating to visit.

The reason for my frustration is the missed opportunity to provide visitors with the means to learn more from each object than a few cursory sentences on a display card. Take, for example, the “Lintel of King Amenemhat III”, for which we learn:

This lintel was originally placed in a temple erected by Amenemhat III. The carving reflects the harmonious symmetry followed inside an Egyptian temple. At the centre is a cartouche enclosing the king’s birth name. This is surrounded by inscriptions that radiate from the centre to the sides of the lintel. Names of the king face references to Sobek, the god of the temple, who is depicted as a crocodile seated on a shrine.

That is all that we are told. Yet on this display card is a cryptic code “EA1072”, which to most visitors is likely to be no less obscure than the hieroglyphs on the object itself. EA1072 is the number of this object in the British Museum collection. Each code can be converted into a URL by appending it to https://www.britishmuseum.org/collection/object/Y_, i.e., https://www.britishmuseum.org/collection/object/Y_EA1072. If we click on that URL we get a wealth of additional information, including a more detailed description, a bibliography, even an explanatory YouTube video.

So, for each object it would have been trivial for the NGV to include a QR code that would take the visitor to the British Museum’s web site to discover more information about that object, to put the object in context, and learn more about both the object and those who discovered and interpreted it. I’m guessing that most, if not all visitors, had a mobile phone that could read the code and access the internet.

If taking the visitor to the British Museum rather than the NGV’s web site is a problem, why not do the smart thing and reuse the BM’s codes as “slugs” on the end of an NGV URL (much as the BBC did with Wikipedia, see EOL, the BBC, and Wikipedia)? Even better, get the underlying data (does the BM have an API, or a machine-readable version of their web pages) and provide the same information in multiple languages. Melbourne is a modern multicultural city, here was a chance to engage with visitors in languages other than English.

This exhibition seems like an ideal case for the use of persistent identifiers for museums and other collections, something projects such as Towards a National Collection - HeritagePIDs was working towards (see also Persistent Identifiers: A demo and a rant). If we have persistent identifiers, especially if they resolve to machine-readable data, it becomes easy to convert static text into entry points to a much larger digital world of knowledge. Instead we seem happy to give simple snippets of information in one language, and hope the viewer’s interest hasn’t faded away by the time they exit via the gift shop.

Written with StackEdit.

Tuesday, July 02, 2024

A future for the Biodiversity Heritage Library

Following the 2024 BHL meeting, and the departure of Martin Kalfatovic and the uncertainty the departure of such a pivitol person brings, perhaps it’s time to think about the future of BHL. Below I sketch some thoughts, which are hazy at best. I should say at the outset that I think BHL is an extraordinary project. My goal is to think about ways to enhance its utility and impact.

Three facets

I think BHL, in common with other projects such as GBIF, has three main facets: providers, users, and developers. These communities have different needs, and what works for one community need not work for the others.

Providers

Any project that mobilises data depends on people and organisations that have that data being willing to share it. That community needs a rationale for sharing, tools to share, and a means to demonstrate the value of sharing. The few BHL meetings I’ve been to have been dominated by libraries (it is a library project, after all). BHL meetings typically feature a tour of physical libraries where we gaze at ancient books, many of which are now accessible via the BHL website. There is value in being a member of a club that shares similar goals (making biodiversity literature accessible to a wider audience). From my perspective, a lot of BHL effort and infrastructure is focussed on libraries and library-related tasks. This is natural given its origins, but this means other aspects have been neglected.

Users (readers and more)

BHL users are likely diverse, and range from people like me who want the “hard core” technical literature (e.g., species descriptions) to people who revel in the wealth of imagery available in BHL (AKA “the pretty”) (see the BHL Flickr pages).

The current BHL portal provides a way for people to browse the scanned content, but feels designed primarily for librarians. It is organised by title and scanned volumes, hence it is driven by bibliographic metadata. For a long time, it didn’t support the notion of an “article”, which is why I ended up building BioStor to extract and display individual articles (the unit most academics work with). BHL is now actively adding articles and minting DOIs for articles, which helps embed its content in the wider scholarly landscape. To date these new DOI have been cited 56,000 times.

But the current BHL interface is not ideal for viewing articles. We need something simpler and cleaner, and more like the experience offered by modern journal websites.

Developers and data wranglers

I’m lumping developers and data wranglers together, even though these people may have different goals, they share the desire to get past the web interface to the underlying data. BHL has some great APIs that I and others make extensive use of. But this is different from providing a clean interface to the data. BHL has a wealth of information linked to taxonomic names, people, places, and more. Taxonomic indexing by Global Names has made BHL content much more findable, but there is huge scope for indexing on other features. For example, BioStor extracts latitude and longitude pairs from BHL text. These are shown on the map below, indicating the scope for geographic search in BHL.

What’s next?

I think there’s a case to be made to provide three separate interfaces to BHL.

The first would be for the providers (e.g., libraries), which includes all the behind the scenes infrastructure to do with cataloging, etc., and would also include the current portal. The existing BHL interface is important both to show the complete corpus, and also as a place for serendipitous discovery.

The second interface would be for readers. The obvious candidate here is Open Journal Systems (OJS) which powers many journal sites, including Zootaxa, by far the largest taxonomic journal. Indeed I would argue that BHL should adopt OJS and offer it as a service to existing biodiversity journals that may be struggling to manage their existing publishing. Taxonomic publishing has a very long tail of small journals, as the figure below shows (taken from DNA barcoding and taxonomy: dark taxa and dark texts).

This long tail is often hosted on all manner of custom web sites including Word Press blogs, none of which are ideal. There is an opportunity here for BHL to offer hosting as a, for example, an affordable service, using the same OJS infrastructure it would use to display BHL articles.

The final interface would be a data portal. The goal here is to enable people to retrieve data in ways that they find useful, for example by taxon, geographic location, etc. In an ideal world this might be a knowledge graph, but the gap between what knowledge graphs promise and what they deliver is still significant. As a first pass, probably the way forward is to define a series of simple data objects in JSON, load these into Elasticsearch and provide an API on top. This is essentially what GBIF does, where the data is in Darwin Core and the queries are searches over that data. This same infrastructure could also power searches over the articles in OJS, so that users could easily find the content they want.

This is all pretty arm-wavy at this point, but I think BHL needs to be more outwards facing than it currently is, and needs to think how best to serve the biodiversity community (many of which are already huge fans of BHL), as well as think of ways to enhance its long term sustainability.

Written with StackEdit.

Wednesday, June 19, 2024

Visualising big trees: a talk at the Systematics Association 2024

This blog post has some notes in support of a talk given to the Systematics Association meeting in Reading June 20th, 2024.

Slides

I will post a link to the slides here once I have given the talk.

Page, Roderic (2024). Visualising big trees. figshare. Presentation. https://doi.org/10.6084/m9.figshare.26068693.v1

Example web sites

Demos

Kew phylogeny

NCBI

Catalogue of Life

Background reading

Written with StackEdit.

Tuesday, June 18, 2024

Nanopubs, a way to create even more silos

Pensoft have recently introduced “nanopubs”, small structured publications that can be thought of as containing the minimum possible statement that could be published.

Nanopublications are the smallest units of publishable information: a scientifically meaningful assertion about anything that can be uniquely identified and attributed to its author and serve to communicate a single statement, its original source (provenance) and citation record (publication info). Nanopublications are fully expressed in a way that is both human-readable and machine-interpretable. For more, see https://nanopub.net, Pensoft blog, this video and on our website. Nanopublications

Nanopubs are promoted as FAIR, that is findable, accessible, interoperabile, and reusable. I like the idea of nanopubs, but the examples I have seen so far are problematic. As an aside, there are reasons not to be optimistic about nanopubs (or text-mining in general), see The Business of Extracting Knowledge from Academic Publications.

I’m going to focus on one nanopub RAXCvEZfCc, which comes from the paper Towards computable taxonomic knowledge: Leveraging nanopublications for sharing new synonyms in the Madagascan genus Helictopleurus (Coleoptera, Scarabaeinae). This nanopub says that Helictopleurus dorbignyi Montreuil, 2005 is a subjective synonym of Helictopleurus halffteri Balthasar, 1964.

In other words,

This seems a fairly simple thing to say, indeed we could say it with a single triple, but the corresponding nanopub requires 33 RDF triples to say this.

<https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasAssertion> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasProvenance> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasPublicationInfo> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.nanopub.org/nschema#Nanopublication> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://w3id.org/biolink/vocab/OrganismTaxonToOrganismTaxonAssociation> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <http://www.w3.org/2000/01/rdf-schema#comment> "Subjective synonymy based on morphological comparison of the type specimens of the two species names" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/object> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#objtaxon> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/predicate> <http://purl.obolibrary.org/obo/NOMEN_0000285> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/subject> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#subjtaxon> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#objtaxon> <https://w3id.org/kpxl/biodiv/terms/hasTaxonName> <https://www.checklistbank.org/dataset/9880/taxon/3K9T4> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#subjtaxon> <https://w3id.org/kpxl/biodiv/terms/hasTaxonName> <https://www.checklistbank.org/dataset/9880/taxon/3K9ST> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://rs.tdwg.org/dwc/terms/basisOfRecord> <http://rs.tdwg.org/dwc/terms/PreservedSpecimen> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://www.w3.org/ns/prov#wasAttributedTo> <https://orcid.org/0000-0002-1938-6105> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://www.w3.org/ns/prov#wasDerivedFrom> <https://arpha.pensoft.net/preview.php?document_id=22521> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasAlgorithm> "RSA" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasPublicKey> "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCnFtZQdjMpPH4duOBwDybRdPo93QCanFGN8cnpyHqZRQ+FINXypUYCNRSx3VBaWZoLVB/CYCoMY0or/oxBQwl5N7Y/8Ebj+G9ZSNsSkM9uo2DL91f26Y1y2UDE7bnajG909kXQnJS1G59cqIaKyLInjMFD5vWnptysj/ljBv3NTwIDAQAB" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasSignature> "YzTUmwGRmqHiJVyU1A6rPI1bHbAJPS+Zw6hnDPWzZ9a/7TP+yM/HAf5E9BTS3HNKaCgLAHSnsRg5Q0lPauYQyJd9tbLzR6VU/WJv399Z7/qrn4EhgCULkIhrCAkuWzRtSyHMEbuzyu51ZSQCCPgMZ3HwpVtRa+gVDgqu3nsi5x4=" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasSignatureTarget> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/created> "2023-12-24T06:24:14.480Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/creator> <https://orcid.org/0000-0002-1938-6105> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/license> <https://creativecommons.org/licenses/by/4.0/> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/hasNanopubType> <http://purl.obolibrary.org/obo/NOMEN_0000017> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/hasNanopubType> <https://w3id.org/kpxl/biodiv/terms/BiodivNanopub> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/introduces> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://w3id.org/kpxl/biodiv/terms/BiodivNanopub> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/2000/01/rdf-schema#label> "Helictopleurus dorbignyi Montreuil, 2005 (species) - ICZN subjective synonym - Helictopleurus halffteri Balthasar, 1964 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromProvenanceTemplate> <http://purl.org/np/RAYfEAP8KAu9qhBkCtyq_hshOvTAJOcdfIvGhiGwUqB-M> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAA2MfqdBCzmz9yVWjKLXNbyfBNcwsMmOqcNUxkk1maIM> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAR40PzxS9rmUC2lH2ct7IlYhyEib-3GXY5DkuR8wgHRw> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAh1gm83JiG5M6kDxXhaYT1l49nCzyrckMvTzcPn-iv90> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromTemplate> <http://purl.org/np/RAf9CyiP5zzCWN-J0Ts5k7IrZY52CagaIwM-zRSBmhrC8> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://www.checklistbank.org/dataset/9880/taxon/3K9ST> <https://w3id.org/np/o/ntemplate/hasLabelFromApi> "Helictopleurus dorbignyi Montreuil, 2005 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://www.checklistbank.org/dataset/9880/taxon/3K9T4> <https://w3id.org/np/o/ntemplate/hasLabelFromApi> "Helictopleurus halffteri Balthasar, 1964 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> .

In part this is because it includes cryptographic signing, presumably to ensure that the statement is what you think it is. There is also a plethora of information about how the nanopublication was derived. Presumably, this is to satisfy reproducibility concerns. But none of this matters if you are producing data that people can’t easily use.

The core statement looks like this:

This graph is saying that there is a triple

By itself this isn’t terribly useful because neither of the two taxa are “things” that have identifiers, they are blank nodes. So, what is the statement about? If we follow the biodiv:hasTaxonName links, we see that there are names associated with these taxa (Helictopleurus dorbignyi, and Helictopleurus halffteri), and these are linked to records in a database in ChecklistBank. This seems complicated, but I assume it is equivalent to saying “in this publication we regard taxa with the names Helictopleurus dorbignyi, and Helictopleurus halffteri to be the same thing”.

Interoperablity

I feel that I have been banging this drum for years now, but you cannot have interoperability unless you use the same identifiers for the same things. That means persistent identifiers, identifiers that you have some confidence will be around in ten, 20, or 50 years (at least).

Leaving aside whatever the persistence of the nanopubs themselves, I find it alarming that the link to the source of the statement that these two names are synonyms is not the DOI for the paper 10.3897/BDJ.12.e120304, but a link to the publishing platform ARPHA: https://arpha.pensoft.net/preview.php?document_id=22521. This link takes me to a login page, not the actual publication, so I can’t retrieve the source of the statement made in the nanopublication using the nanopublication itself.

The taxon names have as their identifiers https://www.checklistbank.org/dataset/9880/taxon/3K9T4 and https://www.checklistbank.org/dataset/9880/taxon/3K9ST. These identifiers are also local to a particular dataset. Why not use identifiers such as the Catalogue of Life entries for these names (i.e., e.g. https://www.catalogueoflife.org/data/taxon/3K9T4, which supports RDF via embedded JSON-LD) or even LSIDs? We have urn:lsid:organismnames.com:name:2521540 for Helictopleurus halffteri and urn:lsid:organismnames.com:name:1770738 for Helictopleurus dorbignyi.

Interestingly, the one well-known external identifier linked to is the ORCID for the author of the nanopub, 0000-0002-1938-6105). I can’t help think that this suggests that authorship of the nanopublication is more important than the fact it publishes.

One can imagine that nanopublications will be registered with authors’ ORCID profiles, which helps flesh out their online CV. This is nice, but where is the equivalent for linking the publication to the nanopub via its DOI, or the taxon names to the nanopub? How do we know whether these nanopubs contradict other nanopubs, or support them, or add new information? For example, there seems to be no way to go from the DOI for the paper to the nanopub.

Vocabulary

Another aspect of interoperability is using the same terms to describe relationships. I’m struck by how many different vocabularies the nanopub requires. Some of these are specific to the administrivia of the nanopub, but others are biological.

For example, http://purl.obolibrary.org/obo/NOMEN_0000285 is used to define the relation between. I confess it’s unclear to me why NOMEN_0000285 isn’t used to directly link the two ChecklistBank records, rather than the indirection via #subjtaxon and #objtaxon, given that is a relationship between names (isn’t it?).

Other ontologies include Biolink-Model and biodiv which I can’t seem to find a description of (the URL resolves to queries on the nanodash site). It amazes me how readily people create new ontologies, especially as in the wider world there is a trend towards one vocabuary to rule them all (schema.org).

Summary

I find it disheartening that the bulk of the information in a nanopub is administrivia about that nanopub. I understand the desire to establish provenance and to cryptographically sign the information, but all this is of limited use if the actual scientific information is poorly expressed.

If nanopubs are to be useful I think they need to:

  • Use persistent identifiers for every entity being referred to, ideally using existing, well-known identifiers. If you are referring to a publication that has a DOI, use that DOI. If you are referring to a taxon or a taxon name, use an appropriate identifier (e.g., an LSID for the name, a URL to a classification).

  • Use simple, existing vocabularies wherever possible. Can you model the data using schema.org (and extensions such as Bioschemas). If not, are you sure you can’t?

Unless more care is taken, nanopubs will go the way of much of the RDF world, creating new, even more verbose, even more arcane silos of data. This is partly a consequence of the primary incentive, which is to publish minimal units of information. Given that we now have persistent identifiers for people (ORCIDs) and those identifiers are linked to an infrastructure that can automatically register publications linked to ORCIDs, can we expect to see a flood of nanopubs? What vaue will these have if we can’t make ready use of the “facts” they assert? How will people build tools on top of nanopubs if the only thing that reliably links to the external world is the ORCID of the person who created it.

Written with StackEdit.

Friday, April 19, 2024

Notes on transforming BHL images

How to cite: Page, R. (2024). Notes on transforming BHL images https://doi.org/10.59350/2gpbb-98a53

I’ve been down this road before, e.g. BHL, DjVu, and reading the f*cking manual and Demo of full-text indexing of BHL using CouchDB hosted by Cloudant, but I’m revisiting converting BHL page scans to black and white images, partly to clean them up, to make them closer to what a modern reader might expect, and partly to reduce the size of the image. The latter means faster loading times and smaller PDFs for articles.

The links above explored using foreground image layers from DjVu (less useful now that DjVu is almost dead as a format), and using CSS in web browsers to convert a colour image to gray scale. I’ve also experimented with the approach taken by Google Books (see https://github.com/rdmpage/google-book-images), which uses jbig2enc to compress images and reduce the number of colours.

In my latest experiments, I use jbig2enc to transform BHL page images into black and white images where each pixel is either black or white (i.e., image depth = 1), then use ImageMagick to resize the image to the Google Books width of 685 pixels and a depth of 2. Typically this gives an image around 25Kb - 30Kb in size. It looks clean and readable.

This approach breaks down for photographs and especially colour plates. For example, this image looks horrible:

When compressing images that have photos or illustrations jbig2enc can extract the part of the image that includes the illustration, for example:

This isn’t perfect, but it raises the possibility that we can convert text and line drawings to black and white, and then add back photographs and plates (whether black or white, or colour). After some experimentation using tools such as ImageMagick composite I have a simple workflow:

  • compress page image using jbig2enc
  • take the extracted illustration and set all white pixels to be transparent
  • convert the black and white image output by jbig2enc to colour (required for the next step)
  • create a composite image by overlaying the extracted illustration (now on a transparent background) on top of the black-and-white page image

The result looks passable:

In this case, we still have a lot of the sepia-toned background, the illustration hasn’t been cleanly separated, but we do at least get some colour.

Still work to do, but it looks promising and suggests a way to make dramatically smaller PDFs of BHL content. There are crude code and example files in GitHub.

Update

Some Googling turned up Removing orange tint-mask from color-negatives, which gives us the following command:

convert 16281585.jpg -negate -channel all -normalize -negate -channel all 16281585-rgb.jpg

Applying this to our image results in:

This looks a lot better. Results will vary depending on the eveness of the page scan (i.e., is there a shadow on the image), but I think this gives us a way to display the plates with a higher degree of contrast.

Reading

Adam Langley, Dan S. Bloomberg, “Google Books: making the public domain universally accessible”, Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000H (2007/01/29); doi:10.1117/12.710609

Written with StackEdit.

Wednesday, March 27, 2024

Hugging Face Autotrain

How to cite: Page, R. (2024). Hugging Face Autotrain https://doi.org/10.59350/7p1n4-wdv84

These are notes to myself on using Hugging Face AutoTrain. The first version of this had a very nice interface where you could simply upload a folder of images and train a model. It was limited in the range of tasks and models, but made up for that in ease of use. Now AutoTrain has been replaced by AutoTrain Advanced, which not everyone is happy about.

Training a model

After a bit of fussing about (and paying attention to the log messages) I’ve managed to train a model to classify images in much the same way as before. The steps are as follows:

Go to AutoTrain Advanced. You should see a screen like this:

By default Docker and AutoTrain are selected. It will also show the free hardware spec (CPU basic • 2 vCPU • 16GB). I found that for image classification this hardware choice would cause AutoTrain to fail, so I selected Nvidia T4 small • 4 vCPU • 15GB.

Give your space a name and click on Create Space to create the space. You will now see something like this:

It took 3-4 minutes to build the space. Once the space is built you will then be asked to log in to Hugging Face (seems odd, but that’s what it asks you to do). You are then asked to give your space permissions to connect to your account.

Now you will see a slightly scary looking interface (this is one reason why people miss the old “easy” AutoTrain).

For Task I selected Image Classification and the default base model (google/vit-base-patch16-224). I ignored every other setting, and simply uploaded the training data. This was a zip file containing separate folders for each category of image, so that images, say of cats, would be in a folder called cats, pictures of dogs would be in dogs, etc.

I then clicked Start and after a warning that this would cost money (I subscribe to Hugging Face)saw this:

You can track progress in the logs, which you can see using the middle of the buttons below.

Once completed, the space pauses, which is a little alarming but simply means that it has finished training. Yay, you now have a trained model!

When I first tried this, I got errors because I didn’t upload the data in the proper format (my zip file had a folder that contained the training data folders, it needs the folders to be in the root of the zip archive). It also failed to train on the base (free) hardware, I only discovered this by looking at the logs and see error messages regarding the lack of a GPU.

What now?

The other thing about the original AutoTrain was that it gave you an app to explore how you model worked on other data. The new AutoTrain simply pauses after training and you are left with “um, what do I do now?”

After some fussing I discovered that in my profile I now had a brand new Model appearing in my list of models.

If I click on the model I go to the model page, where there is a Deploy button, this is how you get an app. First though, make sure your model is publicly visible (by default it is private). Click on Settings and go to the Change model visibility to make it public. If you now click on the Deploy button you will see a list of options:

I picked Spaces. This enables you to create a simple online app. I accepted all the defaults (including the base, free hardware with no GPU) and in a couple of minutes you get a app that looks like this:

Upload an image, press Submit and you will get a classification of that image:

Apps tend to sleep, so it may be that you come back to an app, load and image, and get an error message that the model is still loading. Wait a moment, try again, and it should work.

API

Using the app is fun, but if you wasn’t to use the model to classify lots of images then you want to use the API. The Deploy button lists `Inferences API (serverless) as an option. Clicking on that gives you the URL you can to POST images to, it will return the results in JSON. As with the app, if the model is sleeping then your first call may through an error, typically wait a moment and try again, and then you can classify images in bulk.

Summary

Hugging Face is quite an extraordinary tool, and it is a way to try and make sense of the xplosiuon of AI techniques available. But it is clearly written by developers for developers, and that can make it intimidating, even for someone like me who writes code, uses GitHub, etc. The original AutoTrain was a joy to use in comparison, and this feels like a missed opportunity where Hugging Face could have keep both the old "easy" version alongside the new, more powerful, but rather clunkier "advanced" version. Still, this is easier than dealing directly with the hellscape that is Python.

Written with StackEdit.