Wednesday, June 19, 2024

Visualising big trees: a talk at the Systematics Association 2024

This blog post has some notes in support of a talk given to the Systematics Association meeting in Reading June 20th, 2024.


I will post a link to the slides here once I have given the talk.

Page, Roderic (2024). Visualising big trees. figshare. Presentation.

Example web sites


Kew phylogeny


Catalogue of Life

Background reading

Written with StackEdit.

Tuesday, June 18, 2024

Nanopubs, a way to create even more silos

Pensoft have recently introduced “nanopubs”, small structured publications that can be thought of as containing the minimum possible statement that could be published.

Nanopublications are the smallest units of publishable information: a scientifically meaningful assertion about anything that can be uniquely identified and attributed to its author and serve to communicate a single statement, its original source (provenance) and citation record (publication info). Nanopublications are fully expressed in a way that is both human-readable and machine-interpretable. For more, see, Pensoft blog, this video and on our website. Nanopublications

Nanopubs are promoted as FAIR, that is findable, accessible, interoperabile, and reusable. I like the idea of nanopubs, but the examples I have seen so far are problematic. As an aside, there are reasons not to be optimistic about nanopubs (or text-mining in general), see The Business of Extracting Knowledge from Academic Publications.

I’m going to focus on one nanopub RAXCvEZfCc, which comes from the paper Towards computable taxonomic knowledge: Leveraging nanopublications for sharing new synonyms in the Madagascan genus Helictopleurus (Coleoptera, Scarabaeinae). This nanopub says that Helictopleurus dorbignyi Montreuil, 2005 is a subjective synonym of Helictopleurus halffteri Balthasar, 1964.

In other words,

This seems a fairly simple thing to say, indeed we could say it with a single triple, but the corresponding nanopub requires 33 RDF triples to say this.

<> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> "Subjective synonymy based on morphological comparison of the type specimens of the two species names" <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> "RSA" <> . <> <> "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCnFtZQdjMpPH4duOBwDybRdPo93QCanFGN8cnpyHqZRQ+FINXypUYCNRSx3VBaWZoLVB/CYCoMY0or/oxBQwl5N7Y/8Ebj+G9ZSNsSkM9uo2DL91f26Y1y2UDE7bnajG909kXQnJS1G59cqIaKyLInjMFD5vWnptysj/ljBv3NTwIDAQAB" <> . <> <> "YzTUmwGRmqHiJVyU1A6rPI1bHbAJPS+Zw6hnDPWzZ9a/7TP+yM/HAf5E9BTS3HNKaCgLAHSnsRg5Q0lPauYQyJd9tbLzR6VU/WJv399Z7/qrn4EhgCULkIhrCAkuWzRtSyHMEbuzyu51ZSQCCPgMZ3HwpVtRa+gVDgqu3nsi5x4=" <> . <> <> <> <> . <> <> "2023-12-24T06:24:14.480Z"^^<> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> "Helictopleurus dorbignyi Montreuil, 2005 (species) - ICZN subjective synonym - Helictopleurus halffteri Balthasar, 1964 (species)" <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> <> <> . <> <> "Helictopleurus dorbignyi Montreuil, 2005 (species)" <> . <> <> "Helictopleurus halffteri Balthasar, 1964 (species)" <> .

In part this is because it includes cryptographic signing, presumably to ensure that the statement is what you think it is. There is also a plethora of information about how the nanopublication was derived. Presumably, this is to satisfy reproducibility concerns. But none of this matters if you are producing data that people can’t easily use.

The core statement looks like this:

This graph is saying that there is a triple

By itself this isn’t terribly useful because neither of the two taxa are “things” that have identifiers, they are blank nodes. So, what is the statement about? If we follow the biodiv:hasTaxonName links, we see that there are names associated with these taxa (Helictopleurus dorbignyi, and Helictopleurus halffteri), and these are linked to records in a database in ChecklistBank. This seems complicated, but I assume it is equivalent to saying “in this publication we regard taxa with the names Helictopleurus dorbignyi, and Helictopleurus halffteri to be the same thing”.


I feel that I have been banging this drum for years now, but you cannot have interoperability unless you use the same identifiers for the same things. That means persistent identifiers, identifiers that you have some confidence will be around in ten, 20, or 50 years (at least).

Leaving aside whatever the persistence of the nanopubs themselves, I find it alarming that the link to the source of the statement that these two names are synonyms is not the DOI for the paper 10.3897/BDJ.12.e120304, but a link to the publishing platform ARPHA: This link takes me to a login page, not the actual publication, so I can’t retrieve the source of the statement made in the nanopublication using the nanopublication itself.

The taxon names have as their identifiers and These identifiers are also local to a particular dataset. Why not use identifiers such as the Catalogue of Life entries for these names (i.e., e.g., which supports RDF via embedded JSON-LD) or even LSIDs? We have for Helictopleurus halffteri and for Helictopleurus dorbignyi.

Interestingly, the one well-known external identifier linked to is the ORCID for the author of the nanopub, 0000-0002-1938-6105). I can’t help think that this suggests that authorship of the nanopublication is more important than the fact it publishes.

One can imagine that nanopublications will be registered with authors’ ORCID profiles, which helps flesh out their online CV. This is nice, but where is the equivalent for linking the publication to the nanopub via its DOI, or the taxon names to the nanopub? How do we know whether these nanopubs contradict other nanopubs, or support them, or add new information? For example, there seems to be no way to go from the DOI for the paper to the nanopub.


Another aspect of interoperability is using the same terms to describe relationships. I’m struck by how many different vocabularies the nanopub requires. Some of these are specific to the administrivia of the nanopub, but others are biological.

For example, is used to define the relation between. I confess it’s unclear to me why NOMEN_0000285 isn’t used to directly link the two ChecklistBank records, rather than the indirection via #subjtaxon and #objtaxon, given that is a relationship between names (isn’t it?).

Other ontologies include Biolink-Model and biodiv which I can’t seem to find a description of (the URL resolves to queries on the nanodash site). It amazes me how readily people create new ontologies, especially as in the wider world there is a trend towards one vocabuary to rule them all (


I find it disheartening that the bulk of the information in a nanopub is administrivia about that nanopub. I understand the desire to establish provenance and to cryptographically sign the information, but all this is of limited use if the actual scientific information is poorly expressed.

If nanopubs are to be useful I think they need to:

  • Use persistent identifiers for every entity being referred to, ideally using existing, well-known identifiers. If you are referring to a publication that has a DOI, use that DOI. If you are referring to a taxon or a taxon name, use an appropriate identifier (e.g., an LSID for the name, a URL to a classification).

  • Use simple, existing vocabularies wherever possible. Can you model the data using (and extensions such as Bioschemas). If not, are you sure you can’t?

Unless more care is taken, nanopubs will go the way of much of the RDF world, creating new, even more verbose, even more arcane silos of data. This is partly a consequence of the primary incentive, which is to publish minimal units of information. Given that we now have persistent identifiers for people (ORCIDs) and those identifiers are linked to an infrastructure that can automatically register publications linked to ORCIDs, can we expect to see a flood of nanopubs? What vaue will these have if we can’t make ready use of the “facts” they assert? How will people build tools on top of nanopubs if the only thing that reliably links to the external world is the ORCID of the person who created it.

Written with StackEdit.