Friday, May 27, 2022

Round trip from identifiers to citations and back again

Note to self (basically rewriting last year's Finding citations of specimens).

Bibliographic data supports going from identifier to citation string and back again, so we can do a "round trip."


Given a DOI we can get structured data with a simple HTTP fetch, then use a tool such as citation.js to convert that data into a human-readable string in a variety of formats.

Identifier: 10.7717/peerj-cs.214
  → HTTP with content negotiation →
Structured data: CSL-JSON
  → CSL templates →
Human-readable string: Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214.
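As a rough sketch of this forward direction in Python (rather than citation.js): the request asks doi.org for CSL-JSON via the Accept header, and a deliberately crude formatter stands in for a real CSL processor. The function names and the hand-written SAMPLE record are my own illustrations, and the record only includes the fields the formatter needs.

```python
import json
import urllib.request

CSL_JSON = "application/vnd.citationstyles.csl+json"

def csl_request(doi):
    """Build (but do not send) a content-negotiation request for a DOI."""
    return urllib.request.Request(f"https://doi.org/{doi}",
                                  headers={"Accept": CSL_JSON})

def fetch_csl(doi):
    """Resolve the DOI and return its CSL-JSON record as a dict (needs network)."""
    with urllib.request.urlopen(csl_request(doi)) as response:
        return json.load(response)

def format_citation(csl):
    """Crude APA-like rendering; a real CSL processor handles far more cases."""
    names = []
    for author in csl.get("author", []):
        initials = " ".join(word[0] + "." for word in author["given"].split())
        names.append(f"{author['family']}, {initials}")
    year = csl["issued"]["date-parts"][0][0]
    return (f"{', '.join(names)} ({year}). {csl['title']}. "
            f"{csl['container-title']}, {csl['volume']}, {csl['page']}.")

# Hand-written subset of a CSL-JSON record, for illustration only.
SAMPLE = {
    "author": [{"family": "Willighagen", "given": "Lars G."}],
    "issued": {"date-parts": [[2019]]},
    "title": ("Citation.js: a format-independent, modular bibliography "
              "tool for the browser and command line"),
    "container-title": "PeerJ Computer Science",
    "volume": "5",
    "page": "e214",
}

print(format_citation(SAMPLE))
```

With network access, format_citation(fetch_csl("10.7717/peerj-cs.214")) would follow the same path, although live records may differ in detail from the sample.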


Going in the reverse direction (string to identifier) is a little more challenging. In the "old days" a typical strategy was to parse the citation string into structured data (see AnyStyle for a nice example of this), extract a triple of (journal, volume, starting page), and then query CrossRef to see whether there was an article matching that triple, which gave us the DOI.

Identifier: 10.7717/peerj-cs.214
  ← OpenURL query ←
Structured data: journal, volume, start page
  ← Citation parser ←
Human-readable string: Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214.
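A minimal sketch of the parse-then-query route, assuming an APA-like string that ends in "Journal, volume, page." The regular expression is a toy stand-in for a real parser such as AnyStyle. The CrossRef OpenURL endpoint and its parameter names (pid, title, volume, spage, redirect) are written from memory and should be checked against CrossRef's documentation; the email address is a placeholder.

```python
import re
from urllib.parse import urlencode

# Tail of an APA-like citation: "... title. Journal, volume(issue), page-range."
TAIL = re.compile(
    r"\.\s+(?P<journal>[^.,]+),\s*(?P<volume>\d+)(?:\(\d+\))?,"
    r"\s*(?P<spage>\w+)(?:[-–]\w+)?\.?\s*$"
)

def parse_triple(citation):
    """Return (journal, volume, start page), or None if the pattern fails."""
    match = TAIL.search(citation)
    if match is None:
        return None
    return (match["journal"].strip(), match["volume"], match["spage"])

def openurl_query(journal, volume, spage, pid="you@example.org"):
    """Build a CrossRef OpenURL lookup URL (endpoint and parameters assumed)."""
    params = {
        "pid": pid,           # CrossRef asks for a contact email here
        "title": journal,
        "volume": volume,
        "spage": spage,
        "redirect": "false",  # ask for metadata rather than a redirect
    }
    return "https://doi.crossref.org/openurl?" + urlencode(params)

citation = ("Willighagen, L. G. (2019). Citation.js: a format-independent, "
            "modular bibliography tool for the browser and command line. "
            "PeerJ Computer Science, 5, e214.")
triple = parse_triple(citation)
print(triple)
print(openurl_query(*triple))
```

The weak point is the parser: any citation style the pattern does not anticipate yields None, which is part of why this route is more challenging than the forward one.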


Another strategy is to take all the citation strings for each DOI, index those in a search engine, then use a simple search to find the best match to your citation string, and hence the DOI. This is what does.

Identifier: 10.7717/peerj-cs.214
  ← search ←
Human-readable string: Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214.
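To make the search idea concrete, here is a toy version that replaces the search engine with a token-overlap (Dice) score over an in-memory dictionary. It is only a sketch of the principle, not how any production service works, and the second index entry (DOI and citation) is invented purely to give the search something to reject.

```python
import re
from collections import Counter

def tokens(text):
    """Lower-cased alphanumeric tokens with their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def dice(a, b):
    """Dice coefficient between two token multisets (0 = disjoint, 1 = same)."""
    total = sum(a.values()) + sum(b.values())
    return 2 * sum((a & b).values()) / total if total else 0.0

def best_match(query, index):
    """index maps DOI -> citation string; return (doi, score) of the best hit."""
    q = tokens(query)
    scores = {doi: dice(q, tokens(text)) for doi, text in index.items()}
    doi = max(scores, key=scores.get)
    return doi, scores[doi]

INDEX = {
    "10.7717/peerj-cs.214": (
        "Willighagen, L. G. (2019). Citation.js: a format-independent, "
        "modular bibliography tool for the browser and command line. "
        "PeerJ Computer Science, 5, e214."),
    # Invented entry, for illustration only.
    "10.9999/example.1": (
        "Doe, J. (2020). An invented paper about tree search. "
        "Journal of Examples, 1, 1-10."),
}

# The same paper cited in a different style.
query = ("Willighagen LG. Citation.js: a format-independent, modular "
         "bibliography tool for the browser and command line. "
         "PeerJ Comput Sci 2019;5:e214")
print(best_match(query, INDEX))
```

No parsing is needed: the differently formatted query still shares most of its tokens with the indexed string, so the right DOI wins without the string ever being broken into fields.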

At the moment my work on material citations (i.e., lists of specimens in taxonomic papers) is focussing on 1 (generating citations from specimen data in GBIF) and 2 (parsing citations into structured data).

Wednesday, May 11, 2022

Thoughts on TreeBASE dying(?)

So it looks like TreeBASE is in trouble, its legacy Java code a victim of security issues. Perhaps this is a chance to rethink TreeBASE, assuming that a repository of published phylogenies is still considered a worthwhile thing to have (and I think that question is open).

Here's what I think could be done.

  1. The data (individual studies with trees and data) are packaged into whatever format is easiest (NEXUS, XML, JSON) and uploaded to a repository such as Zenodo for long term storage. They get DOIs for citability. This becomes the default storage for TreeBASE.
  2. The data is transformed into JSON and indexed using Elasticsearch. A simple web interface is placed on top so that people can easily find trees (never a strong point of the original TreeBASE). Trees are displayed natively on the web using SVG. The number one goal is for people to be able to find trees, view them, and download them.
  3. To add data to TreeBASE the easiest way would be for people to upload them direct to Zenodo and tag them "treebase". A bot then grabs a feed of these datasets and adds them to the search engine in (2) above. As time allows, add an interface where people upload data directly, it gets curated, then deposited in Zenodo. This presupposes that there are people available to do curation. Maybe have "stars" for the level of curation so that users know whether anyone has checked the data.
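The "transform into JSON and index" step might look something like the sketch below. It assumes trees arrive as simple Newick strings (no quoted labels or comments), pulls out the leaf labels so a search engine can match on taxon names, and writes the newline-delimited action/document pairs that Elasticsearch's bulk API accepts. The field names, the index name "trees", and the study identifier and DOI in the example are all made up for illustration.

```python
import json
import re

def leaf_labels(newick):
    """Leaf labels of a simple Newick string (no quoted labels or comments)."""
    # A leaf label directly follows "(" or ","; internal labels follow ")".
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def tree_document(study_id, doi, title, newick):
    """Flatten one tree into a JSON-ready document for indexing."""
    labels = leaf_labels(newick)
    return {
        "study": study_id,
        "doi": doi,
        "title": title,
        "newick": newick,
        "taxa": [label.replace("_", " ") for label in labels],
        "ntax": len(labels),
    }

def bulk_lines(docs, index="trees"):
    """Newline-delimited action/document pairs for a bulk indexing request."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["study"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Hypothetical study: the identifier, DOI, and title are placeholders.
doc = tree_document(
    "S0000",
    "10.5281/zenodo.0000000",
    "An example study",
    "((Homo_sapiens:0.1,Pan_troglodytes:0.1)Hominini:0.2,Gorilla_gorilla:0.3);",
)
print(bulk_lines([doc]))
```

Keeping the original Newick string in the document means the web interface can still draw or download the tree, while the flattened "taxa" list is what makes it findable.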

There are lots of details to tweak, for example how many of the existing URLs for studies are preserved (some URL mapping), and what about the API? And I'm unclear about the relationship with Dryad.

My sense is that the TreeBASE code is very much of its time (10-15 years ago), a monolithic block of code with SQL, Java, etc. If one was starting from scratch today I don't think this would be the obvious solution. Things have trended towards being simpler, with lots of building blocks now available in the cloud. Need a search engine? Just spin up a container in the cloud and you have one. More and more functionality can be devolved elsewhere.

Another issue is how to support TreeBASE. It has essentially been a volunteer effort to date, with little or no funding. One reason I think having Zenodo as a storage engine makes sense is that it takes care of long term sustainability of the data.

I realise that this is all wild arm waving, but maybe now is the time to reinvent TreeBASE?


It's been a while since I've paid a lot of attention to phylogenetic databases, and it shows. There is a file-based storage system for phylogenies, phylesystem (see "Phylesystem: a git-based data store for community-curated phylogenetic estimates"), that is sort of what I had in mind, although long term persistence is based on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly horrible-looking JSON transformation of NeXML (NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of NeXML or a JSON transformation seems the obvious storage format. It will probably need some cleaning and simplification if it is to be indexed easily. Looking back over the long history of TreeBASE and phylogenetic databases I'm struck by how much complexity has been introduced over time. I think the tech has gotten in the way sometimes (which might just be another way of saying that I'm not smart enough to make sense of it all).

So we could imagine a search engine that covers both TreeBASE and Open Tree of Life studies.

Basic metadata-based searches would be straightforward, and we could have a user interface that highlights the trees (I think TreeBASE's biggest search rival is a Google image search). The harder problem is searching by tree structure, for which there is an interesting literature without any decent implementations that I'm aware of (as I said, I've been out of this field a while).

So my instinct is we could go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need to think a bit more cleverly about higher taxon and tree based searching. I've always thought that one killer query would be not so much "show me all the trees for my taxon" but "show me a synthesis of the trees for my taxon". Imagine a supertree of recent studies that we could use as a summary of our current knowledge, or a visualisation that summarises where there are conflicts among the trees.
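As one small ingredient of such a summary, here is a sketch of how conflicts between two rooted trees could be detected. It uses the standard cluster compatibility test: two clades conflict if they share taxa but neither contains the other. Trees are written as nested tuples to sidestep parsing, and the function names are mine; this is a toy illustration, not a supertree method.

```python
def clusters(tree):
    """Set of clades (frozensets of leaf names) in a nested-tuple rooted tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):  # a leaf
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found

def conflicts(tree_a, tree_b):
    """Pairs of clades, one per tree, that overlap without being nested."""
    ca, cb = clusters(tree_a), clusters(tree_b)
    # Compare only on taxa present in both trees.
    shared = frozenset().union(*ca) & frozenset().union(*cb)
    ra = {clade & shared for clade in ca}
    rb = {clade & shared for clade in cb}
    return sorted(
        (sorted(a), sorted(b))
        for a in ra
        for b in rb
        if a & b and not a <= b and not b <= a
    )

tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
for clade_1, clade_2 in conflicts(tree_1, tree_2):
    print(clade_1, "conflicts with", clade_2)
```

Clades that never appear in a conflict are candidates for the "what we agree on" part of a summary, and the conflicting pairs are the places a visualisation would want to highlight.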

Relevant code and sites