Monday, August 22, 2022

Linking taxonomic names to the literature

Just some thoughts as I work through some datasets linking taxonomic names to the literature.

In the diagram above I've tried to capture the different situations I encounter. Much of the work I've done on this has focussed on case 1 in the diagram: I want to link a taxonomic name to an identifier for the work in which that name was published. In practice this means linking names to DOIs. This has the advantage of linking to a citable identifier, raising questions such as whether citations of taxonomic papers by taxonomic databases could become part of a taxonomist's Google Scholar profile.

In many taxonomic databases full work-level citations are not the norm; instead, taxonomists cite one or more pages within a work that are relevant to a taxonomic name. These "microcitations" (what the U.S. legal profession refers to as "point citations" or "pincites", see What are pincites, pinpoints, or jump legal references?) take some effort to map to the work itself (which is typically the thing that has a citable identifier such as a DOI).

Microcitations (case 2 in the diagram above) can be quite complex. Some might simply mention a single page, but others might list a series of (not necessarily contiguous) pages, as well as figures, plates, etc. Converting these to citable identifiers can be tricky, especially as in most cases we don't have page-level identifiers. The Biodiversity Heritage Library (BHL) does have URLs for each scanned page, and we have a standard for referring to pages in a PDF (page=<pageNum>, see RFC 8118). But how do we refer to a set of pages? Do we pick the first page? Do we try to represent a set of pages, and if so, how?
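To make the "set of pages" problem concrete, here is a minimal Python sketch that expands the page part of a microcitation into individual page numbers and builds one PDF link per page using the page=<pageNum> fragment. The function names, the example URL, and the offset parameter are my own assumptions for illustration. In particular, the fragment counts pages within the PDF file, which rarely match printed page numbers, so the offset has to be worked out for each PDF. Figures and plates are simply skipped here.

```python
import re

def parse_pages(microcitation: str) -> list[int]:
    """Expand e.g. "12-14, 17, pl. 3" into [12, 13, 14, 17].

    Only plain pages and page ranges are handled; anything else
    (figures, plates) is ignored in this sketch.
    """
    pages = set()
    for part in microcitation.split(","):
        part = part.strip()
        # A range such as "12-14" (hyphen or en dash)
        m = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if m:
            start, end = int(m.group(1)), int(m.group(2))
            pages.update(range(start, end + 1))
        elif re.fullmatch(r"\d+", part):
            pages.add(int(part))
    return sorted(pages)

def pdf_page_links(pdf_url: str, pages: list[int], offset: int = 0) -> list[str]:
    """Build one link per page using the PDF fragment page=<pageNum>.

    offset (an assumption) converts printed page numbers to PDF page numbers.
    """
    return [f"{pdf_url}#page={p + offset}" for p in pages]

pages = parse_pages("12-14, 17, pl. 3")
links = pdf_page_links("https://example.org/work.pdf", pages, offset=2)
```

This gives a list of links rather than a single identifier, which is exactly the problem: there is still nothing that identifies the set of pages as a whole.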

Another issue with page-level identifiers is that not everything on a given page may be relevant to the taxonomic name. In case 2 above I've shaded in the parts of the pages and figure that refer to the taxonomic name. An example where this can be problematic is the recent test case I created for BHL where a page image was included for the taxonomic name Aphrophora impressa. The image includes the species description and an illustration, as well as text that relates to other species.

Given that not everything on a page need be relevant, we could extract just the relevant blocks of text and illustrations (e.g., paragraphs of text, panels within a figure, etc.) and treat that set of elements as the thing to cite. This is, of course, what Plazi are doing. The set of extracted blocks is glued together as a "treatment", assigned an identifier (often a DOI), and treated as a citable unit. It would be interesting to see to what extent these treatments are actually cited. For example, do subsequent revisions that cite works that include treatments cite those treatments, or just the work itself? Put another way, are we creating "threads" between taxonomic revisions?

One reason for these notes is that I'm exploring uploading links between taxonomic names and the literature to ChecklistBank. Case 1 above is easy, as is case 3 (if we have treatment-level identifiers). But case 2 is problematic because we are linking to a set of things that may not have an identifier, which means a decision has to be made about which page to link to, and how to refer to that page.
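The decision for each name can be sketched roughly as follows. This is only an illustration of the three cases, not anything ChecklistBank requires: the NameLink record, its field names, the placeholder DOI, and the "first page" policy for case 2 are all assumptions of mine.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NameLink:
    """Hypothetical record linking a taxonomic name to the literature."""
    name: str
    work_doi: Optional[str] = None       # identifier for the whole work
    treatment_id: Optional[str] = None   # treatment-level identifier, if any
    pages: list[int] = field(default_factory=list)  # pages from a microcitation

def choose_target(link: NameLink) -> tuple[str, Optional[str], Optional[int]]:
    """Return (case, identifier, page) for a name-literature link."""
    if link.treatment_id:
        # Case 3: the treatment is itself a citable unit
        return ("case 3", link.treatment_id, None)
    if link.pages:
        # Case 2: the set of pages has no identifier of its own, so this
        # sketch arbitrarily picks the first page (one possible policy)
        return ("case 2", link.work_doi, min(link.pages))
    if link.work_doi:
        # Case 1: link the name to the work as a whole
        return ("case 1", link.work_doi, None)
    return ("unlinked", None, None)

# Placeholder DOI, for illustration only
example = NameLink("Aphrophora impressa", work_doi="10.1234/example", pages=[14, 12, 13])
result = choose_target(example)
```

Picking the first page loses the other pages in the microcitation, which is the trade-off that makes case 2 awkward.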