Thursday, September 29, 2022

The ideal taxonomic journal

These are just some random notes on an “ideal” taxonomic journal, inspired in part by some recent discussions on “turbo-taxonomy” (e.g., https://doi.org/10.3897/zookeys.1087.76720 and https://doi.org/10.1186/1742-9994-10-15), and also by examples such as the Australian Journal of Taxonomy https://doi.org/10.54102/ajt.qxi3r, which seems well-intentioned but limited.

XML

One approach is to have highly structured text that embeds detailed markup, and ideally a tool that generates markup in XML. This is the approach taken by Pensoft. There is an inevitable trade-off between the burden on authors of marking up text versus making the paper machine readable. In some ways this seems misplaced effort given that there is little evidence that publications by themselves have much value (see The Business of Extracting Knowledge from Academic Publications). “Value” in this case means as a source of data or factual statements that we can compute over. Human-readable text is not a good way to convey this sort of information.

It’s also interesting that many editing tools are going in the opposite direction: for example, there are minimalist tools using Markdown where the goal is to get out of the author’s way, rather than impose a way of writing. Text is written by humans for humans, so the tools should be human-friendly.

The idea of publishing using XML is attractive in that it gives you XML that can be archived by, say, PubMed Central, but other than that the value seems limited. A cursory glance at download stats for journals that provide PDF and XML downloads, such as PLoS One and ZooKeys, shows that PDF is by far the more popular format. So arguably there is little value in providing XML. Those who have tried to use JATS-XML as an authoring tool have not had a happy time: How we tried to JATS XML. However, there are various tools to help with the process, such as docxToJats, texture, and jats-xml-to-pdf, if this is the route one wants to take.

Automating writing manuscripts

The dream, of course, is to have a tool where you store all your taxonomic data (literature, specimens, characters, images, sequences, media files, etc.) and at the click of a button generate a paper. Certainly some of this can be automated: much nomenclatural and specimen information could be converted to human-readable text. Ideally this computer-generated text would not be edited (otherwise it could get out of sync with the underlying data). The text should be transcluded. As an aside, one way to do this would be to include things such as lists of material examined as images rather than text while the manuscript is being edited. In the same way that you (probably) wouldn’t edit a photograph within your text editor, you shouldn’t be editing data. When the manuscript is published the data-generated portions can then be output as text.
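As a rough illustration of generating text from data, here is a minimal Python sketch that renders specimen records as a "material examined" sentence. The `Specimen` fields, the output style, and the example records are all invented for illustration; a real journal would have its own house style and a much richer data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specimen:
    """A hypothetical, much simplified specimen record."""
    country: str
    locality: str
    date: str
    collector: str
    catalog_number: str
    sex: str = ""


def material_examined(specimens):
    """Render specimen records as a single 'material examined' sentence."""
    lines = []
    # Sort so that the same data always produces the same text.
    for s in sorted(specimens, key=lambda s: (s.country, s.catalog_number)):
        fields = [s.country.upper(), s.locality, s.date, f"leg. {s.collector}"]
        if s.sex:
            fields.append(s.sex)
        lines.append(", ".join(fields) + f" ({s.catalog_number})")
    return "Material examined. " + "; ".join(lines) + "."


# Invented records, for illustration only.
records = [
    Specimen("Finland", "Helsinki", "2020-06-01", "A. Collector", "X-2"),
    Specimen("Brazil", "Manaus", "2019-01-15", "B. Collector", "X-1", "male"),
]
print(material_examined(records))
```

Because the text is a pure function of the records, it can be regenerated whenever the data change, which is the point of not editing it by hand.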

Of course all of this assumes that we have taxonomic data in a database (or some other storage format, including plain text and Markdown, e.g. Obsidian, markdown, and taxonomic trees) that can generate outputs in the various formats that we need.

Archiving data and images

One of the really nice things that Plazi do is have a pipeline that sends taxonomic descriptions and images to Zenodo, and similar data to GBIF. Any taxonomic journal should be able to do this. Indeed, arguably each taxonomic treatment within the paper should be linked to the Zenodo DOI at the time of publication. Indeed, we could imagine ultimately having treatments as transclusions within the larger manuscript. Alternatively we could store the treatments as parts of the larger article (rather like chapters in a book), each with a CrossRef DOI. I’m still sceptical about whether these treatments are as important as we make out, see Does anyone cite taxonomic treatments?. But having machine-readable taxonomic data archived and accessible is a good thing. Uploading the same data to GBIF makes much of that data immediately accessible. Now that GBIF offers hosted portals there is the possibility of having custom interfaces to data from a particular journal.

Name and identifier registration

We would also want automatic registration of new taxonomic names, for which there are pipelines (see “A common registration-to-publication automated pipeline for nomenclatural acts for higher plants (International Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)” https://doi.org/10.3897/zookeys.550.9551). These pipelines do not seem to be documented in much detail, and the data formats differ across registration agencies (e.g., IPNI and ZooBank). For example, ZooBank seems to require TaxPub XML.

Registration of names and identifiers, especially across multiple registration agencies (ZooBank, CrossRef, DataCite, etc.) requires some coordination, especially when one registration agency requires identifiers from another.

Summary

If data is key, then the taxonomic paper itself becomes something of a wrapper around that data. It still serves the function of being human-readable, providing broader context for the work, and as an archive that conforms to currently accepted ways to publish taxonomic names. But in some ways it is the least interesting part of the process.

Written with StackEdit.

Wednesday, September 14, 2022

DNA barcoding as intergenerational transfer of taxonomic knowledge

I tweeted about this but want to bookmark it for later as well. The paper “A molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510 contains the following:

…the annotated barcode records assembled by FinBOL participants represent a tremendous intergenerational transfer of taxonomic knowledge … the time contributed by current taxonomists in identifying and contributing voucher specimens represents a great gift to future generations who will benefit from their expertise when they are no longer able to process new material.

I think this is a very clever way to characterise the project. In an age of machine learning this may be the commonest way to share knowledge, namely as expert-labelled training data used to build tools for others. Of course, this means the expertise itself may be lost, which has implications for updating the models if the data isn’t complete. But it speaks to Charles Godfray’s theme of “Taxonomy as information science”.

Note that the knowledge is also transformed in the sense that the underlying expertise of interpreting morphology, ecology, behaviour, genomics, and the past literature is not what is being passed on. Instead it is probabilities that a DNA sequence belongs to a particular taxon.

This feels different to, say, iNaturalist, where there is a machine learning model to identify images. In that case, the model is built on something the community itself has created, and continues to create. Yes, the underlying idea is the same: “experts” have labelled the data, a model is trained, the model is used. But the benefits of the iNaturalist model are immediately applicable to the people whose data built the model. In the case of barcoding, because the technology itself is still not in the hands of many (relative to, say, digital imaging), the benefits are perhaps less tangible. Obviously researchers working with environmental DNA will find it very useful, but broader impact may await the arrival of citizen science DNA barcoding.

The other consideration is whether the barcoding helps taxonomists. Is it to be used to help prioritise future work (“we are getting lots of unknown sequences in these taxa, let’s do some taxonomy there”), or is it simply capturing the knowledge of a generation that won’t be replaced:

The need to capture such knowledge is essential because there are, for example, no young Finnish taxonomists who can critically identify species in many key groups of arthropods (e.g., aphids, chewing lice, chalcid wasps, gall midges, most mite lineages).

The cycle of collect data, test and refine model, collect more data, rinse and repeat that happens with iNaturalist creates a feedback loop. It’s not clear that a similar cycle exists for DNA barcoding.


Thursday, September 08, 2022

Local global identifiers for decentralised wikis

I've been thinking a bit about how one could use a Markdown wiki-like tool such as Obsidian to work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).

One "gotcha" would be how to name pages. If we treat the database as entirely local, then the page names don't matter, but what if we envisage sharing the database, or merging it with others (for example, if we divided a taxon up into chunks, and different people worked on those different chunks)?

This is the attraction of globally unique identifiers. You and I can independently work on the same thing, such as data linked to a scientific paper, safe in the knowledge that if we both use the DOI for that paper we can easily combine what we've done. But global identifiers can also be a pain, especially if we need to use a service to look them up ("is there a DOI for this paper?", "what is the LSID for this taxonomic name?").

Life would be easier if we could generate identifiers "locally", but had some assurance that they would be globally unique, and that anyone else generating an identifier for the same thing would arrive at the same identifier (this eliminates things such as UUIDs, which are intentionally designed to prevent people generating the same identifier). One approach is "content addressing" (see, e.g. Principles of Content Addressing - dead link but in the Wayback Machine, see also btrask/stronglink). For example, we can generate a cryptographic hash of a file (such as a PDF) and use that as the identifier.
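A minimal sketch of content addressing in Python, using the standard library's `hashlib`. The choice of SHA-1 is an assumption made only because it yields a 40-character hexadecimal string; SHA-256 would be a more cautious choice for new work. The function names are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a hex digest that depends only on the bytes themselves."""
    return hashlib.sha1(data).hexdigest()


def file_content_id(path, chunk_size=65536):
    """Hash a file (e.g. a PDF) in chunks so large files need little memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two people holding byte-identical copies of a file will compute the same identifier without consulting any service. The flip side is that any change to the bytes (a new watermark, a re-saved PDF) yields a different identifier.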

Now the problem is that we have globally unique, but ugly and unfriendly identifiers (such as "6c98136eba9084ea9a5fc0b7693fed8648014505"). What we need are nice, easy to use identifiers we can use as page names. Wikispecies serves as a possible role model, where taxon names serve as page names, as do simplified citations (e.g., authors and years). This model runs into the problem that taxon names aren't unique, nor are author + year combinations. In Wikispecies this is resolved by having a centralised database where it's first come, first served. If there is a name clash you have to create a new name for your page. This works, but what if you have multiple databases run by different people? How do we ensure the identifiers are the same?

Then I remembered Roger Hyam's flight of fantasy over a decade ago: SpeciesIndex.org – an impractical, practical solution. He proposed the following rules to generate a unique URI for a taxonomic name:

  • The URI must start with "http://speciesindex.org" followed by one or more of the following separated by slashes.
  • First word of name. Must only contain letters. Must not be the same as one of the names of the nomenclatural codes (icbn or iczn). Optional but highly recommended.
  • Second word of name. Must only contain letters and not be a nomenclatural code name. Optional.
  • Third word of name. Must only contain letters and not be a nomenclatural code name. Optional.
  • Year of publication. Must be an integer greater than 1650 and equal to or less than the current year. If this is an ICZN name then this should be the year the species (epithet) was published as is commonly cited after the name. If this is an ICBN name at species or below then it is the date of the combination. Optional. Recommended for zoological names if known. Not recommended for botanical names unless there is a known problem with homonyms in use by non-taxonomists.
  • Nomenclatural code governing the name of the taxon. Currently this must be either 'icbn' or 'iczn'. This may be omitted if the code is unknown or not relevant. Other codes may be added to this list.
  • Qualifier This must be a Version 4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for a taxon for which the conventional name is unknown or does not exist or to indicate a particular taxon concept that bears the embedded name.
  • The whole speciesindex.org URI string should be considered case sensitive. Everything should be lower case apart from the first letter of words that are specified as having upper case in their relevant codes e.g. names at and above the rank of genus.
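The rules above can be sketched as a small validating function. This is my reading of the rules, not Roger's code: the function name and signature are hypothetical, it assumes the parts appear in the order the rules list them, and it leaves capitalisation to the caller.

```python
import datetime
import uuid

BASE = "http://speciesindex.org"
CODES = ("icbn", "iczn")


def speciesindex_uri(words=(), year=None, code=None, qualifier=None):
    """Build a speciesindex.org-style URI from name words, year, code, qualifier."""
    if len(words) > 3:
        raise ValueError("at most three name words are allowed")
    segments = []
    for word in words:
        if not (word.isascii() and word.isalpha()):
            raise ValueError(f"name word must contain only letters: {word!r}")
        if word.lower() in CODES:
            raise ValueError(f"name word must not be a code name: {word!r}")
        segments.append(word)
    if year is not None:
        if not 1650 < year <= datetime.date.today().year:
            raise ValueError(f"year out of range: {year}")
        segments.append(str(year))
    if code is not None:
        if code not in CODES:
            raise ValueError(f"unknown nomenclatural code: {code!r}")
        segments.append(code)
    if qualifier is not None:
        parsed = uuid.UUID(qualifier)  # raises ValueError if not a UUID
        if parsed.version != 4:
            raise ValueError("qualifier must be a version 4 UUID")
        segments.append(str(parsed))
    if not segments:
        raise ValueError("at least one part is required after the base URI")
    return "/".join([BASE] + segments)


print(speciesindex_uri(["Patu", "jidanweishi"], year=2009, code="iczn"))
```

The example uses Patu jidanweishi Miller, Griswold & Yin, 2009, a spider name and hence governed by the zoological code.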

Roger is basically arguing that while names aren't unique (i.e., we have homonyms such as Abronia) they are pretty close to being so, and with a few tweaks we can come up with a unique representation. Another way to think about this is that if we had a database of all taxonomic names, we could construct a trie and for each name find the shortest set of name parts (genus, species, etc.), year, and code that gave us a unique string for that name. In many cases the species name may be all we need, in other cases we may need to add year and/or nomenclatural code to arrive at a unique string.
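To make the "shortest unique string" idea concrete, here is a small sketch that uses simple counting rather than a trie. The record layout is hypothetical, and the years given for the two Abronia homonyms are illustrative values that I have not checked against a nomenclator.

```python
from collections import Counter


def candidate_keys(record):
    """Candidate keys for one name, from shortest to longest."""
    name = "/".join(record["words"])
    year, code = str(record["year"]), record["code"]
    return [name, f"{name}/{year}", f"{name}/{code}", f"{name}/{year}/{code}"]


def shortest_unique_keys(records):
    """For each record, pick the first candidate key that no other record shares."""
    counts = Counter(key for r in records for key in candidate_keys(r))
    return [
        next((k for k in candidate_keys(r) if counts[k] == 1), None)
        for r in records
    ]


names = [
    {"words": ["Abronia"], "year": 1789, "code": "icbn"},  # illustrative year
    {"words": ["Abronia"], "year": 1838, "code": "iczn"},  # illustrative year
    {"words": ["Patu", "jidanweishi"], "year": 2009, "code": "iczn"},
]
print(shortest_unique_keys(names))
```

A result of `None` would mean that even name, year, and code together are ambiguous, which is where the UUID qualifier in the rules above would come in. Note that the answer depends on which names are in the database, so two people with different databases could still disagree on the shortest key.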

What about bibliographic references? Well many of us will have databases (e.g., Endnote, Mendeley, Zotero, etc.) which generate "cite keys". These are typically short, memorable identifiers for a reference that are unique within that database. There is an interesting discussion on the JabRef forum regarding a "Universal Citekey Generator", and source code is available at cparnot/universal-citekey-js. I've yet to explore this in detail, but it looks like a promising way to generate unique identifiers from basic metadata (echoes of more elaborate schemes such as SICIs). For example,

Senna AR, Guedes UN, Andrade LF, Pereira-Filho GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda: Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.
becomes "Senna:2021ck". So if two people have the same, core, metadata for a paper they can generate the same key.
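The general idea (same core metadata in, same key out) can be sketched as follows. To be clear, this is not the Universal Citekey algorithm and it will not reproduce "Senna:2021ck"; the normalisation rules and the two-letter hash suffix are assumptions made purely for illustration.

```python
import hashlib
import re
import string
import unicodedata


def normalise(text):
    """Lower-case, strip accents, and collapse punctuation and spacing."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def local_citekey(family_name, year, title):
    """Derive a short key from first author, year, and a hash of the title."""
    digest = hashlib.sha256(normalise(title).encode("ascii")).digest()
    # Map the first two hash bytes to letters to disambiguate same author + year.
    suffix = "".join(string.ascii_lowercase[b % 26] for b in digest[:2])
    family = normalise(family_name).replace(" ", "").capitalize()
    return f"{family}:{year}{suffix}"


key = local_citekey(
    "Senna",
    2021,
    "A new species of amphipod Pariphinotus Kunkel, 1910 "
    "(Amphipoda: Phliantidae) from Southwestern Atlantic",
)
print(key)
```

Normalising before hashing matters: without it, trivial differences in capitalisation or punctuation between two people's databases would produce different keys for the same paper.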

Hence it seems with a few conventions (and maybe some simple tools to support them) we could have decentralised wiki-like tools that used the same identifiers for the same things, and yet those identifiers were short and human-friendly.

Thursday, September 01, 2022

Does anyone cite taxonomic treatments?

Taxonomic treatments have come up in various discussions I'm involved in, and I'm curious as to whether they are actually being used, in particular, whether they are actually being cited. Consider the following quote:
The taxa are described in taxonomic treatments, well defined sections of scientific publications (Catapano 2019). They include a nomenclatural section and one or more sections including descriptions, material citations referring to studied specimens, or notes ecology and behavior. In case the treatment does not describe a new discovered taxon, previous treatments are cited in the form of treatment citations. This citation can refer to a previous treatment and add additional data, or it can be a statement synonymizing the taxon with another taxon. This allows building a citation network, and ultimately is a constituent part of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects https://doi.org/10.3897/rio.8.e93709

"Traditional" academic citation is from article to article. For example, consider these two papers:

Li Y, Li S, Lin Y (2021) Taxonomic study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae). ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935
Miller J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan, Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers. ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160

Li et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken the citation such that it does not appear correctly either on their web page or in CrossRef).

So, we have this link: [article]10.3897/zookeys.1072.67935 --cites--> [article]10.3897/zookeys.11.160. One article cites another.

In their 2021 paper Li et al. discuss Patu jidanweishi Miller, Griswold & Yin, 2009:

There is a treatment for the original description of Patu jidanweishi at https://doi.org/10.5281/zenodo.3792232, which was created by Plazi with a time stamp "2020-05-06T04:59:53.278684+00:00". The original publication date was 2009; the treatments are being added retrospectively.

In an ideal world my expectation would be that Li et al. 2021 would have cited the treatment, instead of just providing the text string "Patu jidanweishi Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F, 70A–F and 71A–F (♂♀)." Isn't the expectation under the treatment model that we would have seen this relationship:

[article]10.3897/zookeys.1072.67935 --cites--> [treatment]https://doi.org/10.5281/zenodo.3792232

Furthermore, if it is the case that "[i]n case the treatment does not describe a new discovered taxon, previous treatments are cited in the form of treatment citations" then we should also see a citation between treatments, in other words Li et al.'s 2021 treatment of Patu jidanweishi (which doesn't seem to have a DOI but is available on Plazi's web site as https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74) should also cite the original treatment? It doesn't - but it does cite the Miller et al. paper.

So in this example we don't see articles citing treatments, nor do we see treatments citing treatments. Playing Devil's advocate, why then do we have treatments? Doesn't the lack of citations suggest that - despite some taxonomists saying this is the unit that matters - it actually doesn't? If we pay attention to what people do rather than what they say they do, they cite articles.

Now, there are all sorts of reasons why we don't see [article] -> [treatment] citations, or [treatment] -> [treatment] citations. Treatments are being added after the fact by Plazi, not by the authors of the original work. And in many cases the treatments that could be cited haven't appeared until after that potentially citing work was published. In the example above the Miller et al. paper dates from 2009, but the treatment extracted only went online in 2020. And while there is a long standing culture of citing publications (ideally using DOIs) there isn't an equivalent culture of citing treatments (beyond the simple text strings).

Obviously this is but one example. I'd need to do some exploration of the citation graph to get a better sense of citation patterns, perhaps using CrossRef's event data. But my sense is that taxonomists don't cite treatments.

I'm guessing Plazi would respond by saying treatments are cited, for example (indirectly) in GBIF downloads. This is true, although arguably people aren't citing the treatment, they're citing specimen data in those treatments, and that specimen data could be extracted at the level of articles rather than treatments. In other words, it's not the treatments themselves that people are citing.

To be clear, I think there is value in being able to identify those "well defined sections" of a publication that deal with a given taxon (i.e., treatments), but it's not clear to me that these are actually the citable units people might hope them to be. Likewise, journals such as ZooKeys have DOIs for individual figures. Does anyone actually cite those?

Wednesday, August 24, 2022

Can we use the citation graph to measure the quality of a taxonomic database?

More arm-waving notes on taxonomic databases. I've started to add data to ChecklistBank and this has got me thinking about the issue of data quality. When you add data to ChecklistBank you are asked to give a measure of confidence based on the Catalogue of Life Checklist Confidence system of one to five stars: ★ - ★★★★★. I'm sceptical about the notion of confidence or "trust" when it is reduced to a star system (see also Can you trust EOL?). I could literally pick any number of stars; there's no way to measure what number of stars is appropriate. This feeds into my biggest reservation about the Catalogue of Life: it's almost entirely authority based, not evidence based. That is, rather than give us evidence for why a particular taxon is valid, we are (mostly) just given a list of taxa and are asked to accept those as gospel, based on assertions by one or more authorities. I'm not necessarily doubting the knowledge of those making these lists, it's just that I think we need to do better than the "these are the accepted taxa because I say so" implicit in the Catalogue of Life.

So, is there any way we could objectively measure the quality of a particular taxonomic checklist? Since I have a long standing interest in linking the primary taxonomic literature to names in databases (since that's where the evidence is), I keep wondering whether measures based on that literature could be developed.

I recently revisited the fascinating (and quite old) literature on rates of synonymy:

Gaston Kevin J. and Mound Laurence A. 1993 Taxonomy, hypothesis testing and the biodiversity crisis. Proc. R. Soc. Lond. B. 251: 139–142 http://doi.org/10.1098/rspb.1993.0020
Andrew R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy, Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, https://doi.org/10.1093/sysbio/44.1.93

A key point these papers make is that the observed rate of synonymy is quite high (that is, many "new species" end up being merged with already known species), and that because it can take time to discover that a species is a synonym the actual rate may be even higher. In other words, in diagrams like the one reproduced below, the reason the proportion of synonyms declines the nearer we get to the present day (this paper came out in 1995) is not because we are creating fewer synonyms but because we've not yet had time to do the work to uncover the remaining synonyms.

Put another way, these papers are arguing that the real work of taxonomy is revision, not species discovery, especially since it's not uncommon for > 50% of species in a taxon to end up being synonymised. Indeed, if a taxonomic group has few synonyms then these authors would argue that's a sign of neglect. More revisionary work would likely uncover additional synonyms. So, what we need is a way to measure the amount of research on a taxonomic group. It occurs to me that we could use the citation graph as a way to tackle this. Let's imagine we have a set of taxa (say a family) and we have all the papers that described new species or undertook revisions (or both). The extensiveness of that work could be measured by the citation graph. For example, build the citation graph for those papers. How many original species descriptions are not cited? Those species have been potentially neglected. How many large-scale revisions have there been (as measured by the numbers of taxonomic papers those revisions cite)? There are some interesting approaches to quantifying this, such as using hubs and authorities.
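As a first stab at what such measures might look like, here is a small pure-Python sketch on a toy citation graph: it lists uncited descriptions and computes hub and authority scores by simple power iteration (the HITS algorithm). The paper identifiers and citation links are invented, and a real analysis would use a graph library and real citation data.

```python
def uncited_descriptions(descriptions, citations):
    """Descriptions that no paper in the citation list cites."""
    cited = {target for _, target in citations}
    return sorted(d for d in descriptions if d not in cited)


def hits(papers, citations, iterations=50):
    """Hub and authority scores; citations are (citing, cited) pairs."""
    papers = list(papers)
    known = set(papers)
    edges = [(s, t) for s, t in citations if s in known and t in known]
    hub = {p: 1.0 for p in papers}
    auth = {p: 1.0 for p in papers}
    for _ in range(iterations):
        # A good authority is cited by good hubs.
        auth = {p: sum(hub[s] for s, t in edges if t == p) for p in papers}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A good hub cites good authorities.
        hub = {p: sum(auth[t] for s, t in edges if s == p) for p in papers}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Toy data: three species descriptions and two revisions.
descriptions = ["desc_A", "desc_B", "desc_C"]
papers = descriptions + ["revision_1", "revision_2"]
citations = [
    ("revision_1", "desc_A"),
    ("revision_1", "desc_B"),
    ("revision_2", "desc_A"),
]
hub, auth = hits(papers, citations)
print(uncited_descriptions(descriptions, citations))
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this reading, revisions that cite many descriptions show up as hubs, heavily revisited descriptions show up as authorities, and descriptions that nobody cites are candidates for neglect.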

I'm aware that taxonomists have not had the happiest relationship with citations:

Pinto ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617. doi: 10.1098/rsos.201617. PMID: 33972859; PMCID: PMC8074659.
Still, I think there is an intriguing possibility here. For this approach to work, we need to have linked taxonomic names to publications, and have citation data for those publications. This is happening on various platforms. Wikidata, for example, is becoming a repository of the taxonomic literature, some of it with citation links.
Page RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 https://doi.org/10.7717/peerj.13712
Time for some experiments.

Monday, August 22, 2022

Linking taxonomic names to the literature

Just some thoughts as I work through some datasets linking taxonomic names to the literature.

In the diagram above I've tried to capture the different situations I encounter. Much of the work I've done on this has focussed on case 1 in the diagram: I want to link a taxonomic name to an identifier for the work in which that name was published. In practice this means linking names to DOIs. This has the advantage of linking to a citable identifier, raising questions such as whether citations of taxonomic papers by taxonomic databases could become part of a taxonomist's Google Scholar profile.

In many taxonomic databases full work-level citations are not the norm, instead taxonomists cite one or more pages within a work that are relevant to a taxonomic name. These "microcitations" (what the U.S. legal profession refer to as "point citations" or "pincites", see What are pincites, pinpoints, or jump legal references?) require some work to map to the work itself (which is typically the thing that has a citable identifier such as a DOI).

Microcitations (case 2 in the diagram above) can be quite complex. Some might simply mention a single page, but others might list a series of (not necessarily contiguous) pages, as well as figures, plates etc. Converting these to citable identifiers can be tricky, especially as in most cases we don't have page-level identifiers. The Biodiversity Heritage Library (BHL) does have URLs for each scanned page, and we have a standard for referring to pages in a PDF (page=<pageNum>, see RFC 8118). But how do we refer to a set of pages? Do we pick the first page? Do we try and represent a set of pages, and if so, how?
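One possible answer, sketched below, is to expand the page part of a microcitation into individual pages and emit one `page=` link per page. The sketch assumes the input contains only page numbers and ranges (figure and plate references would need stripping first), and the `offset` parameter, which maps printed page numbers to positions in the PDF, is a hypothetical convenience. The URL is a placeholder.

```python
import re


def parse_pages(page_string):
    """Expand a string such as '12-14, 17' into a sorted list of page numbers."""
    pages = set()
    for start, end in re.findall(r"(\d+)(?:\s*[-–]\s*(\d+))?", page_string):
        first, last = int(start), int(end or start)
        pages.update(range(first, last + 1))
    return sorted(pages)


def pdf_page_links(pdf_url, pages, offset=0):
    """One URL per page, using the page=<pageNum> fragment for PDFs."""
    return [f"{pdf_url}#page={page + offset}" for page in pages]


pages = parse_pages("12-14, 17")
links = pdf_page_links("https://example.org/paper.pdf", pages, offset=-8)
print(pages)
print(links[0])
```

Picking "the first page" then just means taking the first link, while keeping the full list preserves the non-contiguous set. Neither option says which part of each page is relevant.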

Another issue with page-level identifiers is that not everything on a given page may be relevant to the taxonomic name. In case 2 above I've shaded in the parts of the pages and figure that refer to the taxonomic name. An example where this can be problematic is the recent test case I created for BHL where a page image was included for the taxonomic name Aphrophora impressa. The image includes the species description and an illustration, as well as text that relates to other species.

Given that not everything on a page need be relevant, we could extract just the relevant blocks of text and illustrations (e.g., paragraphs of text, panels within a figure, etc.) and treat that set of elements as the thing to cite. This is, of course, what Plazi are doing. The set of extracted blocks is glued together as a "treatment", assigned an identifier (often a DOI), and treated as a citable unit. It would be interesting to see to what extent these treatments are actually cited, for example, do subsequent revisions that cite works that include treatments cite those treatments, or just the work itself? Put another way, are we creating "threads" between taxonomic revisions?

One reason for these notes is that I'm exploring uploading taxonomic name - literature links to ChecklistBank and case 1 above is easy, as is case 3 (if we have treatment-level identifiers). But case 2 is problematic because we are linking to a set of things that may not have an identifier, which means a decision has to be made about which page to link to, and how to refer to that page.

Wednesday, August 03, 2022

Papers citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life

Quick notes to self following on from a conversation about linking taxonomic names to the literature. There are different sorts of citation:
  1. Paper cites another paper
  2. Paper cites a dataset
  3. Dataset cites a paper
Citation type (1) is largely a solved problem (although there are issues of the ownership and use of this data, see e.g. Zootaxa has no impact factor). Citation type (2) is becoming more widespread (but not perfect, as GBIF's #citethedoi campaign demonstrates). But the idea is well accepted and there are guides to how to do it, e.g.:
Cousijn, H., Kenall, A., Ganley, E. et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259 (2018). https://doi.org/10.1038/sdata.2018.259
However, things do get problematic because most (but not all) DOIs for publications are managed by CrossRef, which has an extensive citation database linking papers to other papers. Most datasets have DataCite DOIs, and DataCite manages its own citation links, but as far as I'm aware these two systems don't really talk to each other. Citation type (3) is the case where a database is largely based on the literature, which applies to taxonomy. Taxonomic databases are essentially collections of literature that have opinions on taxa, and the database may simply compile those (e.g., a nomenclator), or come to some view on the applicability of each name. In an ideal world, each reference included in a taxonomic database would gain a citation, which would help better reflect the value of that work (a long standing bone of contention for taxonomists). It would be interesting to explore these issues further. CrossRef and DataCite do share Event Data (see also DataCite Event Data). Can this track citations of papers by a dataset? My take on Wayne's question:
Is there a way to turn those links into countable citations (even if just one per database) for Google Scholar?
is that he is after type 3 citations, which I don't think we have a way to handle just yet (but I'd need to look at Event Data a bit more). Google Scholar is a black box, and the academic community's reliance on it for metrics is troubling. But it would be interesting to try and figure out if there is a way to get Google Scholar to index the citations of taxonomic papers by databases. For instance, the Catalogue of Life has an ISSN 2405-884X so it can be treated as a publication. At the moment its web pages have lots of identifiers for people managing data and their organisations (lots of ORCIDs and RORs, and DOIs for individual datasets, e.g., checklistbank.org) but precious little in the way of DOIs for publications (or, indeed, ORCIDs for taxonomists). What would it take for taxonomic publications in the Catalogue of Life to be treated as first class citations?

Friday, May 27, 2022

Round trip from identifiers to citations and back again

Note to self (basically rewriting last year's Finding citations of specimens).

Bibliographic data supports going from identifier to citation string and back again, so we can do a "round trip."

1.

Given a DOI we can get structured data with a simple HTTP fetch, then use a tool such as citation.js to convert that data into a human-readable string in a variety of formats.

Identifier --> Structured data --> Human readable string
10.7717/peerj-cs.214 --HTTP with content-negotiation--> CSL-JSON --CSL templates--> Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214
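A minimal Python sketch of this first direction, using only the standard library. Fetching CSL-JSON relies on DOI content negotiation with the `application/vnd.citationstyles.csl+json` media type. The formatter is a hand-rolled stand-in for a proper CSL processor such as citation.js, so it handles only the simple journal-article case, and the `SAMPLE` record is typed in by hand from the citation above rather than fetched.

```python
import json
import urllib.request


def fetch_csl_json(doi):
    """Resolve a DOI to CSL-JSON using HTTP content negotiation."""
    request = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def initials(given):
    """'Lars G.' or 'L. G.' both become 'L. G.'."""
    return " ".join(part[0] + "." for part in given.split())


def format_citation(csl):
    """Very simplified APA-like string for a journal article."""
    authors = ", ".join(
        f"{a['family']}, {initials(a.get('given', ''))}" for a in csl.get("author", [])
    )
    year = csl["issued"]["date-parts"][0][0]
    return (
        f"{authors} ({year}). {csl['title']}. {csl['container-title']}, "
        f"{csl['volume']}, {csl['page']}. https://doi.org/{csl['DOI']}"
    )


# Hand-typed CSL-JSON fragment so the example runs without network access.
SAMPLE = {
    "author": [{"family": "Willighagen", "given": "L. G."}],
    "issued": {"date-parts": [[2019]]},
    "title": "Citation.js: a format-independent, modular bibliography tool "
             "for the browser and command line",
    "container-title": "PeerJ Computer Science",
    "volume": "5",
    "page": "e214",
    "DOI": "10.7717/peerj-cs.214",
}
print(format_citation(SAMPLE))
```

With network access, `format_citation(fetch_csl_json("10.7717/peerj-cs.214"))` should give a similar string, although the fields actually returned depend on the registration agency.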

2.

Going in the reverse direction (string to identifier) is a little more challenging. In the "old days" a typical strategy was to attempt to parse the citation string into structured data (see AnyStyle for a nice example of this), then we could extract a tuple of (journal, volume, starting page) and use that to query CrossRef to find if there was an article with that tuple, which gave us the DOI.

Identifier <-- Structured data <-- Human readable string
10.7717/peerj-cs.214 <--OpenURL query-- journal, volume, start page <--Citation parser-- Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214

3.

Another strategy is to take all the citation strings for each DOI, index those in a search engine, then just use a simple search to find the best match to your citation string, and hence the DOI. This is what https://search.crossref.org does.

Identifier <-- Human readable string
10.7717/peerj-cs.214 <--search-- Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214
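The search strategy can be tried from a script. The sketch below uses the CrossRef REST API's `query.bibliographic` parameter; that this behaves comparably to the search.crossref.org web site is my assumption, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def crossref_search_url(citation, rows=1):
    """Build a CrossRef REST API query for a free-text citation string."""
    query = urllib.parse.urlencode({"query.bibliographic": citation, "rows": rows})
    return f"https://api.crossref.org/works?{query}"


def best_doi(citation):
    """Return the DOI of the top search hit, or None if nothing was found."""
    with urllib.request.urlopen(crossref_search_url(citation), timeout=30) as response:
        items = json.load(response)["message"]["items"]
    return items[0]["DOI"] if items else None


print(crossref_search_url("Citation.js 2019"))
```

A search engine always returns its best guess, so the top hit is not necessarily the right paper. In practice one would compare the returned metadata with the query string (or apply a threshold to the relevance score) before accepting the DOI.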

At the moment my work on material citations (i.e., lists of specimens in taxonomic papers) is focussing on 1 (generating citations from specimen data in GBIF) and 2 (parsing citations into structured data).

Wednesday, May 11, 2022

Thoughts on TreeBASE dying(?)

So it looks like TreeBASE is in trouble, its legacy Java code a victim of security issues. Perhaps this is a chance to rethink TreeBASE, assuming that a repository of published phylogenies is still considered a worthwhile thing to have (and I think that question is open).

Here's what I think could be done.

  1. The data (individual studies with trees and data) are packaged into whatever format is easiest (NEXUS, XML, JSON) and uploaded to a repository such as Zenodo for long term storage. They get DOIs for citability. This becomes the default storage for TreeBASE.
  2. The data is transformed into JSON and indexed using Elasticsearch. A simple web interface is placed on top so that people can easily find trees (never a strong point of the original TreeBASE). Trees are displayed natively on the web using SVG. The number one goal is for people to be able to find trees, view them, and download them.
  3. To add data to TreeBASE the easiest way would be for people to upload them direct to Zenodo and tag them "treebase". A bot then grabs a feed of these datasets and adds them to the search engine in (2) above. As time allows, add an interface where people upload data directly, it gets curated, then deposited in Zenodo. This presupposes that there are people available to do curation. Maybe have "stars" for the level of curation so that users know whether anyone has checked the data.

There's lots of details to tweak, for example how many of the existing URLs for studies are preserved (some URL mapping), and what about the API? And I'm unclear about the relationship with Dryad.

My sense is that the TreeBASE code is very much of its time (10-15 years ago), a monolithic block of code with SQL, Java, etc. If one was starting from scratch today I don't think this would be the obvious solution. Things have trended towards being simpler, with lots of building blocks now available in the cloud. Need a search engine? Just spin up a container in the cloud and you have one. More and more functionality can be devolved elsewhere.

Another issue is how to support TreeBASE. It has essentially been a volunteer effort to date, with little or no funding. One reason I like having Zenodo as a storage engine is that it takes care of long term sustainability of the data.

I realise that this is all wild arm waving, but maybe now is the time to reinvent TreeBASE?

Updates

It's been a while since I've paid a lot of attention to phylogenetic databases, and it shows. There is a file-based storage system for phylogenies, phylesystem (see "Phylesystem: a git-based data store for community-curated phylogenetic estimates" https://doi.org/10.1093/bioinformatics/btv276), that is sort of what I had in mind, although long term persistence is based on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly horrible-looking JSON transformation of NeXML (NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of NeXML or a JSON transformation seems the obvious storage format. It will probably need some cleaning and simplification if it is to be indexed easily. Looking back over the long history of TreeBASE and phylogenetic databases I'm struck by how much complexity has been introduced over time. I think the tech has gotten in the way sometimes (which might just be another way of saying that I'm not smart enough to make sense of it all).

So we could imagine a search engine that covers both TreeBASE and Open Tree of Life studies.

Basic metadata-based searches would be straightforward, and we could have a user interface that highlights the trees (I think TreeBASE's biggest search rival is a Google image search). The harder problem is searching by tree structure, for which there is an interesting literature without any decent implementations that I'm aware of (as I said, I've been out of this field a while).

So my instinct is we could go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need to think a bit more cleverly about higher taxon and tree based searching. I've always thought that one killer query would be not so much "show me all the trees for my taxon" but "show me a synthesis of the trees for my taxon". Imagine a supertree of recent studies that we could use as a summary of our current knowledge, or a visualisation that summarises where there are conflicts among the trees.

Relevant code and sites

Thursday, April 07, 2022

Obsidian, markdown, and taxonomic trees

Returning to the subject of personal knowledge graphs, Kyle Scheer has an interesting repository of Markdown files that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines (see his blog post for more background).

If you add these files to Obsidian you get a nice visualisation of a taxonomy of academic disciplines. The applications of this to biological taxonomy seem obvious, especially as a tool like Obsidian enables all sorts of interesting links to be added (e.g., we could add links to the taxonomic research behind each node in the taxonomic tree, the people doing that research, etc. - although that would mean we'd no longer have a simple tree).

The more I look at these sorts of simple Markdown-based tools the more I wonder whether we could make more use of them to create simple but persistent databases. Text files seem the most stable, long-lived digital format around, so maybe this would be a way to minimise the inevitable obsolescence of database and server software. Time for some experiments I feel... can we take a taxonomic group, such as mammals, and create a richly connected database purely in Markdown?

Tuesday, February 08, 2022

Duplicate DOIs (again)

This blog post provides some background to a recent tweet where I expressed my frustration about the duplication of DOIs for the same article. I'm going to document the details here.

The DOI that alerted me to this problem is https://doi.org/10.2307/2436688 which is for the article

Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2), 64–67.

This article is hosted by JSTOR at https://www.jstor.org/stable/2436688 which displays the DOI https://doi.org/10.2307/2436688 .

This same article is also hosted by Wiley at https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x with the DOI https://doi.org/10.1002/j.1537-2197.1940.tb14217.x.

Expected behaviour

What should happen is that, if Wiley is going to be the publisher of this content (taking over from JSTOR), the DOI 10.2307/2436688 should redirect to the Wiley page, and the Wiley page should display this DOI (i.e., 10.2307/2436688). If I want to get metadata for this DOI, I should be able to use CrossRef's API to retrieve that metadata, e.g. https://api.crossref.org/v1/works/10.2307/2436688 should return metadata for the article.

What actually happens

Wiley display the same article on their web site with the DOI 10.1002/j.1537-2197.1940.tb14217.x. They have minted a new DOI for the same article! The original JSTOR DOI now resolves to the Wiley page (you can see this using the Handle Resolver), which is what is supposed to happen. However, Wiley should have reused the original DOI rather than mint their own.

Furthermore, while the original DOI still resolves in a web browser, I can't retrieve metadata about that DOI from CrossRef, so any attempt to build upon that DOI fails. However, I can retrieve metadata for the Wiley DOI, i.e. https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x works, but https://api.crossref.org/v1/works/10.2307/2436688 doesn't.

Why does this matter?

For anyone using DOIs as stable links to the literature the persistence of DOIs is something you should be able to rely upon, both for people clicking on links in web browsers and developers getting metadata from those DOIs. The whole rationale of the DOI system is a single, globally unique identifier for each article, and that these DOIs persist even when the publisher of the content changes. If this property doesn't hold, then why would a developer such as myself invest effort in linking using DOIs?

Just for the record, I think CrossRef is great and is a hugely important part of the scholarly landscape. There are lots of things that I do that would be nearly impossible without CrossRef and its tools. But cases like this, where we get massive duplication of DOIs when a publisher takes over an existing journal, fundamentally break the underlying model of stable, persistent identifiers.

Thursday, February 03, 2022

Deduplicating bibliographic data

There are several instances where I have a collection of references that I want to deduplicate and merge. For example, in Zootaxa has no impact factor I describe a dataset of the literature cited by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4), as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1). Given that the same articles may be cited many times, these datasets have lots of duplicates. Similarly, articles in Wikispecies often have extensive lists of references cited, and the same reference may appear on multiple pages (for an initial attempt to extract these references see https://doi.org/10.5281/zenodo.5801661 and https://github.com/rdmpage/wikispecies-parser).

There are several reasons I want to merge these references. If I want to build a citation graph for Zootaxa or Phytotaxa I need to merge references that are the same so that I can accurately count citations. I am also interested in harvesting the metadata to help find those articles in the Biodiversity Heritage Library (BHL), and the literature cited section of scientific articles is a potential goldmine of bibliographic metadata, as is Wikispecies.

After various experiments and false starts I've created a repository https://github.com/rdmpage/bib-dedup to host a series of PHP scripts to deduplicate bibliographic data. I've settled on using CSL-JSON as the format for bibliographic data. Because deduplication relies on comparing pairs of references, the standard format for most of the scripts is a JSON array containing a pair of CSL-JSON objects to compare. Below are the steps the code takes.

Generating pairs to compare

The first step is to take a list of references and generate the pairs that will be compared. I started with this approach as I wanted to explore machine learning and wanted a simple format for training data, such as an array of two CSL-JSON objects and an integer flag representing whether the two references were the same or different.

There are various ways to generate CSL-JSON for a reference. I use a tool I wrote (see Citation parsing tool released) that has a simple API where you parse one or more references and it returns that reference as structured data in CSL-JSON.

Attempting to do all possible pairwise comparisons rapidly gets impractical as the number of references increases, so we need some way to restrict the number of comparisons we make. One approach I've explored is the “sorted neighbourhood method” where we sort the references (for example by their title) then move a sliding window down the list of references, comparing all references within that window. This greatly reduces the number of pairwise comparisons. So the first step is to sort the references, then run a sliding window over them and output all the pairs in each window (ignoring pairwise comparisons already made in a previous window). Other methods of "blocking" could also be used, such as only including references in a particular year, or a particular journal.

So, the output of this step is a set of JSON arrays, each with a pair of references in CSL-JSON format. Each array is stored on a single line in the same file in line-delimited JSON (JSONL).
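To make the sliding window concrete, here is a minimal sketch in Python (the repository itself uses PHP, so this is not the actual code). The function names, the default window size, and sorting on a lower-cased `title` are all assumptions for illustration. Pairing each reference with the next `window - 1` references in sorted order yields every pair that shares a window exactly once.

```python
import json


def sorted_neighbourhood_pairs(references, window=3,
                               key=lambda ref: ref.get("title", "").lower()):
    """Sort references, then pair each one with the next window-1 neighbours.

    This emits every pair that falls inside some sliding window of the given
    size, without repeating pairs already produced by an earlier window.
    """
    ordered = sorted(references, key=key)
    pairs = []
    for i, ref in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((ref, ordered[j]))
    return pairs


def to_jsonl(pairs):
    """Serialise each pair as a two-element JSON array, one pair per line."""
    return "\n".join(json.dumps([a, b]) for a, b in pairs)
```

With n references and window size w this makes roughly n × (w − 1) comparisons instead of n × (n − 1) / 2, at the cost of missing duplicates whose sort keys differ enough to land far apart.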

Comparing pairs

The next step is to compare each pair of references and decide whether they are a match or not. Initially I explored a machine learning approach used in the following paper:

Wilson DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex features to improve genealogical record linkage. In: The 2011 International Joint Conference on Neural Networks. 9–14. DOI: 10.1109/IJCNN.2011.6033192

Initial experiments using https://github.com/jtet/Perceptron were promising and I want to play with this further, but I decided to skip this for now and just use simple string comparison. So for each CSL-JSON object I generate a citation string in the same format using CiteProc, then compute the Levenshtein distance between the two strings. By normalising this distance by the length of the two strings being compared I can use an arbitrary threshold to decide if the references are the same or not.
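A minimal sketch of the string comparison follows. The exact normalisation is not spelled out above, so dividing by the length of the longer string is an assumption here, as is the threshold of 0.2 and the function names.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def normalised_distance(a, b):
    """Scale the distance to 0..1 by the longer string (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def is_match(a, b, threshold=0.2):
    """Treat two citation strings as the same if they are close enough."""
    return normalised_distance(a, b) <= threshold
```

Because both strings are rendered in the same citation style first, differences in the distance mostly reflect differences in the metadata rather than in formatting.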

Clustering

For this step we read the JSONL file produced above and record whether the two references are a match or not. Assuming each reference has a unique identifier (it need only be unique within the file), we can use those identifiers to record the cluster each reference belongs to. I do this using a disjoint-set data structure. We start with a graph where each node represents a reference and has a pointer to a parent node. Initially each reference is its own parent. A simple implementation is an array indexed by reference identifier, where the value of each cell is the node's parent.

As we discover matching pairs we update the parents of the nodes to reflect this, such that once all the comparisons are done we have one or more clusters corresponding to the references that we think are the same. Another way to think of this is that we are getting the components of a graph where each node is a reference and pairs of references that match are connected by an edge.

In the code I'm using I write this graph in Trivial Graph Format (TGF), which can be visualised using a tool such as yEd.
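The clustering step can be sketched as follows. This is an illustrative Python version, not the PHP in the repository: it keeps the parent pointers in a dict keyed by reference identifier, adds path halving as a small optimisation, and writes the match graph as TGF (node lines, a `#` separator, then edge lines). All names are hypothetical.

```python
def find(parent, x):
    """Return the root of x's cluster, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the clusters containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def clusters(ids, matches):
    """Group identifiers into clusters given a list of matching pairs."""
    parent = {i: i for i in ids}  # initially each reference is its own parent
    for a, b in matches:
        union(parent, a, b)
    groups = {}
    for i in ids:
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())


def to_tgf(ids, matches):
    """Write nodes and match edges in Trivial Graph Format."""
    lines = [f"{i} {i}" for i in ids]
    lines.append("#")
    lines += [f"{a} {b}" for a, b in matches]
    return "\n".join(lines)
```

For example, matches `r1`–`r2` and `r2`–`r3` put all three references in one cluster even though `r1` and `r3` were never compared directly, which is exactly the connected-components view described above.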

Merging

Now that we have a graph representing the sets of references that we think are the same we need to merge them. This is where things get interesting as the references are similar (by definition) but may differ in some details. The paper below describes a simple Bayesian approach for merging records:

Councill IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: 10.1145/1141753.1141817.

So the next step is to read the graph with the clusters, generate the sets of bibliographic references that correspond to each cluster, then use the method described in Councill et al. to produce a single bibliographic record for that cluster. These records could then be used to, say, locate the corresponding article in BHL, or populate Wikidata with missing references.

Obviously there is always the potential for errors, such as trying to merge references that are not the same. As a quick and dirty check I flag as dubious any cluster where the page numbers vary among members of the cluster. More sophisticated checks are possible, especially if I go down the ML route (i.e., I would have evidence for the probability that the same reference can disagree on some aspects of metadata).
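As a rough illustration of the merge and the quick check, the sketch below flags a cluster whose members disagree on the CSL-JSON `page` field, and merges a cluster by taking the most common value of each field. The majority vote is a deliberately simple stand-in and is not the Bayesian method of Councill et al.; the function names and the choice of fields are assumptions.

```python
from collections import Counter


def is_dubious(cluster):
    """Flag a cluster whose members disagree on page numbers."""
    pages = {ref.get("page") for ref in cluster if ref.get("page")}
    return len(pages) > 1


def majority_merge(cluster,
                   fields=("title", "container-title", "volume", "page")):
    """Build one record by taking the most common value for each field.

    A simple stand-in for a proper probabilistic merge.
    """
    merged = {}
    for field in fields:
        values = [str(ref[field]) for ref in cluster if ref.get(field)]
        if values:
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged
```

A flagged cluster can still be merged, but the flag marks it as one to inspect by hand before trusting the result.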

Summary

At this stage the code is working well enough for me to play with and explore some example datasets. The focus is on structured bibliographic metadata, but I may simplify things and have a version that handles simple string matching, for example to cluster together different abbreviations of the same journal name.

Sunday, January 02, 2022

Large graph viewer experiments

I keep returning to the problem of viewing large graphs and trees, which means my hard drive has accumulated lots of failed prototypes. Inspired by some recent discussions on comparing taxonomic classifications I decided to package one of these (wildly incomplete) prototypes up so that I can document the idea and put the code somewhere safe.

Google Maps-like viewer

I've created a simple viewer that uses a tiled map viewer (like Google Maps) to display a large graph. The idea is to draw the entire graph scaled to a 256 x 256 pixel tile. The graph is stored in a database that supports geospatial queries, which means the queries to retrieve the individual tiles needed to display the graph at different levels of resolution are simply bounding box queries to a database. I realise that this description is cryptic at best. The GitHub repository https://github.com/rdmpage/gml-viewer has more details and the code itself. There's a lot to do, especially adding support for labels(!), which presents some interesting challenges (levels of detail and generalization). The code doesn't do any layout of the graph itself; instead I've used the yEd tool to compute the x,y coordinates of the graph.
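To unpack the tile idea a little, here is a small sketch of how a tile address could be turned into a bounding box. It assumes the usual web-map convention that zoom level z has 2^z × 2^z tiles and that the whole graph occupies a 256 × 256 coordinate space at zoom 0; the function name is hypothetical and this is not code from the repository.

```python
TILE_SIZE = 256  # the whole graph is scaled to fit one tile at zoom 0


def tile_bounds(zoom, x, y, size=TILE_SIZE):
    """Return (min_x, min_y, max_x, max_y) in graph coordinates for a tile.

    At zoom level z the graph is cut into 2**z by 2**z tiles, so each tile
    covers size / 2**z units of the original coordinate space.
    """
    span = size / (2 ** zoom)
    min_x = x * span
    min_y = y * span
    return (min_x, min_y, min_x + span, min_y + span)
```

The returned box is what would be passed to the geospatial database: everything that intersects it gets drawn, scaled up by 2^z, into the requested tile.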

Since this exercise was inspired by a discussion of the ASM Mammal Diversity Database, the graph I've used for the demonstration above is the ASM classification of extant mammals. I guess I need to solve the labelling issue fairly quickly!