Friday, May 28, 2021

Finding citations of specimens

Note to self.

The challenge of finding specimen citations in papers keeps coming around. It seems that this is basically the same problem as finding citations to papers, and can be approached in much the same way.

If you want to build a database of references from scratch, one way is to scrape citations from papers (e.g., from the "literature cited" section), convert those strings into structured data, and add those to your database. In the early days of bibliographic searching this was a common strategy (and I still use it to help populate Wikidata).

Regular expressions are powerful but also brittle; you need to keep tweaking them to accommodate all the minor ways citation styles can differ. This leads to more sophisticated (and hopefully more robust) approaches, such as machine learning. Conditional random fields (CRF) are a popular technique, pioneered by tools like ParsCit and most recently used in the very elegant anystyle.io. You paste in a citation string and you get back that citation with all the component parts (authors, title, journal, pagination, etc.) separated out. Approaches like this require training data to teach the parser how to recognise the parts of a citation string. One obvious way to generate training data is to have a large bibliographic database and a set of "style sheets" describing all the ways different journals represent citations (e.g., citationstyles.org), and then render records through those styles to produce lots of labelled examples.
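The idea of generating training data from structured records plus style templates can be sketched in a few lines. This is a toy illustration, not a real CRF pipeline: the two templates are invented stand-ins for actual CSL styles, and a real parser would train on the resulting token/label pairs.

```python
import re

# Hypothetical structured record and citation-style templates (stand-ins
# for a real bibliographic database and CSL styles from citationstyles.org).
RECORDS = [
    {"authors": "Smith, J.", "year": "1998", "title": "On beetles",
     "journal": "J. Entomol.", "volume": "12", "pages": "1-10"},
]

STYLES = [
    "{authors} ({year}). {title}. {journal} {volume}: {pages}.",
    "{authors}, {year}. {title}. {journal}, {volume}, {pages}.",
]

def make_training_example(record, style):
    """Render a citation string and label each token with the field it came from."""
    tokens, labels = [], []
    # Split the template into field placeholders and literal punctuation.
    for part in re.split(r"(\{\w+\})", style):
        field = part[1:-1] if part.startswith("{") else None
        text = record[field] if field else part
        for tok in text.split():
            tokens.append(tok)
            labels.append(field or "punct")
    return tokens, labels

tokens, labels = make_training_example(RECORDS[0], STYLES[0])
print(list(zip(tokens, labels)))
```

Each (token, label) pair is exactly the kind of supervision a CRF tagger needs; looping over many records and many styles gives a large, free training set.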

Over time the need for citation parsing has declined somewhat, being replaced by simple full-text search (exemplified by this Tweet).

Again, in the early days a common method of bibliographic search was to search by keys such as journal name (or ISSN), volume number, and starting page. So you had to atomise the reference into its parts, then search for something that matched those parts. This is tedious (OpenURL anyone?), but helps reduce false matches. If you only have a small bibliographic database, searching for a reference by string matching can be frustrating because you are likely to get lots of matches, but none of them to the reference you are actually looking for. Given how search works you'll pretty much always get some sort of match. What really helps is if the database has the answer to your search (this is one reason Google is so great: if you have indexed the whole web, chances are you have the answer somewhere already). Now that CrossRef's database has grown much larger you can search for a reference using a simple string search and be reasonably confident of getting a genuine hit. The need to atomise a reference for searching is disappearing.
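The string-matching idea above can be sketched with simple similarity scoring. This is only a toy: `difflib` stands in for a real search engine's relevance ranking, and the tiny in-memory "database" illustrates why a score threshold matters when the database is small.

```python
from difflib import SequenceMatcher

# A tiny stand-in for a bibliographic database.
DATABASE = [
    "Smith J (1998) On beetles. Journal of Entomology 12: 1-10",
    "Jones A (2001) Ant taxonomy revisited. Myrmecol. News 5: 33-40",
]

def best_match(query, records, threshold=0.6):
    """Return the most similar record, or None if nothing scores well.

    With a small database you will always get *some* top hit, so we only
    trust matches above a cutoff."""
    scored = [(SequenceMatcher(None, query.lower(), r.lower()).ratio(), r)
              for r in records]
    score, record = max(scored)
    return record if score >= threshold else None

# A differently styled citation for the same paper still matches.
hit = best_match("Smith, J. 1998. On beetles. J. Entomol. 12:1-10", DATABASE)
print(hit)
```

The point is that once the database is large enough to contain the answer, a single fuzzy string comparison replaces the whole atomise-then-search-by-keys dance.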

So, armed with a good database and good search tools we can avoid parsing references. Search also opens up other possibilities, such as finding citations using full-text search. Given a reference, how do you find where it's been cited? One approach is to parse the text of a paper (A), extract the references in its "literature cited" section (B, C, D, etc.), match those to a database, and add the "A cites B", "A cites C", etc. links to the database. This will answer "what papers does A cite?" but not "what other papers cite C?". One approach to that question would be to simply take the reference C, convert it to a citation string, then blast through all the full text you could find looking for matches to that citation string - these are likely to be papers that cite reference C. In other words, you are finding that string in the "literature cited" section of the citing papers.

So, to summarise:

  1. To recognise and extract citations as structured data from text we can use regular expressions and/or machine learning.
  2. Training data for machine learning can be generated from existing bibliographic data coupled with rules for generating citation strings.
  3. As bibliographic databases grow in size the need for extracting and parsing citations diminishes. Our databases will have most of the citations already, so that using search is enough to find what we want.
  4. To build a citation database we can parse the literature cited section and extract all references cited by a paper ("X cites").
  5. Another approach to building a citation database is to tackle the reverse question, namely "X is cited by". This can be done by a full text search for citation strings corresponding to X.
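Point 5 can be sketched as a simple scan over full texts. This is a minimal illustration with invented paper ids and toy texts; normalising whitespace and case makes the match slightly more forgiving, though a real system would need fuzzier matching to cope with citation-style variation.

```python
import re

def normalise(s):
    """Lower-case and collapse whitespace so trivial differences don't block a match."""
    return re.sub(r"\s+", " ", s.lower()).strip()

def papers_citing(citation, fulltexts):
    """Return ids of papers whose full text contains the citation string."""
    needle = normalise(citation)
    return [pid for pid, text in fulltexts.items()
            if needle in normalise(text)]

# Toy corpus: ids and texts are invented for illustration.
FULLTEXTS = {
    "paper1": "... Literature cited: Smith J (1998) On beetles. J. Entomol. 12: 1-10 ...",
    "paper2": "... Literature cited: Jones A (2001) Ant taxonomy. ...",
}

print(papers_citing("Smith J (1998) On beetles. J. Entomol. 12: 1-10", FULLTEXTS))
# → ['paper1']
```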

How does this relate to citing specimens you ask? Well, I think the parallels are very close:

  • We could use CRF approaches to have something like anystyle.io for specimens. Paste in a specimen from the "Materials examined" section of a paper and have it resolved into its component parts (e.g., collector, locality, date).
  • We have a LOT of training data in the form of GBIF. Just download data in Darwin Core format, apply various rules for how specimens are cited in the literature, and we have our training data.
  • Using our specimen parser we could process the "Materials examined" section of a paper to find the specimens (Plazi extracts specimens from papers, although it's not clear to me how automated this is.)
  • We could also do the reverse: take a Darwin Core Archive for, say, a single institution, generate all the specimen citation strings you'd expect to see people use in their papers, then go search through the full text of papers (e.g., in PubMed Central and BHL) looking for those strings - those are citations of your specimens.
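The last idea, generating the specimen citation strings you'd expect to see, is easy to sketch from Darwin Core fields. The templates below are invented examples, not actual community conventions; in practice you'd mine real "Materials examined" sections to discover the styles in use.

```python
# Toy Darwin Core record (field names are real Darwin Core terms;
# the values are made up for illustration).
SPECIMEN = {
    "institutionCode": "NHMUK",
    "catalogNumber": "1901.2.3.4",
    "recordedBy": "Darwin",
    "eventDate": "1835-09-17",
    "locality": "Galapagos",
}

# Hypothetical "style sheets" for how people might cite a specimen.
TEMPLATES = [
    "{institutionCode} {catalogNumber}",
    "{institutionCode}-{catalogNumber}",
    "{locality}, {eventDate}, {recordedBy} ({institutionCode} {catalogNumber})",
]

def citation_strings(specimen, templates):
    """Render every template against the specimen's Darwin Core fields."""
    return [t.format(**specimen) for t in templates]

for s in citation_strings(SPECIMEN, TEMPLATES):
    print(s)
```

Run over a whole Darwin Core Archive, these strings become the search terms for scanning PubMed Central or BHL full text.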

There seems a lot of scope for learning from the experience of people working with bibliographic citations, especially how to build parsers, and the role that "stylesheets" could play in helping to understand how people cite specimens. Obviously, a lot of this would be unnecessary if there was a culture of using and citing persistent identifiers for specimens, but we seem to be a long way from that just yet.

Maximum entropy summary trees to display higher classifications

A challenge in working with large taxonomic classifications is how you display them to the user, especially if the user probably doesn't want all the gory details. For example, the Field Guide app to Victorian Fauna has a nice menu of major animal groups:

This includes both taxonomic and ecological categories (e.g., terrestrial, freshwater, etc.) and greatly simplifies the animal tree of life, but it is a user-friendly way to start browsing a larger database of facts about animals. It would be nice if we could automate constructing such lists, especially for animal groups where the choices of what to display might not seem obvious (everyone wants to see birds, but what insect groups would you prioritise?).

One way to help automate these sorts of lists is to use summary trees (see also Karloff, H., & Shirley, K. E. (2013). Maximum Entropy Summary Trees. Computer Graphics Forum, 32(3pt1), 71–80. doi:10.1111/cgf.12094). A summary tree takes a large tree and produces a small summary for k nodes, where k is a number that you supply. In other words, if you want your summary to have 10 nodes then k=10. The diagram below summarises an organisation chart for 43,134 employees.

Summary trees show only a subset of the nodes in the complete tree. All the nodes with a given parent that aren't displayed get aggregated into a newly created "others" node that is attached to that parent. Hence the summary tree alerts the user that there are nodes which exist in the full tree but which aren't shown.
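The "others" aggregation is easy to sketch. Note this is a much cruder greedy version, not the maximum entropy dynamic programme from Karloff & Shirley's paper: for a single parent, keep the heaviest children and collapse the rest into an "n others" node.

```python
def summarise_children(children, k):
    """children: list of (label, weight) pairs.

    Keep the k-1 heaviest children and merge the remainder into a single
    "n others" node carrying their combined weight."""
    ranked = sorted(children, key=lambda c: c[1], reverse=True)
    kept, rest = ranked[:k - 1], ranked[k - 1:]
    if rest:
        kept.append((f"{len(rest)} others", sum(w for _, w in rest)))
    return kept

# Toy weights (species counts are illustrative, not real).
children = [("Aves", 10000), ("Mammalia", 5939), ("Amphibia", 7000),
            ("Reptilia", 9000), ("Collembola", 8000), ("Tardigrada", 1200)]
print(summarise_children(children, 4))
```

The real algorithm chooses which nodes to keep across the whole tree so as to maximise the entropy of the weight distribution over the k displayed nodes, which is what makes the summaries feel balanced rather than dominated by a few giants.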

Code for maximum entropy summary trees is available in C and R from https://github.com/kshirley/summarytrees, so I've been playing with it a little (I don't normally use R but there was little choice here). As an example I created a simple tree for animals, based on the Catalogue of Life. I took a few phyla and classes and built a tree as a CSV file (see the gist). The file lists each node (uniquely numbered), its parent node (the parent of the root of the tree is "0"), a label, and a weight. For an internal node the weight is always 0; for a leaf the weight can be assigned in various ways. By default you could assign each leaf a weight of 1, but if the "leaf" node represents more than one thing (for example, the class Mammalia) then you can give it the number of species in that class (e.g., 5939). You could also assign weights based on some other measure, such as "popularity". In the gist I got bored and only added species counts for a few taxa, everything else was set to 1.

I then loaded the tree into R and found a summary tree for k=30 (the script is in the gist):

This doesn't look too bad (note as I said above, I didn't fill in all the actual species counts because reasons). If I wanted to convert this into a menu such as the one the Victoria Fauna app uses I would simply list the leaf nodes in order, skipping over those labelled "n others", which would give me:

  • Mammalia
  • Amphibia
  • Reptilia
  • Aves
  • Actinopterygii
  • Hemiptera
  • Hymenoptera
  • Lepidoptera
  • Diptera
  • Coleoptera
  • Arachnida
  • Acanthocephala
  • Nemertea
  • Rotifera
  • Porifera
  • Platyhelminthes
  • Nematoda
  • Mollusca

These 18 taxa are not a bad starting point for a menu, especially if we added pictures from PhyloPic to liven it up. There are probably a couple of animal groups that could be added to make it a bit more inclusive.

Because the technique is automated and fast, it would be straightforward to create submenus for major taxa, with the added advantage that you don't need to make decisions based on whether you know anything about that taxonomic group, it can be driven entirely by species counts (for example). We could also use other measures for weights, such as number of Google search hits, size of pages on Wikipedia, etc. So far I've barely scratched the surface of what could be done with this tool.

P.S. The R code is:

library(devtools)
install_github("kshirley/summarytrees", build_vignettes = TRUE)

library(summarytrees)

# Read the tree (node, parent, label, weight columns)
data <- read.table('/Users/rpage/Development/summarytrees/animals.csv', header = TRUE, sep = ",")

# Compute a greedy summary tree with K = 30 nodes
g <- greedy(node = data[, "node"],
            parent = data[, "parent"],
            weight = data[, "weight"],
            label = data[, "label"],
            K = 30)

write.csv(g$summary.trees[[30]], '/Users/rpage/Development/summarytrees/summary.csv')

The gist has the data file, and a simple PHP program to convert the output into a dot file to be viewed with GraphViz.
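For what it's worth, the conversion to dot is only a few lines in any language. Here is a rough Python sketch of the same idea (the gist's actual converter is PHP); the column layout follows the CSV format described above.

```python
def to_dot(rows):
    """rows: (node, parent, label, weight) tuples; returns GraphViz dot text."""
    lines = ["digraph summary {"]
    for node, parent, label, weight in rows:
        lines.append(f'  n{node} [label="{label}"];')
        if parent != "0":                # "0" marks the root's parent
            lines.append(f"  n{parent} -> n{node};")
    lines.append("}")
    return "\n".join(lines)

# Toy rows in the same format as the animals.csv file.
rows = [("1", "0", "Animalia", "0"),
        ("2", "1", "Mammalia", "5939"),
        ("3", "1", "Aves", "10000")]
print(to_dot(rows))
```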

Tuesday, May 18, 2021

Preprint on Wikidata and the bibliography of life

Last week I submitted a manuscript entitled "Wikidata and the bibliography of life". I've been thinking about the "bibliography of life" (AKA a database of every taxonomic publication ever published) for a while, and this paper explores the idea that Wikidata is the place to create this database. The preprint version is on bioRxiv (doi:10.1101/2021.05.04.442638). Here's the abstract:

Biological taxonomy rests on a long tail of publications spanning nearly three centuries. Not only is this literature vital to resolving disputes about taxonomy and nomenclature, for many species it represents a key source - indeed sometimes the only source - of information about that species. Unlike other disciplines such as biomedicine, the taxonomic community lacks a centralised, curated literature database (the “bibliography of life”). This paper argues that Wikidata can be that database as it has flexible and sophisticated models of bibliographic information, and an active community of people and programs (“bots”) adding, editing, and curating that information. The paper also describes a tool to visualise and explore bibliography information in Wikidata and how it links to both taxa and taxonomists.

The manuscript summarises some work I've been doing to populate Wikidata with taxonomic publications (building on a huge amount of work already done), and also describes ALEC, which I use to visualise this content. I've made various (unreleased) knowledge graphs of taxonomic information (and one that I have actually released, Ozymandias), and I'm still torn between whether the future is to invest more effort in Wikidata, or construct lighter, faster, domain-specific knowledge graphs for taxonomy. I think the answer is likely to be "yes".

Meantime, one chart I quite like from the submitted version of this paper is shown below.

It's a chart that is a bit tricky to interpret. My goal was to get a sense of whether bibliographic items added to Wikidata (e.g., taxonomic papers) were actually being edited by the Wikidata community, or whether they just sat there unchanged since they were added. If people are editing these publications, for example, by adding missing author names, linking papers to items for their authors, or adding additional identifiers (such as DOIs, ZooBank identifiers, etc.), then there is clear value in using Wikidata as a repository of bibliographic data. So I grabbed a sample of 1000 publications, retrieved their edit history from Wikidata, and plotted the creation timestamp of each item against the timestamps for each edit made to that item. If items were never edited then every point would fall along the diagonal line. If edits are made, they appear to the right of the diagonal. I could have just counted edits made, but I wanted to visualise those edits. As the chart shows, there is quite a lot of editing activity, so there is a community of people (and bots) curating this content. In many ways this is the strongest argument for using Wikidata for a "bibliography of life". Any database needs curation, which means people, and this is what Wikidata offers: a community of people who care about often esoteric details, and get pleasure from improving structured data.
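The data preparation behind that chart is simple to sketch. Here the revision history is toy data (real timestamps would come from each item's revision history via the Wikidata API); each item contributes one point per revision, pairing its creation timestamp with the revision's timestamp.

```python
def edit_points(revision_history):
    """revision_history: item id -> sorted list of revision timestamps.

    Returns (created, edited) pairs: one point per revision, with x fixed
    at the item's creation time. Points with x == y lie on the diagonal."""
    points = []
    for item, timestamps in revision_history.items():
        created = timestamps[0]
        points.extend((created, t) for t in timestamps)
    return points

# Invented item ids and timestamps for illustration.
HISTORY = {
    "Q1001": ["2019-01-01", "2019-06-01", "2020-02-01"],  # edited twice after creation
    "Q1002": ["2019-03-01"],                              # never edited
}

points = edit_points(HISTORY)
# Items never edited contribute only diagonal points (x == y).
edited = [p for p in points if p[0] != p[1]]
print(len(points), len(edited))
```

Off-diagonal points are exactly the curation activity the chart is designed to reveal.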

There are still huge gaps in Wikidata's coverage of the taxonomic literature. Once you move beyond the "low hanging fruit" of publications with CrossRef DOIs the task of adding literature to Wikidata gets a bit more complicated. Then there is the reconciliation problem: given an existing taxonomic database with a list of references, how do we match those references to the corresponding items in Wikidata? There is still a lot to do.