Thursday, April 07, 2022

Obsidian, markdown, and taxonomic trees

Returning to the subject of personal knowledge graphs, Kyle Scheer has an interesting repository of Markdown files describing academic disciplines at https://github.com/kyletscheer/academic-disciplines (see his blog post for more background).

If you add these files to Obsidian you get a nice visualisation of a taxonomy of academic disciplines. The applications of this to biological taxonomy seem obvious, especially as a tool like Obsidian enables all sorts of interesting links to be added (e.g., we could add links to the taxonomic research behind each node in the taxonomic tree, the people doing that research, etc. - although that would mean we'd no longer have a simple tree).

The more I look at these sorts of simple Markdown-based tools, the more I wonder whether we could make more use of them to create simple but persistent databases. Text files seem the most stable, long-lived digital format around, so maybe this would be a way to minimise the inevitable obsolescence of database and server software. Time for some experiments, I feel... can we take a taxonomic group, such as mammals, and create a richly connected database purely in Markdown?
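
As a very rough sketch of what one node in such a Markdown database might look like (the file name, fields, and links below are hypothetical, but use Obsidian's [[wikilink]] convention, so the graph view would draw the connections automatically):

Mammalia.md

# Mammalia

Parent: [[Chordata]]
Children: [[Monotremata]], [[Marsupialia]], [[Placentalia]]
Key reference: [[Wilson and Reeder 2005]]
Researcher: [[Jane Smith]]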

Tuesday, February 08, 2022

Duplicate DOIs (again)

This blog post provides some background to a recent tweet where I expressed my frustration about the duplication of DOIs for the same article. I'm going to document the details here.

The DOI that alerted me to this problem is https://doi.org/10.2307/2436688 which is for the article

Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2), 64–67.

This article is hosted by JSTOR at https://www.jstor.org/stable/2436688, which displays the DOI https://doi.org/10.2307/2436688.

This same article is also hosted by Wiley at https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x with the DOI https://doi.org/10.1002/j.1537-2197.1940.tb14217.x.

Expected behaviour

What should happen is this: if Wiley is going to be the publisher of this content (taking over from JSTOR), the DOI 10.2307/2436688 should redirect to the Wiley page, and the Wiley page should display this DOI (i.e., 10.2307/2436688). If I want to get metadata for this DOI, I should be able to use CrossRef's API to retrieve that metadata, e.g. https://api.crossref.org/v1/works/10.2307/2436688 should return metadata for the article.

What actually happens

Wiley displays the same article on its web site with the DOI 10.1002/j.1537-2197.1940.tb14217.x. They have minted a new DOI for the same article! The original JSTOR DOI now resolves to the Wiley page (you can see this using the Handle Resolver), which is what is supposed to happen. However, Wiley should have reused the original DOI rather than mint their own.

Furthermore, while the original DOI still resolves in a web browser, I can't retrieve metadata about that DOI from CrossRef, so any attempt to build upon that DOI fails. However, I can retrieve metadata for the Wiley DOI, i.e. https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x works, but https://api.crossref.org/v1/works/10.2307/2436688 doesn't.
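
As a minimal way to see this asymmetry for yourself, here is a sketch (assuming PHP with curl, and using the same CrossRef API route as above) that requests metadata for both DOIs and reports the HTTP status code for each:

<?php
// Compare CrossRef's response for the JSTOR and Wiley DOIs for the same article
$dois = [
    '10.2307/2436688',                      // original JSTOR DOI
    '10.1002/j.1537-2197.1940.tb14217.x'    // Wiley's new DOI
];

foreach ($dois as $doi) {
    $ch = curl_init('https://api.crossref.org/v1/works/' . urlencode($doi));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // 200 means CrossRef has metadata for this DOI, 404 means it doesn't
    echo $doi . ' => HTTP ' . $code . "\n";
}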

Why does this matter?

For anyone using DOIs as stable links to the literature, the persistence of DOIs is something you should be able to rely upon, both for people clicking on links in web browsers and for developers getting metadata from those DOIs. The whole rationale of the DOI system is that there is a single, globally unique identifier for each article, and that these DOIs persist even when the publisher of the content changes. If this property doesn't hold, then why would a developer such as myself invest effort in linking using DOIs?

Just for the record, I think CrossRef is great and is a hugely important part of the scholarly landscape. There are lots of things that I do that would be nearly impossible without CrossRef and its tools. But cases like this, where we get massive duplication of DOIs because a publisher takes over an existing journal, fundamentally break the underlying model of stable, persistent identifiers.

Thursday, February 03, 2022

Deduplicating bibliographic data

There are several instances where I have a collection of references that I want to deduplicate and merge. For example, in "Zootaxa has no impact factor" I describe a dataset of the literature cited by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4), as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1). Given that the same articles may be cited many times, these datasets have lots of duplicates. Similarly, articles in Wikispecies often have extensive lists of references cited, and the same reference may appear on multiple pages (for an initial attempt to extract these references see https://doi.org/10.5281/zenodo.5801661 and https://github.com/rdmpage/wikispecies-parser).

There are several reasons I want to merge these references. If I want to build a citation graph for Zootaxa or Phytotaxa I need to merge references that are the same so that I can accurately count citations. I am also interested in harvesting the metadata to help find those articles in the Biodiversity Heritage Library (BHL), and the literature cited section of scientific articles is a potential goldmine of bibliographic metadata, as is Wikispecies.

After various experiments and false starts I've created a repository https://github.com/rdmpage/bib-dedup to host a series of PHP scripts to deduplicate bibliographic data. I've settled on using CSL-JSON as the format for bibliographic data. Because deduplication relies on comparing pairs of references, the standard format for most of the scripts is a JSON array containing a pair of CSL-JSON objects to compare. Below are the steps the code takes.

Generating pairs to compare

The first step is to take a list of references and generate the pairs that will be compared. I started with this approach as I wanted to explore machine learning, and wanted a simple format for training data, such as an array of two CSL-JSON objects and an integer flag representing whether the two references were the same or different.

There are various ways to generate CSL-JSON for a reference. I use a tool I wrote (see Citation parsing tool released) that has a simple API where you submit one or more references and it returns those references as structured data in CSL-JSON.

Attempting to do all possible pairwise comparisons rapidly gets impractical as the number of references increases, so we need some way to restrict the number of comparisons we make. One approach I've explored is the "sorted neighbourhood method", where we sort the references (for example, by their title) then move a sliding window down the list of references, comparing all references within that window. This greatly reduces the number of pairwise comparisons. So the first step is to sort the references, then run a sliding window over them, outputting all the pairs in each window (ignoring pairs already compared in a previous window). Other methods of "blocking" could also be used, such as only comparing references from a particular year, or a particular journal.

So, the output of this step is a set of JSON arrays, each with a pair of references in CSL-JSON format. Each array is stored on a single line in the same file in line-delimited JSON (JSONL).
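
As a sketch of this step (not the actual code in the bib-dedup repository; the input and output file names are made up), the following sorts an array of CSL-JSON references by title and writes each pair within a sliding window as one line of JSONL:

<?php
// Sorted neighbourhood method: sort references by title, then output every pair
// of references that fall within a sliding window of size $window_size.
$references = json_decode(file_get_contents('references.json'), true); // array of CSL-JSON objects
$window_size = 5;

// Sort by the CSL-JSON "title" field
usort($references, function ($a, $b) {
    return strcmp(strtolower($a['title'] ?? ''), strtolower($b['title'] ?? ''));
});

$out = fopen('pairs.jsonl', 'w');
$n = count($references);
for ($i = 0; $i < $n; $i++) {
    // Each pair ($i, $j) with $j - $i < $window_size is emitted exactly once,
    // so pairs already seen in an earlier window are not repeated.
    for ($j = $i + 1; $j < min($i + $window_size, $n); $j++) {
        fwrite($out, json_encode([$references[$i], $references[$j]]) . "\n");
    }
}
fclose($out);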

Comparing pairs

The next step is to compare each pair of references and decide whether they are a match or not. Initially I explored a machine learning approach used in the following paper:

Wilson DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex features to improve genealogical record linkage. In: The 2011 International Joint Conference on Neural Networks. 9–14. DOI: 10.1109/IJCNN.2011.6033192

Initial experiments using https://github.com/jtet/Perceptron were promising and I want to play with this further, but I decided to skip this for now and just use simple string comparison. So for each CSL-JSON object I generate a citation string in the same format using CiteProc, then compute the Levenshtein distance between the two strings. By normalising this distance by the length of the strings being compared I can use an arbitrary threshold to decide if the references are the same or not.
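
A rough sketch of that comparison (PHP has a built-in levenshtein() function; the threshold here is arbitrary, and one way to normalise is by the length of the longer string):

<?php
// Decide whether two references match by comparing their formatted citation strings.
// $citation1 and $citation2 would each be produced by formatting a CSL-JSON record
// with CiteProc using the same citation style.
function references_match($citation1, $citation2, $threshold = 0.1) {
    // Note: older versions of PHP limit levenshtein() to strings of 255 characters,
    // so very long citations may need truncating first.
    $distance = levenshtein($citation1, $citation2);

    // Normalise to a value between 0 (identical) and 1 (completely different)
    $normalised = $distance / max(strlen($citation1), strlen($citation2), 1);

    return $normalised <= $threshold;
}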

Clustering

For this step we read the JSONL file produced above and record whether the two references in each pair are a match or not. Assuming each reference has a unique identifier (it need only be unique within the file), we can use those identifiers to record the clusters each reference belongs to. I do this using a disjoint-set data structure. We start with a graph where each node represents a reference and has a pointer to a parent node; initially each reference is its own parent. A simple implementation is an array indexed by reference identifiers, where the value of each cell in the array is that node's parent.

As we discover matching pairs we update the parents of the nodes to reflect this, such that once all the comparisons are done we have one or more clusters corresponding to the references that we think are the same. Another way to think of this is that we are finding the connected components of a graph where each node is a reference and each pair of references that match is connected by an edge.
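
A minimal sketch of that bookkeeping, using an array of parent pointers (the reference identifiers here are made up):

<?php
// Disjoint-set (union-find) over reference identifiers: $parent[$id] is the
// parent of node $id, and initially each node is its own parent.
function find(array &$parent, $id) {
    while ($parent[$id] != $id) {
        $parent[$id] = $parent[$parent[$id]]; // path compression
        $id = $parent[$id];
    }
    return $id;
}

function merge_nodes(array &$parent, $a, $b) {
    $root_a = find($parent, $a);
    $root_b = find($parent, $b);
    if ($root_a != $root_b) {
        $parent[$root_b] = $root_a; // the two clusters become one
    }
}

// Initialise: every reference is its own parent
$ids = ['ref1', 'ref2', 'ref3', 'ref4'];
$parent = [];
foreach ($ids as $id) {
    $parent[$id] = $id;
}

// For each matching pair found in the comparison step, merge the two nodes
merge_nodes($parent, 'ref1', 'ref3');
merge_nodes($parent, 'ref3', 'ref4');

// Group references into clusters by their root
$clusters = [];
foreach ($ids as $id) {
    $clusters[find($parent, $id)][] = $id;
}
print_r($clusters); // ref1 => [ref1, ref3, ref4], ref2 => [ref2]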

In the code I'm using I write this graph in Trivial Graph Format (TGF), which can be visualised using tools such as yEd.

Merging

Now that we have a graph representing the sets of references that we think are the same we need to merge them. This is where things get interesting as the references are similar (by definition) but may differ in some details. The paper below describes a simple Bayesian approach for merging records:

Councill IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: 10.1145/1141753.1141817.

So the next step is to read the graph with the clusters, generate the sets of bibliographic references that correspond to each cluster, then use the method described in Councill et al. to produce a single bibliographic record for that cluster. These records could then be used to, say, locate the corresponding article in BHL, or populate Wikidata with missing references.
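
The Councill et al. method weighs the evidence for each candidate value; purely as a much cruder placeholder (not the Bayesian approach described in the paper), a majority vote per field over the members of a cluster gives a flavour of what merging involves:

<?php
// Crude merge: for each CSL-JSON field, keep the most frequent value across the
// cluster. A stand-in for a proper evidence-weighted merge of the records.
function merge_cluster(array $cluster) {
    $votes = [];
    foreach ($cluster as $reference) {
        foreach ($reference as $field => $value) {
            $key = json_encode($value);
            $votes[$field][$key] = ($votes[$field][$key] ?? 0) + 1;
        }
    }
    $merged = [];
    foreach ($votes as $field => $counts) {
        arsort($counts); // most common value first
        $merged[$field] = json_decode(array_key_first($counts), true);
    }
    return $merged;
}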

Obviously there is always the potential for errors, such as trying to merge references that are not the same. As a quick and dirty check I flag as dubious any cluster where the page numbers vary among members of the cluster. More sophisticated checks are possible, especially if I go down the ML route (i.e., I would have evidence for the probability that the same reference can disagree on some aspects of metadata).

Summary

At this stage the code is working well enough for me to play with and explore some example datasets. The focus is on structured bibliographic metadata, but I may simplify things and have a version that handles simple string matching, for example to cluster together different abbreviations of the same journal name.

Sunday, January 02, 2022

Large graph viewer experiments

I keep returning to the problem of viewing large graphs and trees, which means my hard drive has accumulated lots of failed prototypes. Inspired by some recent discussions on comparing taxonomic classifications I decided to package one of these (wildly incomplete) prototypes up so that I can document the idea and put the code somewhere safe.

Google Maps-like viewer

I've created a simple viewer that uses a tiled map viewer (like Google Maps) to display a large graph. The idea is to draw the entire graph scaled to a 256 x 256 pixel tile. The graph is stored in a database that supports geospatial queries, which means the queries to retrieve the individual tiles needed to display the graph at different levels of resolution are simply bounding box queries to the database. I realise that this description is cryptic at best. The GitHub repository https://github.com/rdmpage/gml-viewer has more details and the code itself. There's a lot to do, especially adding support for labels(!), which presents some interesting challenges (levels of detail and generalization). The code doesn't do any layout of the graph itself; instead I've used the yEd tool to compute the x,y coordinates of the graph.
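
To make that slightly less cryptic, here's a sketch (not the code in the repository) of how a tile request might translate into a bounding box query, assuming the whole graph layout has been scaled to fit a 256 x 256 square at zoom level 0:

<?php
// For tile (zoom, x, y) return the bounding box in graph coordinate space,
// assuming the entire graph layout fits in [0,256] x [0,256] at zoom 0.
function tile_to_bbox($zoom, $x, $y) {
    $tiles_per_side = pow(2, $zoom);   // zoom 0 = 1 tile, zoom 1 = 2 x 2 tiles, ...
    $tile_size = 256 / $tiles_per_side;
    return [
        'minx' => $x * $tile_size,
        'miny' => $y * $tile_size,
        'maxx' => ($x + 1) * $tile_size,
        'maxy' => ($y + 1) * $tile_size
    ];
}

// The box then becomes a spatial query against the database, e.g. (pseudo-SQL):
// SELECT * FROM edges WHERE geometry INTERSECTS BBOX(minx, miny, maxx, maxy);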

Since this exercise was inspired by a discussion of the ASM Mammal Diversity Database, the graph I've used for the demonstration above is the ASM classification of extant mammals. I guess I need to solve the labelling issue fairly quickly!

Monday, December 20, 2021

GraphQL for WikiData (WikiCite)

I've released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint is for a subset of the entities that are of interest to WikiCite, such as scholarly articles, people, and journals. There is a crude demo at https://wikicite-graphql.herokuapp.com. The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php. There are various ways to interact with the endpoint; personally I like the Altair GraphQL Client by Samuel Imolorhe.

As I've mentioned earlier, it's taken me a while to see the point of GraphQL. But it is clear it is gaining traction in the biodiversity world (see for example the GBIF Hosted Portals) so it's worth exploring. My take on GraphQL is that it is a way to create a self-describing API that someone developing a web site can use without having to bury themselves in the gory details of how the data is internally modelled. For example, WikiData's query interface uses SPARQL, a powerful language that has a steep learning curve (in part because of the administrative overhead brought by RDF namespaces, etc.). In my previous SPARQL-based projects such as Ozymandias and ALEC I have either returned SPARQL results directly (Ozymandias) or formatted SPARQL results as schema.org DataFeeds (equivalent to RSS feeds) (ALEC). Both approaches work, but they are project-specific, and if anyone else tried to build on these projects they might struggle to figure out what was going on. I certainly struggle, and I wrote them!

So it seems worthwhile to explore this approach a little further and see if I can develop a GraphQL interface that can be used to build the sort of rich apps that I want to see. The demo I've created uses SPARQL under the hood to provide responses to the GraphQL queries. So in this sense it's not replacing SPARQL, it's simply providing a (hopefully) simpler overlay on top of SPARQL so that we can retrieve the data we want without having to learn the intricacies of SPARQL, nor how Wikidata models publications and people.
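
To give a flavour of what this looks like to a client, here is a sketch of POSTing a query to the endpoint; the query shape and field names are invented for illustration and are not necessarily those in the demo's schema:

<?php
// POST a GraphQL query and get plain JSON back. The fields requested here
// (work, title, journal, authors) are hypothetical, for illustration only.
$query = '{ work(id: "Q98715368") { title journal authors { name orcid } } }';

$ch = curl_init('https://wikicite-graphql.herokuapp.com/gql.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode(['query' => $query]));
$response = curl_exec($ch);
curl_close($ch);

echo $response; // JSON whose shape mirrors the query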

Saturday, December 11, 2021

The Business of Extracting Knowledge from Academic Publications

Markus Strasser (@mkstra) wrote a fascinating article entitled "The Business of Extracting Knowledge from Academic Publications".

His TL;DR:

TL;DR: I worked on biomedical literature search, discovery and recommender web applications for many months and concluded that extracting, structuring or synthesizing "insights" from academic publications (papers) or building knowledge bases from a domain corpus of literature has negligible value in industry.

Close to nothing of what makes science actually work is published as text on the web.

After recounting the many problems of knowledge extraction - including a swipe at nanopubs which "are ... dead in my view (without admitting it)" - he concludes:

I’ve been flirting with this entire cluster of ideas including open source web annotation, semantic search and semantic web, public knowledge graphs, nano-publications, knowledge maps, interoperable protocols and structured data, serendipitous discovery apps, knowledge organization, communal sense making and academic literature/publishing toolchains for a few years on and off ... nothing of it will go anywhere.

Don’t take that as a challenge. Take it as a red flag and run. Run towards better problems.

Well worth a read, and much food for thought.

Tuesday, November 23, 2021

Revisiting RSS to monitor the latest taxonomic research

Over a decade ago RSS (RDF Site Summary or Really Simple Syndication) was attracting a lot of interest as a way to integrate data across various websites. Many science publishers would provide a list of their latest articles in XML in one of three flavours of RSS (RDF, RSS, Atom). This led to tools such as uBioRSS [1] and my own e-Biosphere Challenge: visualising biodiversity digitisation in real time. It was a time of enthusiasm for aggregating lots of data, such as the ill-fated PLoS Biodiversity Hub [2].

Since I seem to be condemned to revisit old ideas rather than come up with anything new, I've been looking at providing a tool like the now defunct uBioRSS. The idea is to harvest RSS feeds from journals (with an emphasis on taxonomic and systematic journals), aggregate the results, and make them browsable by taxon and geography. Here's a sneak peek:

What seems like a straightforward task quickly became a bit of a challenge. Not all journals have RSS feeds (they seem to have become less widely supported over time), so I need to think of alternative ways to get lists of recent articles. These lists also need to be processed in various ways. There are three versions of RSS, each with their own idiosyncrasies, so I need to standardise things like dates. I also want to augment them with things like DOIs (often missing from RSS feeds) and thumbnails for the articles (often available on publisher websites but not in the feeds). Then I need to index the content by taxon and geography. For taxa I use a version of Patrick Leary's "taxonfinder" (see https://right-frill.glitch.me) to find names, then the Global Names Index to assign the names found to the GBIF taxonomic hierarchy.
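
For the name-matching step, one option (shown here with GBIF's species match API rather than the Global Names Index route mentioned above) is to send each name string to GBIF and read off its position in the GBIF backbone:

<?php
// Match a name string found by taxonfinder against the GBIF backbone taxonomy.
function gbif_match($name) {
    $url = 'https://api.gbif.org/v1/species/match?name=' . urlencode($name);
    return json_decode(file_get_contents($url), true);
}

$result = gbif_match('Simocephalus serrulatus');
// Useful fields include usageKey (the GBIF taxon key), matchType, and the
// higher classification (kingdom, phylum, class, order, family, genus).
echo ($result['usageKey'] ?? 'no match') . ' ' . ($result['matchType'] ?? '') . "\n";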

Indexing by geography proved harder. Typically geoparsing involves taking a body of text and doing the following:

  • Using named-entity recognition (NER) to identify named entities in the text (e.g., place names, people's names, etc.).
  • Using a gazetteer of geographic names (such as GeoNames) to try to match the place names found by NER.

An example of such a parser is the Edinburgh Geoparser. Typically geoparsing software can be large and tricky to install, especially if you are looking to make your installation publicly accessible. Geoparsing services seem to have a short half-life (e.g., Geoparser.io), perhaps because they are so useful they quickly get swamped by users.

Bearing this in mind, the approach I’ve taken here is to create a very simple geoparser that is focussed on fairly large areas, especially those relevant to biodiversity, and is aimed at geoparsing text such as abstracts of scientific papers. I've created a small database of places by harvesting data from Wikidata, then I use the "flash text" algorithm [3] to find geographic places. This approach uses a trie to store the place names. All I do is walk through the text seeing whether the current word matches a place name (or the start of one) in the trie, then moving on. This is very quick and seems to work quite well.
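
A stripped-down sketch of that trie-based matching (much simplified compared to the full FlashText algorithm, and with a toy list of place names):

<?php
// Build a trie of place names (each name is a sequence of lowercase words),
// then scan the text word by word, keeping the longest match at each position.
function build_trie(array $place_names) {
    $trie = [];
    foreach ($place_names as $name) {
        $node = &$trie;
        foreach (explode(' ', strtolower($name)) as $word) {
            if (!isset($node[$word])) {
                $node[$word] = [];
            }
            $node = &$node[$word];
        }
        $node['#'] = $name; // marks the end of a complete place name
        unset($node);       // break the reference before the next name
    }
    return $trie;
}

function find_places(array $trie, $text) {
    $words = preg_split('/\W+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $found = [];
    $n = count($words);
    for ($i = 0; $i < $n; $i++) {
        $node = $trie;
        $j = $i;
        $match = null;
        while ($j < $n && isset($node[$words[$j]])) {
            $node = $node[$words[$j]];
            if (isset($node['#'])) {
                $match = $node['#']; // remember the longest match so far
            }
            $j++;
        }
        if ($match !== null) {
            $found[] = $match;
        }
    }
    return $found;
}

$trie = build_trie(['China', 'New Caledonia', 'Gulf of Siam']);
print_r(find_places($trie, 'New records of spiders from New Caledonia and China'));
// ['New Caledonia', 'China']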

Given that I need to aggregate data from a lot of sources, apply various transformations to that data, then merge it, there are a lot of moving parts. I started playing with a "NoCode" platform for creating workflows, in this case n8n (in many ways reminiscent of the now defunct Yahoo Pipes). This was quite fun for a while, but after lots of experimentation I moved back to writing code to aggregate the data into a CouchDB database. CouchDB is one of the NoSQL databases that I really like as it has a great interface, and makes queries very easy to do once you get your head around how it works.

So the end result of this is "BioRSS" https://biorss.herokuapp.com. The interface comprises a stream of articles listed from newest to oldest, with a treemap and a geographic map on the left. You can use these to filter the articles by taxonomic group and/or country. For example, the screen shot is showing arthropods from China (in this case from a month or two ago in the journal ZooKeys). As much fun as the interface has been to construct, in many ways I don't really want to spend time making an interface. For each combination of taxon and country I provide an RSS feed, so if you have a favourite feed reader you can grab the feed and view it there. As BioRSS updates the data your feed reader should automatically update the feed. This means that you can have a feed that monitors, say, new papers on spiders in China.

In the spirit of "release early and release often" this is an early version of this app. I need to add a lot more feeds, back date them to bring in older content, and I also want to make use of aggregators such as PubMed, CrossRef, and Google Scholar. The existence of these tools is, I suspect, one reason why RSS feeds are less common than they used to be.

So, if this sounds useful please take it for a spin at https://biorss.herokuapp.com. Feedback is welcome, especially suggestions for journals to harvest and add to the news feed. Ultimately I'd like to have sufficient coverage of the taxonomic literature so that BioRSS becomes a place where we can go to find the latest papers on any taxon of interest.

References

1. Patrick R. Leary, David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109
2. Mindell, D. P., Fisher, B. L., Roopnarine, P., Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating, Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. doi:10.1371/journal.pone.0019491
3. Singh, V. (2017). Replace or Retrieve Keywords In Documents at Scale. CoRR, abs/1711.00046. http://arxiv.org/abs/1711.00046

Monday, October 25, 2021

Problems with Plazi parsing: how reliable are automated methods for extracting specimens from the literature?

The Plazi project has become one of the major contributors to GBIF, with some 36,000 datasets yielding some 500,000 occurrences (see Plazi's GBIF page for details). These occurrences are extracted from taxonomic publications using automated methods. New data is published almost daily (see latest treatments). The map below shows the geographic distribution of material citations provided to GBIF by Plazi, which gives you a sense of the size of the dataset.

By any metric Plazi represents a considerable achievement. But often when I browse individual records on Plazi I find records that seem clearly incorrect. Text mining the literature is a challenging problem, but at the moment Plazi seems something of a "black box": PDFs go in, the content is mined, and data comes out to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does there seem to be an evaluation of how accurate this text mining actually is. Anecdotally it seems to work well in some cases, but in others it produces what can only be described as bogus records.

Finding errors

A treatment in Plazi is a block of text (and sometimes illustrations) that refers to a single taxon. Often that text will include a description of the taxon, and list one or more specimens that have been examined. These lists of specimens ("material citations") are one of the key bits of information that Plazi extracts from a treatment, as these citations get fed into GBIF as occurrences.

To help explore treatments I've constructed a simple web site that takes the Plazi identifier for a treatment and displays that treatment with the material citations highlighted. For example, for the Plazi treatment 03B5A943FFBB6F02FE27EC94FABEEAE7 you can view the marked up version at https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228. Below is an example of a material citation with its component parts tagged:

This is an example where Plazi has successfully parsed the specimen. But I keep coming across cases where specimens have not been parsed correctly, resulting in issues such as single specimens being split into multiple records (e.g., https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496), geographical coordinates being misinterpreted (e.g., https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9), or collector's initials being confused with codes for natural history collections (e.g., https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E).

Parsing specimens is a hard problem, so it's not unexpected to find errors. But they do seem common enough to be easily found, which raises the question: just what percentage of these material citations are correct? How much of the data Plazi feeds to GBIF is correct? How would we know?

Systemic problems

Some of the errors I've found concern the interpretation of the parsed data. For example, it is striking that despite including marine taxa, no Plazi record has a value for depth below sea level (see GBIF search on depth range 0-9999 for Plazi). But many records do have an elevation, including records from marine environments. Any record that has a depth value is interpreted by Plazi as an elevation, so we have aerial crustacea and fish.

Map of Plazi records with depth 0-9999m

Map of Plazi records with elevation 0-9999m

Anecdotally I've also noticed that Plazi seems to do well on zoological data, especially journals like Zootaxa, but it often struggles with botanical specimens. Botanists tend to cite specimens rather differently to zoologists (botanists emphasise collector numbers rather than specimen codes). Hence data quality in Plazi is likely to be taxonomically biased.

Plazi is using GitHub to track issues with treatments, so feedback on erroneous records is possible, but this seems inadequate to the task. There are tens of thousands of data sets, with more being released daily, and hundreds of thousands of occurrences, and relying on GitHub issues devolves the responsibility for error checking onto the data users. I don't have a measure of how many records in Plazi have problems, but I suspect it is a significant fraction, because for any given day's output I can typically find errors.

What to do?

Faced with a process that generates noisy data there are several things we could do:

  1. Have tools to detect and flag errors made in generating the data.
  2. Have the data generator estimate the confidence of its results.
  3. Improve the data generator.

I think a comparison with the problem of parsing bibliographic references might be instructive here. There is a long history of people developing tools to parse references (I've even had a go). State-of-the-art tools such as AnyStyle feature machine learning, and are tested against human-curated datasets of tagged bibliographic records. This means we can evaluate the performance of a method (how well does it retrieve the same results as human experts?) and also improve the method by expanding the corpus of training data. Some of these tools can provide a measure of how confident they are when classifying a string as, say, a person's name, which means we could flag potential issues for anyone wanting to use that record.

We don't have equivalent tools for parsing specimens in the literature, and hence have no easy way to quantify how good existing methods are, nor do we have a public corpus of material citations that we can use as training data. I blogged about this a few months ago and was considering using Plazi as a source of marked up specimen data to use for training. However based on what I've looked at so far Plazi's data would need to be carefully scrutinised before it could be used as training data.

Going forward, I think it would be desirable to have a set of records that can be used to benchmark specimen parsers, and ideally have the parsers themselves available as web services so that anyone can evaluate them. Even better would be a way to contribute to the training data so that these tools improve over time.

Plazi's data extraction tools are mostly desktop-based, that is, you need to download software to use their methods. However, there are experimental web services available as well. I've created a simple wrapper around the material citation parser, you can try it at https://plazi-tester.herokuapp.com/parser.php. It takes a single material citation and returns a version with elements such as specimen code and collector name tagged in different colours.

Summary

Text mining the taxonomic literature is clearly a gold mine of data, but at the same time it is potentially fraught as we try and extract structured data from semi-structured text. Plazi has demonstrated that it is possible to extract a lot of data from the literature, but at the same time the quality of that data seems highly variable. Even minor issues in parsing text can have big implications for data quality (e.g., marine organisms apparently living above sea level). Historically in biodiversity informatics we have favoured data quantity over data quality. Quantity has an obvious metric, and has milestones we can celebrate (e.g., one billion specimens). There aren't really any equivalent metrics for data quality.

Adding new types of data can sometimes initially result in a new set of quality issues (e.g., GBIF metagenomics and metacrap) that take time to resolve. In the case of Plazi, I think it would be worthwhile to quantify just how many records have errors, and develop benchmarks that we can use to test methods for extracting specimen data from text. If we don't do this then there will remain uncertainty as to how much trust we can place in data mined from the taxonomic literature.

Update

Plazi has responded, see Liberating material citations as a first step to more better data. My reading of their response is that it essentially just reiterates Plazi's approach and doesn't tackle the underlying issue: their method for extracting material citations is error prone, and many of those errors end up in GBIF.

Thursday, October 07, 2021

Reflections on "The Macroscope" - a tool for the 21st Century?

This is a guest post by Tony Rees.

It would be difficult to encounter a scientist, or anyone interested in science, who is not familiar with the microscope, a tool for making objects visible that are otherwise too small to be properly seen by the unaided eye, or to reveal otherwise invisible fine detail in larger objects. A select few with a particular interest in microscopy may also have encountered the Wild-Leica "Macroscope", a specialised type of benchtop microscope optimised for low-power macro-photography. However in this overview I discuss the "Macroscope" in a different sense, which is that of the antithesis to the microscope: namely a method for visualizing subjects too large to be encompassed by a single field of vision, such as the Earth or some subset of its phenomena (the biosphere, for example), or conceptually, the universe.

My introduction to the term was via addresses given by Jesse Ausubel in the formative years of the 2001-2010 Census of Marine Life, for which he was a key proponent. In Ausubel's view, the Census would perform the function of a macroscope, permitting a view of everything that lives in the global ocean (or at least, that subset which could realistically be sampled in the time frame available) as opposed to more limited subsets available via previous data collection efforts. My view (which could, of course, be wrong) was that his thinking had been informed by a work entitled "Le macroscope, vers une vision globale" published in 1975 by the French thinker Joël de Rosnay, who had expressed such a concept as being globally applicable in many fields, including the physical and natural worlds but also extending to human society, the growth of cities, and more. Yet again, some ecologists may also have encountered the term, sometimes in the guise of "Odum's macroscope", as an approach for obtaining "big picture" analyses of macroecological processes suitable for mathematical modelling, typically by elimination of fine detail so that only the larger patterns remain, as initially advocated by Howard T. Odum in his 1971 book "Environment, Power, and Society".

From the standpoint of the 21st century, it seems that we are closer to achieving a "macroscope" (or possibly, multiple such tools) than ever before, based on the availability of existing and continuing new data streams, improved technology for data assembly and storage, and advanced ways to query and combine these large streams of data to produce new visualizations, data products, and analytical findings. I devote the remainder of this article to examples where either particular workers have employed "macroscope" terminology to describe their activities, or where potentially equivalent actions are taking place without the explicit "macroscope" association, but are equally worthy of consideration. To save space here, references cited here (most or all) can be found via a Wikipedia article entitled "Macroscope (science concept)" that I authored on the subject around a year ago, and have continued to add to on occasion as new thoughts or information come to hand (see edit history for the article).

First, one can ask, what constitutes a macroscope, in the present context? In the Wikipedia article I point to a book "Big Data - Related Technologies, Challenges and Future Prospects" by Chen et al. (2014) (doi:10.1007/978-3-319-06245-7), in which the "value chain of big data" is characterised as divisible into four phases, namely data generation, data acquisition (aka data assembly), data storage, and data analysis. To my mind, data generation (which others may term acquisition, differently from the usage by Chen et al.) is obviously the first step, but does not in itself constitute the macroscope, except in rare cases - such as Landsat imagery, perhaps - where on its own, a single co-ordinated data stream is sufficient to meet the need for a particular type of "global view". A variant of this might be a coordinated data collection program - such as that of the ten year Census of Marine Life - which might produce the data required for the desired global view; but again, in reality, such data are collected in a series of discrete chunks, in many and often disparate data formats, and must be "wrangled" into a more coherent whole before any meaningful "macroscope" functionality becomes available.

Here we come to what, in my view, constitutes the heart of the "macroscope": an intelligently organized (i.e. indexable and searchable), coherent data store or repository (where "data" may include imagery and other non numeric data forms, but much else besides). Taking the Census of Marine Life example, the data repository for that project's data (plus other available sources as inputs) is the Ocean Biodiversity Information System or OBIS (previously the Ocean Biogeographic Information System), which according to this view forms the "macroscope" for which the Census data is a feed. (For non habitat-specific biodiversity data, GBIF is an equivalent, and more extensive, operation). Other planetary scale "macroscopes", by this definition (which may or may not have an explicit geographic, i.e. spatial, component) would include inventories of biological taxa such as the Catalogue of Life and so on, all the way back to the pioneering compendia published by Linnaeus in the eighteenth century; while for cartography and topographic imagery, the current "blockbuster" of Google Earth and its predecessors also come well into public consciousness.

In the view of some workers and/or operations, both of these phases are precursors to the real "work" of the macroscope which is to reveal previously unseen portions of the "big picture" by means either of the availability of large, synoptic datasets, or fusion between different data streams to produce novel insights. Companies such as IBM and Microsoft have used phraseology such as:

"By 2022 we will use machine-learning algorithms and software to help us organize information about the physical world, helping bring the vast and complex data gathered by billions of devices within the range of our vision and understanding. We call this a "macroscope" – but unlike the microscope to see the very small, or the telescope that can see far away, it is a system of software and algorithms to bring all of Earth's complex data together to analyze it by space and time for meaning." (IBM)
"As the Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors, we will gain a better understanding of our environment via a virtual, distributed whole-Earth "macroscope"... Massive-scale data analytics will enable real-time tracking of disease and targeted responses to potential pandemics. Our virtual "macroscope" can now be used on ourselves, as well as on our planet." (Microsoft) (references available via the Wikipedia article cited above).

Whether or not the analytical capabilities described here are viewed as being an integral part of the "macroscope" concept, or are maybe an add-on, is ultimately a question of semantics and perhaps, personal opinion. Continuing the Census of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization and summary tools, but also makes its data available for download to users wishing to analyse it further according to their own particular interests; using OBIS data in this manner, Mark Costello et al. in 2017 were able to demarcate a finite number of data-supported marine biogeographic realms for the first time (Costello et al. 2017: Nature Communications. 8: 1057. doi:10.1038/s41467-017-01121-2), a project which I was able to assist in a small way in an advisory capacity. In a case such as this, perhaps the final function of the macroscope, namely data visualization and analysis, was outsourced to the authors' own research institution. Similarly at an earlier phase, "data aggregation" can also be virtual rather than actual, i.e. avoiding using a single physical system to hold all the data, enabled by open web mapping standards WMS (web map service) and WFS (web feature service) to access a set of distributed data stores, e.g. as implemented on the portal for the Australian Ocean Data Network.

So, as we pass through the third decade of the twenty-first century, what developments await us in the "macroscope" area? In the biodiversity space, one can reasonably presume that the existing "macroscopic" data assembly projects such as OBIS and GBIF will continue, and hopefully slowly fill current gaps in their coverage - although in the marine area, strategic new data collection exercises may be required (Census 2020, or 2025, anyone?), while (again hopefully) the Catalogue of Life will continue its progress towards a "complete" species inventory for the biosphere. The Landsat project, with imagery dating back to 1972, continues with the launch of its latest satellite Landsat 9 just this year (21 September 2021) with a planned mission duration for the next 5 years, so the "macroscope" functionality of that project seems set to continue for the medium term at least. Meanwhile the ongoing development of sensor networks, both on land and in the ocean, offers an exciting new method of "instrumenting the earth" to obtain much more real time data than has ever been available in the past, offering scope for many more, use case-specific "macroscopes" to be constructed that can fuse (e.g.) satellite imagery with much more that is happening at a local level.

So, the "macroscope" concept appears to be alive and well, even though the nomenclature can change from time to time (IBM's "Macroscope", foreshadowed in 2017, became the "IBM Pairs Geoscope" on implementation, and is now simply the "Geospatial Analytics component within the IBM Environmental Intelligence Suite" according to available IBM publicity materials). In reality this illustrates a new dichotomy: even if "everyone" in principle has access to huge quantities of publicly available data, maybe only a few well funded entities now have the computational ability to make sense of it, and can charge clients a good fee for their services...

I present this account partly to give a brief picture of "macroscope" concepts today and in the past, for those who may be interested, and partly to present a few personal views which would be out of scope in a "neutral point of view" article such as is required on Wikipedia; also to see if readers of this blog would like to contribute further to discussion of any of the concepts traversed herein.

Friday, August 27, 2021

JSON-LD in the wild: examples of how structured data is represented on the web

I've created a GitHub repository so that I can keep track of the examples of JSON-LD that I've seen being actively used, for example embedded in web sites, or accessed using an API. The repository is https://github.com/rdmpage/wild-json-ld. The list is by no means exhaustive, I hope to add more examples as I come across them.

One reason for doing this is to learn what others are doing. For example, after looking at SciGraph's JSON-LD I now see how an ordered list can be modelled in RDF in such a way that the list of authors in a JSON-LD document for, say, a scientific paper is correct. By default RDF has no notion of ordered lists, so if you do a SPARQL query to get the authors of a paper, the order of the authors returned in the query will be arbitrary. There are various ways to try and tackle this. In my Ozymandias knowledge graph I used "roles" to represent order (see Figure 2 in the Ozymandias paper). I then used properties of the role to order the list of authors.

Another approach is to use rdf:lists (see RDF lists and SPARQL and Is it possible to get the position of an element in an RDF Collection in SPARQL? for an introduction to lists). SciGraph uses this approach. The value for schema:author is not an author, but a blank node (bnode), and this bnode has two predicates, rdf:first and rdf:rest. One points to an author, the other points to another bnode. This pattern repeats until we encounter a value of rdf:nil for rdf:rest.
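
A minimal sketch of that pattern in JSON-LD (identifiers invented, and much abbreviated compared with a real SciGraph document) looks something like this, where the nested objects without an "@id" are the blank nodes:

{
  "@context": {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "schema": "http://schema.org/"
  },
  "@id": "http://example.org/article",
  "schema:author": {
    "rdf:first": { "@id": "http://example.org/person/1" },
    "rdf:rest": {
      "rdf:first": { "@id": "http://example.org/person/2" },
      "rdf:rest": { "@id": "rdf:nil" }
    }
  }
}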

This introduces some complexity, but the benefit is that the JSON-LD version of the RDF will have the authors in the correct order, and hence any client that is using JSON will be able to treat the array of authors as ordered. Without some means of ordering the client could not make this assumption, hence the first author in the list might not actually be the first author of the paper.

Friday, July 23, 2021

Species Cite: linking scientific names to publications and taxonomists

I've made Species Cite live. This is a web site I've been working on with the GBIF Challenge as a notional deadline so I'll actually get something out the door.

"Species Cite" takes as its inspiration the suggestion that citing original taxonomic descriptions (and subsequent revisions) would increase citation metrics for taxonomists, and give them the credit they deserve. Regardless of the merits of this idea, it is difficult to implement because we don’t have an easy way of discovering which paper we should cite. Species Cite tackles this by combining millions of taxonomic name records linked to LSIDs with bibliographic data from Wikidata to make it easier to cite the sources of taxonomic names. Where possible it provides access to PDFs for articles using Internet Archive, or Unpaywall. These can be displayed in an embedded PDF viewer. Given the original motivation of surfacing the work of taxonomists, Species Cite also attempts to display information about the authors of a taxonomic paper, such as ORCID and/or Wikidata identifiers, and an avatar for the author via either Wikidata or ResearchGate. This enables us to get closer to the kind of social interface found in citizen science projects like iNaturalist where participants are people with personalities, not simply strings of text. Furthermore by identifying people and associating them with taxa it could help us discover who are the experts on particular taxonomic groups, and also enable those people to easily establish that they are, in fact, experts.

How it works

Under the hood there's a lot of moving pieces. The taxonomic names come from a decade or more of scraping LSIDs from various taxonomic databases, primarily ION, IPNI, Index Fungorum, and Nomenclator Zoologicus. Given that these LSIDs are often offline I built caches one and two to make them accessible (see It's been a while...).

The bibliographic data is stored in Wikidata, and I've built an app to explore that data (see Wikidata and the bibliography of life in the time of coronavirus), and also a simple search engine to find things quickly (see Towards a WikiCite search engine). I've also spent way more time than I'd care to admit adding taxonomic literature to Wikidata and the Internet Archive.

The mapping between names and literature is based on work I've done with BioNames and various unpublished projects.

To make things a bit more visually interesting I've used images of taxa from Phylopic, and also harvested images from ResearchGate to supplement the rather limited number of images of taxonomists in Wikidata.

One of the things I've tried to do is avoid making new databases, as those often die from neglect. Hence the use of Wikidata for bibliographic data. The taxonomic data is held in static files in the LSID caches. The mapping between names and publications is a single (large) tab-delimited file that is searched on disk using a crude binary search on a sorted list of taxonomic names. This means you can download the Github repository and be up and running without installing a database. Likewise the LSID caches use static (albeit compressed) files. The only database directly involved is the WikiCite search engine.
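
For what it's worth, the kind of on-disk binary search this relies on can be sketched as follows (assuming a file of lines sorted by name, with the taxonomic name as the first tab-delimited column; the file name is made up):

<?php
// Crude binary search over a large, sorted, tab-delimited file without loading
// it into memory. Each line starts with the taxonomic name, followed by a tab.
function find_name($filename, $name) {
    $handle = fopen($filename, 'r');
    $low = 0;
    $high = filesize($filename);

    // Narrow $low down to just before the region where the name would appear
    while ($high - $low > 1) {
        $mid = intdiv($low + $high, 2);
        fseek($handle, $mid);
        fgets($handle);              // skip the (probably partial) line we landed in
        $line = fgets($handle);      // first complete line after $mid
        if ($line === false || strcmp($name, explode("\t", rtrim($line, "\r\n"))[0]) < 0) {
            $high = $mid;            // target sorts before this point
        } else {
            $low = $mid;             // target sorts at or after this point
        }
    }

    // Scan the line or two around $low for an exact match
    fseek($handle, $low);
    if ($low > 0) {
        fgets($handle);              // skip partial line
    }
    while (($line = fgets($handle)) !== false) {
        $key = explode("\t", rtrim($line, "\r\n"))[0];
        $cmp = strcmp($key, $name);
        if ($cmp == 0) {
            fclose($handle);
            return $line;            // the matching row
        }
        if ($cmp > 0) {
            break;                   // gone past where the name would be
        }
    }
    fclose($handle);
    return null;
}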

Once all the moving bits come together, you can start to display things like a plant species together with its original description and the taxonomists who described that species (Garcinia nuntasaenii):

What's next

There is still so much to do. I need to add lots of taxonomic literature from my own BioStor project and other sources, and the bibliographic data in Wikidata needs constant tending and improving (which is happening, see Preprint on Wikidata and the bibliography of life). And at some point I need to think about how to get the links between names and literature into Wikidata.

Anyway, the web site is live at https://species-cite.herokuapp.com.

Update

I've created a 5 min screencast walking you through the site.

Thursday, July 22, 2021

Towards a WikiCite search engine

I've released a simple search engine for publications in Wikidata. Wikicite Search takes its name from the WikiCite project, which was an initiative to create a bibliographic database in Wikidata. Since bibliographic data is a core component of taxonomic research (arguably taxonomy is mostly tracing the fate of the "tags" we call taxonomic names) I've spent some time getting taxonomic literature into Wikidata. Since there are bots already adding articles by harvesting sources such as CrossRef and PubMed, I've focussed on literature that is harder to add, such as articles with non-CrossRef DOIs, or those without DOIs at all.

Once you have a big database, you are then faced with the challenge of finding things in that database. Wikidata supports generic search, but I wanted something more closely geared to bibliographic data, hence Wikicite Search. Over the last few years I've made several attempts at a bibliographic search engine; for this project I've finally settled on some basic ideas:

  1. The core data structure is CSL-JSON, a simple but rich JSON format for expressing bibliographic data.
  2. The search engine is Elasticsearch. The documents I upload include the CSL-JSON for an article, but also a simple text representation of the article. This text representation may include multiple languages if, for example, the article has a title in more than one language. This means that if an article has both English and Chinese titles you can find it searching in either language (see the sketch after this list).
  3. The web interface is very simple: search for a term, get results. If the search term is a Wikidata identifier you get just the corresponding article, e.g. Q98715368.
  4. There is a reconciliation API to help match articles to the database. Paste in one citation per line and you get back matches (if found) for each citation.
  5. Where possible I display a link to a PDF of the article, which is typically stored in the Internet Archive or accessible via the Wayback Machine.
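
As an example of item 2, each document sent to Elasticsearch pairs the CSL-JSON with a flattened text field used for matching; the index name, host, identifier, and titles below are made up for illustration, and the real site may structure its documents differently:

<?php
// Index one article: the document stores the CSL-JSON record plus a simple
// "text" field combining titles in every available language, so a search in
// either language will find the article. All values here are placeholders.
$id = 'Q12345678';
$doc = [
    'csl' => [
        'id'              => $id,
        'type'            => 'article-journal',
        'title'           => 'An example article title',
        'container-title' => 'Example Journal',
        'issued'          => ['date-parts' => [[2020]]]
    ],
    'text' => 'An example article title 示例文章标题 Example Journal 2020'
];

$ch = curl_init('http://localhost:9200/wikicite/_doc/' . $id);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($doc, JSON_UNESCAPED_UNICODE));
echo curl_exec($ch);
curl_close($ch);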

There are millions of publications in Wikidata, currently less than half a million are in my search engine. My focus is narrowly on eukaryote taxonomy and related topics. I will be adding more articles as time permits. I also periodically reload existing articles to capture updates to the metadata made by the Wikidata community - being a wiki the data in Wikidata is constantly evolving.

My goal is to have a simple search tool that focusses on matching citation strings. In other words, it is designed to find a reference you are looking for, rather than be a tool to search the taxonomic literature. If that sounds contradictory, consider that my tool will only find a paper about a taxon if it is explicitly named in the title. A more sophisticated search engine would support things like synonym resolution, etc.

The other reason I built this is to provide an API for accessing Wikidata items and displaying them in other formats. For example, an article in the WikiCite search engine can be retrieved in CSL-JSON format, or in RDF as JSON-LD.

As always, it's very early days. But I don't think it's unreasonable to imagine that as Wikidata grows we could envisage having a search engine that includes the bulk of the taxonomic literature.

Citation parsing tool released

Quick note on a tool I've been working on to parse citations, that is, to take a series of strings such as:

  • Möllendorff O (1894) On a collection of land-shells from the Samui Islands, Gulf of Siam. Proceedings of the Zoological Society of London, 1894: 146–156.
  • de Morgan J (1885) Mollusques terrestres & fluviatiles du royaume de Pérak et des pays voisins (Presqúile Malaise). Bulletin de la Société Zoologique de France, 10: 353–249.
  • Morlet L (1889) Catalogue des coquilles recueillies, par M. Pavie dans le Cambodge et le Royaume de Siam, et description ďespèces nouvelles (1). Journal de Conchyliologie, 37: 121–199.
  • Naggs F (1997) William Benson and the early study of land snails in British India and Ceylon. Archives of Natural History, 24:37–88.

and return structured data. This is an old problem, and pretty much a "solved" problem. See for example AnyStyle. I've played with AnyStyle and it's great, but I had to install it on my computer rather than simply use it as a web service. I also wanted to explore the approach a bit more as a possible model for finding citations of specimens.

After trying to install the underlying conditional random fields (CRF) engine used by AnyStyle and running into a bunch of errors, I switched to a tool I could get working, namely CRF++. After figuring out how to compile a C++ application to run on Heroku I started to wonder how to use this as the basis of a citation parser. Fortunately, I had used the Perl-based ParsCit years ago, and managed to convert the relevant bits to PHP and build a simple web service around it.

Although I've abandoned the Ruby-based AnyStyle I do use AnyStyle's XML format for the training data. I also built a crude editor to create small training data sets that uses a technique published by the author of the blogging tool I'm using to write this post (see MarsEdit Live Source Preview). Typically I use this to correctly annotate examples where the parser failed. Over time I add these to the training data and the performance gets better.

This is pretty much a side project of a side project, but ultimately the goal is to employ it to help extract citation data from publications, both to generate data to populate BioStor, and also to start to flesh out the citation graph for publications in Wikidata.

If you want to play with the tool it is at https://citation-parser.herokuapp.com. At the moment it takes some citation strings and returns the result in CSL-JSON, which is becoming the default way to represent structured bibliographic data. Code is on GitHub.
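
A sketch of calling the service programmatically; note that the endpoint path and parameter name here are guesses (check the GitHub repository for the actual API), and the citation is one of the examples above:

<?php
// Send a citation string to the parser and get CSL-JSON back.
// The path ("/api.php") and parameter name ("citation") are assumptions.
$citation = 'Naggs F (1997) William Benson and the early study of land snails in British India and Ceylon. Archives of Natural History, 24:37-88.';

$ch = curl_init('https://citation-parser.herokuapp.com/api.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['citation' => $citation]));
$csl_json = curl_exec($ch);
curl_close($ch);

echo $csl_json;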

Tuesday, June 15, 2021

Compiling a C++ application to run on Heroku

TL;DR Use a buildpack and set "LDFLAGS=--static" --disable-shared

I use Heroku to host most of my websites, and since I mostly use PHP for web development this has worked fine. However, every so often I write an app that calls an external program written in, say, C++. Up until now I've had to host these apps on my own web servers. Today I finally bit the bullet and learned how to add a C++ program to a Heroku-hosted site.

In this case I wanted to add CRF++ to an app for parsing citations. I'd read on Stack Overflow that you could simply log into your Heroku instance using

heroku run bash
and compile the code there. I tried that for CRF++ but got a load of g++ errors, culminating in:

configure: error: Your compiler is not powerful enough to compile CRF++.

Turns out that the g++ compiler is only available at build time, that is, when the Heroku instance is being built before it is deployed. Once it is deployed g++ is no longer available (I'm assuming because Heroku tries to keep each running instance as small as possible).

So, next I tried using a buildpack, specifically felkr/heroku-buildpack-cpp. I forked this buildpack, and added it to my Heroku app (using the "Settings" tab). I put the source code for CRF++ into the root folder of the GitHub repository for the app (which makes things messy but this is where the buildpack looks for either Makefile or configure) then when the app is deployed CRF++ is compiled. Yay! Update: with a couple of tweaks I moved all the code into a folder called src and now things are a bit tidier.

Not so fast, I then did

heroku run bash
and tried running the executable:

heroku run bash -a <my app name>
./crf_learn
/app/.libs/crf_learn: error while loading shared libraries: libcrfpp.so.0: cannot open shared object file: No such file or directory

For whatever reason the executable is looking for a shared library which doesn’t exist (this brought back many painful memories of dealing with C++ compilers on Macs, Windows, and Linux back in the day). To fix this I edited the buildpack compile script to set the "LDFLAGS=--static" --disable-shared flags for configure. This tells the compiler to build static versions of the libraries and executable. Redeploying the app once again everything now worked!

The actual website itself is a mess at the moment so I won't share the link just yet. Update: see Citation parsing tool released for details. But it's great to know that I can have both a C++ executable and a PHP script hosted together without (too much) pain. As always, Google and Stack Overflow are your friends.

Friday, June 04, 2021

Thoughts on BHL, ALA, GBIF, and Plazi

If you compare the impact that BHL and Plazi have on GBIF, it's clear that BHL is almost invisible. Plazi has successfully carved out a niche where it generates tens of thousands of datasets from text mining the taxonomic literature, whereas BHL is a participant in name only. It's not as if BHL lacks geographic data. I recently added back a map display in BioStor where each dot is a pair of latitude and longitude coordinates mentioned in an article derived from BHL's scans.

This data has the potential to fill in gaps in our knowledge of species distributions. For example, the Atlas of Living Australia (ALA) shows the following map for the cladoceran (water flea) Simocephalus:

Compare this to the localities mentioned in just one paper on this genus:

Timms, B. V. (1989). Simocephalus Schoedler (Cladocera: Daphniidae) in tropical Australia. The Beagle, 6, 89–96. Retrieved from https://biostor.org/reference/241776

There are records in this paper for species that currently have no records at all in ALA (e.g., Simocephalus serrulatus):

As it stands BioStor simply extracts localities, it doesn't extract the full "material citation" from the text (that is, the specimen code, date collected, locality, etc. for each occurrence). If it did, it would then be in a position to contribute a large amount of data to ALA and GBIF (and elsewhere). Not only that, if it followed the Plazi model this contribution would be measurable (for example, in terms of numbers of records added, and numbers of data citations). Plazi makes some of its parsing tools available as web services (e.g., http://tb.plazi.org/GgWS/wss/test and https://github.com/gsautter/goldengate-webservices), so in principle we could parse BHL content and extract data in a form usable by ALA and GBIF.

Notes on Plazi web service

The endpoint is http://tb.plazi.org/GgWS/wss/invokeFunction and it accepts POST requests, e.g. data=Namibia%3A%2058%20km%20W%20of%20Kamanjab%20Rest%20Camp%20on%20road%20to%20Grootberg%20Pass%20%2819%C2%B038%2757%22S%2C%2014%C2%B024%2733%22E%29&functionName=GeoCoordinateTaggerNormalizing.webService&dataUrl=&dataFormat=TXT and returns XML.
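For example, here is a minimal sketch in Python of calling the geo-coordinate tagging function (using the requests library; the parameter names and values are taken from the example POST data above):

import requests

# Sketch: send a locality string to Plazi's GoldenGATE web service and get back
# XML with the geo-coordinates tagged. Parameters follow the example POST above.
params = {
    "functionName": "GeoCoordinateTaggerNormalizing.webService",
    "data": 'Namibia: 58 km W of Kamanjab Rest Camp on road to Grootberg Pass (19°38\'57"S, 14°24\'33"E)',
    "dataUrl": "",
    "dataFormat": "TXT",
}
response = requests.post("http://tb.plazi.org/GgWS/wss/invokeFunction", data=params)
print(response.text)  # XML with tagged coordinates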

Friday, May 28, 2021

Finding citations of specimens

Note to self.

The challenge of finding specimen citations in papers keeps coming around. It seems that this is basically the same problem as finding citations to papers, and can be approached in much the same way.

If you want to build a database of references from scratch, one way is to scrape citations from papers (e.g., from the "literature cited" section), convert those strings into structured data, and add them to your database. In the early days of bibliographic searching this was a common strategy (and I still use it to help populate Wikidata).

One way to convert citation strings into structured data is to use regular expressions. These are powerful but also brittle: you need to keep tweaking them to accommodate all the minor ways citation styles can differ. This leads to more sophisticated (and hopefully more robust) approaches, such as machine learning. Conditional random fields (CRFs) are a popular technique, pioneered by tools like ParsCit and most recently used in the very elegant anystyle.io. You paste in a citation string and you get back that citation with all the component parts (authors, title, journal, pagination, etc.) separated out. Approaches like this require training data to teach the parser how to recognise the parts of a citation string. One obvious way to generate training data is to take a large bibliographic database and a set of "style sheets" describing all the ways different journals format citations (e.g., citationstyles.org), and use them to generate lots of labelled citation strings.
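As a rough sketch of that idea, here is how one might generate synthetic training citations in Python from structured records plus a couple of hand-written "styles" (the templates and the single record here are just for illustration; a real pipeline would use CSL styles and a large bibliographic database):

styles = [
    "{authors} ({year}). {title}. {journal}, {volume}, {pages}.",
    "{authors}. {year}. {title}. {journal} {volume}: {pages}.",
]

records = [
    {"authors": "Timms, B. V.", "year": "1989",
     "title": "Simocephalus Schoedler (Cladocera: Daphniidae) in tropical Australia",
     "journal": "The Beagle", "volume": "6", "pages": "89-96"},
]

# Each generated string, paired with the structured record it came from,
# is one labelled training example for a citation parser.
for record in records:
    for style in styles:
        print(style.format(**record))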

Over time the need for citation parsing has declined somewhat, being replaced by simple full-text search (exemplified by this tweet).

Again, in the early days a common method of bibliographic search was to search by keys such as journal name (or ISSN), volume number, and starting page. So you had to atomise the reference into its parts, then search for something that matched those parts. This is tedious (OpenURL anyone?), but it helps reduce false matches. If you only have a small bibliographic database, searching for a reference by string matching can be frustrating: you are likely to get lots of matches, but none of them to the reference you are actually looking for, because search will pretty much always return something. What really helps is if the database actually has the answer to your search (this is one reason Google is so great: if you have indexed the whole web, chances are you have the answer somewhere already). Now that CrossRef's database has grown much larger you can search for a reference using a simple string search and be reasonably confident of getting a genuine hit. The need to atomise a reference for searching is disappearing.
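For example, CrossRef's REST API supports this kind of search via its query.bibliographic parameter; a minimal sketch in Python (whether the top hit is the right paper depends, of course, on CrossRef's coverage):

import requests

# Sketch: look up a reference in CrossRef using a plain citation string.
citation = "Timms 1989 Simocephalus Schoedler (Cladocera: Daphniidae) in tropical Australia The Beagle 6 89-96"
response = requests.get(
    "https://api.crossref.org/works",
    params={"query.bibliographic": citation, "rows": 1},
)
items = response.json()["message"]["items"]
if items:
    print(items[0]["DOI"], items[0].get("title"))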

So, armed with a good database and good search tools we can avoid parsing references. Search also opens up other possibilities, such as finding citations using full-text search. Given a reference, how do you find where it's been cited? One approach is to take the text of a paper (A), extract the references in its "literature cited" section (B, C, D, etc.), match those to a database, and add the "A cites B", "A cites C", etc. links to the database. This answers "what papers does A cite?" but not "what other papers cite C?". One approach to that question is to take reference C, convert it to a citation string, then blast through all the full text you can find looking for matches to that string - those matches are likely to be papers that cite C. In other words, you are finding that string in the "literature cited" sections of the citing papers.

So, to summarise:

  1. To recognise and extract citations as structured data from text we can use regular expressions and/or machine learning.
  2. Training data for machine learning can be generated from existing bibliographic data coupled with rules for generating citation strings.
  3. As bibliographic databases grow in size the need for extracting and parsing citations diminishes. Our databases will have most of the citations already, so that using search is enough to find what we want.
  4. To build a citation database we can parse the "literature cited" section of a paper and extract all the references it cites ("X cites").
  5. Another approach to building a citation database is to tackle the reverse question, namely "X is cited by". This can be done by a full text search for citation strings corresponding to X.

How does this relate to citing specimens you ask? Well, I think the parallels are very close:

  • We could use CRF approaches to have something like anystyle.io for specimens. Paste in a specimen citation from the "Materials examined" section of a paper and have it resolved into its component parts (e.g., specimen code, collector, locality, date).
  • We have a LOT of training data in the form of GBIF. Just download data in Darwin Core format, apply various rules for how specimens are cited in the literature, and we have our training data.
  • Using our specimen parser we could process the "Materials examined" section of a paper to find the specimens (Plazi extracts specimens from papers, although it's not clear to me how automated this is.)
  • We could also do the reverse: take a Darwin Core Archive for, say, a single institution, generate all the specimen citation strings you'd expect to see people use in their papers, then go search through the full text of papers (e.g., in PubMed Central and BHL) looking for those strings - those are citations of your specimens (see the sketch following this list).
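Here is a minimal sketch of that last idea: generating candidate specimen citation strings from a Darwin Core occurrence file (the file name, the choice of fields, and the single citation "style" are all assumptions; real papers cite specimens in many different ways):

import csv

# Sketch: turn Darwin Core occurrence records (occurrence.txt from a GBIF
# download) into strings of the kind a paper might use in its "Materials
# examined" section. The template "<institutionCode> <catalogNumber>,
# <locality>, <eventDate>, <recordedBy>" is just one hypothetical style.
def specimen_string(row):
    code = " ".join(filter(None, [row.get("institutionCode"), row.get("catalogNumber")]))
    place = ", ".join(filter(None, [row.get("country"), row.get("stateProvince"), row.get("locality")]))
    parts = [code, place, row.get("eventDate"), row.get("recordedBy")]
    return ", ".join(p for p in parts if p)

with open("occurrence.txt", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(specimen_string(row))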

There seems a lot of scope for learning from the experience of people working with bibliographic citations, especially how to build parsers, and the role that "stylesheets" could play in helping to understand how people cite specimens. Obviously, a lot of this would be unnecessary if there was a culture of using and citing persistent identifiers for specimens, but we seem to be a long way from that just yet.

Maximum entropy summary trees to display higher classifications

How to cite: Page, R. (2021). Maximum entropy summary trees to display higher classifications https://doi.org/10.59350/af01t-6sw74

A challenge in working with large taxonomic classifications is how you display them to the user, especially if the user probably doesn't want all the gory details. For example, the Field Guide to Victorian Fauna app has a nice menu of major animal groups:

This includes both taxonomic and ecological categories (e.g., terrestrial, freshwater, etc.) and greatly simplifies the animal tree of life, but it is a user-friendly way to start browsing a larger database of facts about animals. It would be nice if we could automate constructing such lists, especially for animal groups where the choices of what to display might not seem obvious (everyone wants to see birds, but which insect groups would you prioritise?).

One way to help automate these sorts of lists is to use summary trees (see also Karloff, H., & Shirley, K. E. (2013). Maximum Entropy Summary Trees. Computer Graphics Forum, 32(3pt1), 71–80. doi:10.1111/cgf.12094). A summary tree takes a large tree and produces a small summary with k nodes, where k is a number that you supply; in other words, if you want your summary to have 10 nodes then k = 10. The diagram below summarises an organisation chart for 43,134 employees.

Summary trees show only a subset of the nodes in the complete tree. All the nodes with a given parent that aren't displayed get aggregated into a newly created "others" node that is attached to that parent. Hence the summary tree alerts the user that there are nodes in the full tree that aren't shown.

Code for maximum entropy summary trees is available in C and R from https://github.com/kshirley/summarytrees, so I've been playing with it a little (I don't normally use R but there was little choice here). As an example I created a simple tree for animals, based on the Catalogue of Life. I took a few phyla and classes and built a tree as a CSV file (see the gist). The file lists each node (uniquely numbered), its parent node (the parent of the root of the tree is "0"), a label, and a weight. For an internal node the weight is always 0; for a leaf the weight can be assigned in various ways. By default you could assign each leaf a weight of 1, but if the "leaf" node represents more than one thing (for example, the class Mammalia) then you can give it the number of species in that class (e.g., 5939). You could also assign weights based on some other measure, such as "popularity". In the gist I got bored and only added species counts for a few taxa; everything else was set to 1.
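The input file looks something like this (a made-up fragment to show the format rather than the actual gist; only Mammalia has a real species count here):

node,parent,label,weight
1,0,Animalia,0
2,1,Chordata,0
3,2,Mammalia,5939
4,2,Aves,1
5,1,Mollusca,1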

I then loaded the tree into R and found a summary tree for k=30 (the script is in the gist):

This doesn't look too bad (note that, as I said above, I didn't fill in all the actual species counts, because reasons). If I wanted to convert this into a menu such as the one the Field Guide to Victorian Fauna app uses, I would simply list the leaf nodes in order, skipping over those labelled "n others", which would give me:

  • Mammalia
  • Amphibia
  • Reptilia
  • Aves
  • Actinopterygii
  • Hemiptera
  • Hymenoptera
  • Lepidoptera
  • Diptera
  • Coleoptera
  • Arachnida
  • Acanthocephala
  • Nemertea
  • Rotifera
  • Porifera
  • Platyhelminthes
  • Nematoda
  • Mollusca

These 18 taxa are not a bad starting point for a menu, especially if we added pictures from PhyloPic to liven it up. There are probably a couple of animal groups that could be added to make it a bit more inclusive.

Because the technique is automated and fast, it would be straightforward to create submenus for major taxa, with the added advantage that you don't need to make decisions based on whether you know anything about that taxonomic group; it can be driven entirely by species counts (for example). We could also use other measures for weights, such as the number of Google search hits, or the size of pages on Wikipedia. So far I've barely scratched the surface of what could be done with this tool.

P.S. The R code is:

library(devtools)
install_github("kshirley/summarytrees", build_vignettes = TRUE)

library(summarytrees)

# Read the tree: one row per node, with columns node, parent, label, and weight
data <- read.table('/Users/rpage/Development/summarytrees/animals.csv', header = TRUE, sep = ",")

# Compute greedy summary trees for k = 1 to 30
g <- greedy(node   = data[, "node"],
            parent = data[, "parent"],
            weight = data[, "weight"],
            label  = data[, "label"],
            K      = 30)

# Write the k = 30 summary tree to a CSV file
write.csv(g$summary.trees[[30]], '/Users/rpage/Development/summarytrees/summary.csv')

The gist has the data file, and a simple PHP program to convert the output into a dot file to be viewed with GraphViz.

Tuesday, May 18, 2021

Preprint on Wikidata and the bibliography of life

Last week I submitted a manuscript entitled "Wikidata and the bibliography of life". I've been thinking about the "bibliography of life" (AKA a database of every taxonomic publication ever published) for a while, and this paper explores the idea that Wikidata is the place to create this database. The preprint version is on bioRxiv (doi:10.1101/2021.05.04.442638). Here's the abstract:

Biological taxonomy rests on a long tail of publications spanning nearly three centuries. Not only is this literature vital to resolving disputes about taxonomy and nomenclature, for many species it represents a key source - indeed sometimes the only source - of information about that species. Unlike other disciplines such as biomedicine, the taxonomic community lacks a centralised, curated literature database (the “bibliography of life”). This paper argues that Wikidata can be that database as it has flexible and sophisticated models of bibliographic information, and an active community of people and programs (“bots”) adding, editing, and curating that information. The paper also describes a tool to visualise and explore bibliography information in Wikidata and how it links to both taxa and taxonomists.

The manuscript summarises some work I've been doing to populate Wikidata with taxonomic publications (building on a huge amount of work already done), and also describes ALEC, which I use to visualise this content. I've made various (unreleased) knowledge graphs of taxonomic information (and one that I have actually released, Ozymandias), and I'm still torn between whether the future is to invest more effort in Wikidata, or to construct lighter, faster, domain-specific knowledge graphs for taxonomy. I think the answer is likely to be "yes".

Meantime, one chart I quite like from the submitted version of this paper is shown below.

It's a chart that is a bit tricky to interpret. My goal was to get a sense of whether bibliographic items added to Wikidata (e.g., taxonomic papers) were actually being edited by the Wikidata community, or whether they just sat there unchanged since they were added. If people are editing these publications, for example by adding missing author names, linking papers to items for their authors, or adding additional identifiers (such as DOIs, ZooBank identifiers, etc.), then there is clear value in using Wikidata as a repository of bibliographic data. So I grabbed a sample of 1000 publications, retrieved their edit history from Wikidata, and plotted the creation timestamp of each item against the timestamps of every edit made to that item. If an item were never edited, all its points would fall along the diagonal line; if edits are made, they appear to the right of the diagonal. I could have just counted the edits, but I wanted to visualise them. As the chart shows, there is quite a lot of editing activity, so there is a community of people (and bots) curating this content. In many ways this is the strongest argument for using Wikidata for a "bibliography of life". Any database needs curation, which means people, and this is what Wikidata offers: a community of people who care about often esoteric details, and get pleasure from improving structured data.
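For anyone wanting to do something similar, the edit history of an item can be retrieved from the MediaWiki API behind Wikidata. A minimal sketch in Python (the item ID is just a placeholder):

import requests

# Sketch: fetch revision timestamps for one Wikidata item. Revisions come back
# newest first, so the last timestamp is the creation date (assuming the item
# has fewer revisions than the API limit).
response = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "query", "prop": "revisions", "titles": "Q12345",
            "rvprop": "timestamp", "rvlimit": "max", "format": "json"},
)
for page in response.json()["query"]["pages"].values():
    timestamps = [rev["timestamp"] for rev in page["revisions"]]
    created = timestamps[-1]
    for edited in timestamps:
        print(created, edited)  # one (creation, edit) pair per revision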

There are still huge gaps in Wikidata's coverage of the taxonomic literature. Once you move beyond the "low hanging fruit" of publications with CrossRef DOIs the task of adding literature to Wikidata gets a bit more complicated. Then there is the reconciliation problem: given an existing taxonomic database with a list of references, how do we match those references to the corresponding items in Wikidata? There is still a lot to do.