Friday, July 23, 2021

Species Cite: linking scientific names to publications and taxonomists

I've made Species Cite live. This is a web site I've been working on, with the GBIF Challenge as a notional deadline so that I'd actually get something out the door.

"Species Cite" takes as its inspiration the suggestion that citing original taxonomic descriptions (and subsequent revisions) would increase citation metrics for taxonomists, and give them the credit they deserve. Regardless of the merits of this idea, it is difficult to implement because we don’t have an easy way of discovering which paper we should cite. Species Cite tackles this by combining millions of taxonomic name records linked to LSIDs with bibliographic data from Wikidata to make it easier to cite the sources of taxonomic names. Where possible it provides access to PDFs for articles using Internet Archive, or Unpaywall. These can be displayed in an embedded PDF viewer. Given the original motivation of surfacing the work of taxonomists, Species Cite also attempts to display information about the authors of a taxonomic paper, such as ORCID and/or Wikidata identifiers, and an avatar for the author via either Wikidata or ResearchGate. This enables us to get closer to the kind of social interface found in citizen science projects like iNaturalist where participants are people with personalities, not simply strings of text. Furthermore by identifying people and associating them with taxa it could help us discover who are the experts on particular taxonomic groups, and also enable those people to easily establish that they are, in fact, experts.

How it works

Under the hood there are a lot of moving pieces. The taxonomic names come from a decade or more of scraping LSIDs from various taxonomic databases, primarily ION, IPNI, Index Fungorum, and Nomenclator Zoologicus. Given that these LSIDs are often offline, I built caches one and two to make them accessible (see It's been a while...).

The bibliographic data is stored in Wikidata, and I've built an app to explore that data (see Wikidata and the bibliography of life in the time of coronavirus) and also a simple search engine to find things quickly (see Towards a WikiCite search engine). I've also spent way more time than I'd care to admit adding taxonomic literature to Wikidata and the Internet Archive.

The map between names and literature is based on work I've done with BioNames and various unpublished projects.

To make things a bit more visually interesting I've used images of taxa from PhyloPic, and also harvested images from ResearchGate to supplement the rather limited number of images of taxonomists in Wikidata.

One of the things I've tried to do is avoid making new databases, as those often die from neglect. Hence the use of Wikidata for bibliographic data. The taxonomic data is held in static files in the LSID caches. The mapping between names and publications is a single (large) tab-delimited file that is searched on disk using a crude binary search on a sorted list of taxonomic names (sketched below). This means you can download the GitHub repository and be up and running without installing a database. Likewise the LSID caches use static (albeit compressed) files. The only database directly involved is the WikiCite search engine.
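
A minimal sketch of that on-disk binary search (not the actual Species Cite code; the file name and record layout here are made up, with the name as the first tab-delimited column of each sorted line):

<?php
// Binary search a sorted, tab-delimited file without loading it into memory.
function find_name($filename, $target) {
    $f = fopen($filename, 'r');
    $lo = 0;
    $hi = filesize($filename);
    while ($lo < $hi) {
        $mid = intval(($lo + $hi) / 2);
        fseek($f, $mid);
        if ($mid > 0) {
            fgets($f); // we probably landed mid-line, so skip to the next line
        }
        $pos = ftell($f);
        $line = ($pos < $hi) ? fgets($f) : false;
        if ($line === false) {
            $hi = $mid; // ran off the end of the search window
            continue;
        }
        $cmp = strcmp($target, explode("\t", rtrim($line, "\r\n"))[0]);
        if ($cmp < 0) {
            $hi = $mid;
        } elseif ($cmp > 0) {
            $lo = $pos + strlen($line); // start of the next line
        } else {
            $lo = $pos; // found it: leave $lo at the start of the matching line
            break;
        }
    }
    // the match, if present, starts at $lo
    fseek($f, $lo);
    $line = fgets($f);
    fclose($f);
    if ($line !== false) {
        $parts = explode("\t", rtrim($line, "\r\n"));
        if (strcmp($target, $parts[0]) == 0) {
            return $parts;
        }
    }
    return null;
}

print_r(find_name('names.tsv', 'Garcinia nuntasaenii'));

Because the file is sorted, each lookup needs only a logarithmic number of seeks, so even a file with millions of names responds quickly enough for a web site.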

Once all the moving bits come together, you can start to display things like a plant species together with its original description and the taxonomists who described that species (Garcinia nuntasaenii):

What's next

There is still so much to do. I need to add lots of taxonomic literature from my own BioStor project and other sources, and the bibliographic data in Wikidata needs constant tending and improving (which is happening, see Preprint on Wikidata and the bibliography of life). And at some point I need to think about how to get the links between names and literature into Wikidata.

Anyway, the web site is live at https://species-cite.herokuapp.com.

Update

I've created a 5 min screencast walking you through the site.

Thursday, July 22, 2021

Towards a WikiCite search engine

I've released a simple search engine for publications in Wikidata. Wikicite Search takes its name from the WikiCite project, an initiative to create a bibliographic database in Wikidata. Since bibliographic data is a core component of taxonomic research (arguably taxonomy is mostly tracing the fate of the "tags" we call taxonomic names), I've spent some time getting taxonomic literature into Wikidata. Because there are bots already adding articles by harvesting sources such as CrossRef and PubMed, I've focussed on literature that is harder to add, such as articles with non-CrossRef DOIs, or those without DOIs at all.

Once you have a big database, you are then faced with the challenge of finding things in that database. Wikidata supports generic search, but I wanted something more closely geared to bibliographic data. Hence Wikicite Search. Over the last few years I've made several attempts at a bibliographic search engine; for this project I've finally settled on some basic ideas:

  1. The core data structure is CSL-JSON, a simple but rich JSON format for expressing bibliographic data.
  2. The search engine is Elasticsearch. The documents I upload include the CSL-JSON for an article, but also a simple text representation of the article. This text representation may include multiple languages if, for example, the article has a title in more than one language. This means that if an article has both English and Chinese titles you can find it searching in either language (see the sketch after this list).
  3. The web interface is very simple: search for a term, get results. If the search term is a Wikidata identifier you get just the corresponding article, e.g. Q98715368.
  4. There is a reconciliation API to help match articles to the database. Paste in one citation per line and you get back matches (if found) for each citation.
  5. Where possible I display a link to a PDF of the article, which is typically stored in the Internet Archive or accessible via the Wayback Machine.
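
Here is the document shape promised in point 2. This is an invented example: the field names ("text" and "csl") and the record itself are illustrative, not necessarily what the live index actually stores:

{
  "id": "Q00000000",
  "text": "A new species of Gekko from China 中国壁虎属一新种 Zootaxa 2020",
  "csl": {
    "id": "Q00000000",
    "type": "article-journal",
    "title": "A new species of Gekko from China",
    "container-title": "Zootaxa",
    "author": [ { "family": "Li", "given": "X." } ],
    "issued": { "date-parts": [ [ 2020 ] ] }
  }
}

Because the "text" field concatenates everything, including titles in multiple languages, a query in either English or Chinese will match this document.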

There are millions of publications in Wikidata; currently fewer than half a million are in my search engine. My focus is narrowly on eukaryote taxonomy and related topics. I will be adding more articles as time permits. I also periodically reload existing articles to capture updates to the metadata made by the Wikidata community - being a wiki, the data in Wikidata is constantly evolving.

My goal is to have a simple search tool that focusses on matching citation strings. In other words, it is designed to find a reference you are looking for, rather than be a tool to search the taxonomic literature. If that sounds contradictory, consider that my tool will only find a paper about a taxon if that taxon is explicitly named in the title. A more sophisticated search engine would support things like synonym resolution, etc.

The other reason I built this is to provide an API for accessing Wikidata items and displaying them in other formats. For example, an article in the WikiCite search engine can be retrieved in CSL-JSON format, or in RDF as JSON-LD.

As always, it's very early days. But I don't think it's unreasonable to imagine that as Wikidata grows we could envisage having a search engine that includes the bulk of the taxonomic literature.

Citation parsing tool released

Quick note on a tool I've been working on to parse citations, that is, to take a series of strings such as:

  • Möllendorff O (1894) On a collection of land-shells from the Samui Islands, Gulf of Siam. Proceedings of the Zoological Society of London, 1894: 146–156.
  • de Morgan J (1885) Mollusques terrestres & fluviatiles du royaume de Pérak et des pays voisins (Presqúile Malaise). Bulletin de la Société Zoologique de France, 10: 353–249.
  • Morlet L (1889) Catalogue des coquilles recueillies, par M. Pavie dans le Cambodge et le Royaume de Siam, et description ďespèces nouvelles (1). Journal de Conchyliologie, 37: 121–199.
  • Naggs F (1997) William Benson and the early study of land snails in British India and Ceylon. Archives of Natural History, 24:37–88.

and return structured data. This is an old problem, and pretty much a "solved" problem. See for example AnyStyle. I've played with AnyStyle and it's great, but I had to install it on my computer rather than simply use it as a web service. I also wanted to explore the approach a bit more as a possible model for finding citations of specimens.

After trying to install the underlying conditional random fields (CRF) engine used by AnyStyle and running into a bunch of errors, I switched to a tool I could get working, namely CRF++. After figuring out how to compile a C++ application to run on Heroku I started to wonder how to use this as the basis of a citation parser. Fortunately, I had used the Perl-based ParsCit years ago, and managed to convert the relevant bits to PHP and build a simple web service around it.
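
For anyone unfamiliar with CRF++, it is driven by two command-line tools: one trains a model from labelled examples, the other applies that model to new data (the file names here are mine):

crf_learn template train.data model
crf_test -m model citations.data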

Although I've abandoned the Ruby-based AnyStyle I do use AnyStyle's XML format for the training data. I also built a crude editor to create small training data sets that uses a technique published by the author of the blogging tool I'm using to write this post (see MarsEdit Live Source Preview). Typically I use this to correctly annotate examples where the parser failed. Over time I add these to the training data and the performance gets better.

This is pretty much a side project of a side project, but ultimately the goal is to employ it to help extract citation data from publications, both to generate data to populate BioStor, and also to start to flesh out the citation graph for publications in Wikidata.

If you want to play with the tool it is at https://citation-parser.herokuapp.com. At the moment it takes some citation strings and returns the result in CSL-JSON, which is becoming the default way to represent structured bibliographic data. Code is on GitHub.

Tuesday, June 15, 2021

Compiling a C++ application to run on Heroku

TL;DR Use a buildpack and set "LDFLAGS=--static" --disable-shared

I use Heroku to host most of my websites, and since I mostly use PHP for web development this has worked fine. However, every so often I write an app that calls an external program written in, say, C++. Up until now I've had to host these apps on my own web servers. Today I finally bit the bullet and learned how to add a C++ program to a Heroku-hosted site.

In this case I wanted to add CRF++ to an app for parsing citations. I'd read on Stack Overflow that you could simply log into your Heroku instance using

heroku run bash
and compile the code there. I tried that for CRF++ but got a load of g++ errors, culminating in:

configure: error: Your compiler is not powerful enough to compile CRF++.

Turns out that the g++ compiler is only available at build time, that is, when the Heroku instance is being built before it is deployed. Once it is deployed g++ is no longer available (I'm assuming because Heroku tries to keep each running instance as small as possible).

So, next I tried using a buildpack, specifically felkr/heroku-buildpack-cpp. I forked this buildpack, and added it to my Heroku app (using the "Settings" tab). I put the source code for CRF++ into the root folder of the GitHub repository for the app (which makes things messy but this is where the buildpack looks for either Makefile or configure) then when the app is deployed CRF++ is compiled. Yay! Update: with a couple of tweaks I moved all the code into a folder called src and now things are a bit tidier.

Not so fast. I then logged back into the instance and tried running the executable:

heroku run bash -a <my app name>
./crf_learn
/app/.libs/crf_learn: error while loading shared libraries: libcrfpp.so.0: cannot open shared object file: No such file or directory

For whatever reason the executable is looking for a shared library which doesn't exist (this brought back many painful memories of dealing with C++ compilers on Macs, Windows, and Linux back in the day). To fix this I edited the buildpack compile script to set the "LDFLAGS=--static" --disable-shared flags for configure. This tells the compiler to build static versions of the libraries and executable. After redeploying the app once again, everything worked!
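
In other words, the configure and build step in the buildpack's compile script ends up looking something like this (simplified):

./configure "LDFLAGS=--static" --disable-shared
make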

The actual website itself is a mess at the moment so I won't share the link just yet. Update: see Citation parsing tool released for details. But it's great to know that I can have both a C++ executable and a PHP script hosted together without (too much) pain. As always, Google and Stack Overflow are your friends.

Friday, June 04, 2021

Thoughts on BHL, ALA, GBIF, and Plazi

If you compare the impact that BHL and Plazi have on GBIF, then it's clear that BHL is almost invisible. Plazi has successfully carved out a niche where they generate tens of thousands of datasets from text mining the taxonomic literature, whereas BHL is a participant in name only. It's not as if BHL lacks geographic data. I recently added back a map display in BioStor, where each dot is a pair of latitude and longitude coordinates mentioned in an article derived from BHL's scans.

This data has the potential to fill in gaps in our knowledge of species distributions. For example, the Atlas of Living Australia (ALA) shows the following map for the cladoceran (water flea) Simocephalus:

Compare this to the localities mentioned in just one paper on this genus:

Franklin, D. C., Michael, B., & Mace, M. (2005). New location records for some butterflies of the Top End and Kimberley regions. Northern Territory Naturalist, 18, 1–7. Retrieved from https://biostor.org/reference/254167

There are records in this paper for species that currently have no records at all in ALA (e.g., Simocephalus serrulatus):

As it stands BioStor simply extracts localities; it doesn't extract the full "material citation" from the text (that is, the specimen code, date collected, locality, etc. for each occurrence). If it did, it would then be in a position to contribute a large amount of data to ALA and GBIF (and elsewhere). Not only that, if it followed the Plazi model this contribution would be measurable (for example, in terms of numbers of records added, and numbers of data citations). Plazi makes some of its parsing tools available as web services (e.g., http://tb.plazi.org/GgWS/wss/test and https://github.com/gsautter/goldengate-webservices), so in principle we could parse BHL content and extract data in a form usable by ALA and GBIF.

Notes on Plazi web service

The endpoint is http://tb.plazi.org/GgWS/wss/invokeFunction and it accepts POST requests with form-encoded parameters, returning XML:

  • functionName: GeoCoordinateTaggerNormalizing.webService
  • dataFormat: TXT
  • dataUrl: (empty)
  • data: the text to tag, e.g. Namibia: 58 km W of Kamanjab Rest Camp on road to Grootberg Pass (19°38'57"S, 14°24'33"E)
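
For example, a minimal PHP client (an untested sketch) would look something like this:

<?php
// POST a locality string to Plazi's geo-coordinate tagging service
$params = array(
    'functionName' => 'GeoCoordinateTaggerNormalizing.webService',
    'data'         => "Namibia: 58 km W of Kamanjab Rest Camp on road to Grootberg Pass (19°38'57\"S, 14°24'33\"E)",
    'dataUrl'      => '',
    'dataFormat'   => 'TXT'
);

$ch = curl_init('http://tb.plazi.org/GgWS/wss/invokeFunction');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($params));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);

echo $xml; // XML with the coordinates tagged and normalised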

Friday, May 28, 2021

Finding citations of specimens

Note to self.

The challenge of finding specimen citations in papers keeps coming around. It seems that this is basically the same problem as finding citations to papers, and can be approached in much the same way.

If you want to build a database of references from scratch, one way is to scrape citations from papers (e.g., from the "literature cited" section), convert those strings into structured data (for example, using regular expressions), and add them to your database. In the early days of bibliographic searching this was a common strategy (and I still use it to help populate Wikidata).

Regular expressions are powerful but also brittle: you need to keep tweaking them to accommodate all the minor ways citation styles can differ. This leads to more sophisticated (and hopefully more robust) approaches, such as machine learning. Conditional random fields (CRF) are a popular technique, pioneered by tools like ParsCit and most recently used in the very elegant anystyle.io. You paste in a citation string and you get back that citation with all the component parts (authors, title, journal, pagination, etc.) separated out. Approaches like this require training data to teach the parser how to recognise the parts of a citation string. One obvious way to generate that training data is to take a large bibliographic database, apply a set of "style sheets" describing all the ways different journals format citations (e.g., citationstyles.org), and generate as many citation strings as you need.

Over time the need for citation parsing has declined somewhat, being replaced by simple full-text search (exemplified by this Tweet).

Again, in the early days a common method of bibliographic search was to search by keys such as journal name (or ISSN), volume number, and starting page. So you had to atomise the reference into its parts, then search for something that matched those parts. This is tedious (OpenURL anyone?), but helps reduce false matches. If you only have a small bibliographic database, searching for a reference by string matching can be frustrating because you are likely to get lots of matches, but none of them to the reference you are actually looking for. Given how search works you'll pretty much always get some sort of match. What really helps is if the database has the answer to your search (this is one reason Google is so great: if you have indexed the whole web, chances are you have the answer somewhere already). Now that CrossRef's database has grown much larger you can search for a reference using a simple string search and be reasonably confident of getting a genuine hit. The need to atomise a reference for searching is disappearing.
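
For example, CrossRef's REST API accepts a whole citation string via its query.bibliographic parameter; a minimal sketch (no error handling):

<?php
// Match a citation string against CrossRef
$citation = 'Naggs F (1997) William Benson and the early study of land snails in British India and Ceylon. Archives of Natural History, 24:37-88.';
$url = 'https://api.crossref.org/works?rows=1&query.bibliographic=' . urlencode($citation);
$response = json_decode(file_get_contents($url));
$best = $response->message->items[0]; // highest-scoring candidate
echo $best->DOI . "\n";

Note that the top hit is just the best match available, so in practice you still want some kind of score threshold before accepting it.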

So, armed with a good database and good search tools we can avoid parsing references. Search also opens up other possibilities, such as finding citations using full-text search. Given a reference, how do you find where it's been cited? One approach is to parse the text of a paper (A), extract the references in its "literature cited" section (B, C, D, etc.), match those to a database, and add the "A cites B", "A cites C", etc. links to the database. This will answer "what papers does A cite?" but not "what other papers cite C?". One approach to that question would be to simply take the reference C, convert it to a citation string, then blast through all the full text you could find looking for matches to that citation string - these are likely to be papers that cite reference C. In other words, you are finding that string in the "literature cited" sections of the citing papers.

So, to summarise:

  1. To recognise and extract citations as structured data from text we can use regular expressions and/or machine learning.
  2. Training data for machine learning can be generated from existing bibliographic data coupled with rules for generating citation strings.
  3. As bibliographic databases grow in size the need for extracting and parsing citations diminishes. Our databases will have most of the citations already, so that using search is enough to find what we want.
  4. To build a citation database we can parse the literature cited section and extract all references cited by a paper ("X cites").
  5. Another approach to building a citation database is to tackle the reverse question, namely "X is cited by". This can be done by a full text search for citation strings corresponding to X.

How does this relate to citing specimens you ask? Well, I think the parallels are very close:

  • We could use CRF approaches to have something like anystyle.io for specimens. Paste in a specimen from the "Materials examined" section of a paper and have it resolved into its component parts (e.g., collector, locality, date).
  • We have a LOT of training data in the form of GBIF. Just download data in Darwin Core format, apply various rules for how specimens are cited in the literature, and we have our training data.
  • Using our specimen parser we could process the "Materials examined" section of a paper to find the specimens (Plazi extracts specimens from papers, although it's not clear to me how automated this is.)
  • We could also do the reverse: take a Darwin Core Archive for, say, a single institution, generate all the specimen citation strings you'd expect to see people use in their papers, then go search through the full text of papers (e.g., in PubMed Central and BHL) looking for those strings - those are citations of your specimens (see the sketch after this list).
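
As a sketch of that last idea, here is one way to generate candidate citation strings from a Darwin Core record. The record and the "styles" are invented, but institutionCode and catalogNumber are standard Darwin Core terms:

<?php
// A (made up) Darwin Core record for a specimen
$record = array(
    'institutionCode' => 'NHMUK',
    'catalogNumber'   => '1901.1.1.1'
);

// Invented examples of ways authors might cite this specimen
$styles = array(
    '{institutionCode} {catalogNumber}',
    '{institutionCode}{catalogNumber}',
    '{institutionCode} no. {catalogNumber}'
);

$strings = array();
foreach ($styles as $style) {
    foreach ($record as $term => $value) {
        $style = str_replace('{' . $term . '}', $value, $style);
    }
    $strings[] = $style;
}

print_r($strings); // the variants to search for in full text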

There seems to be a lot of scope for learning from the experience of people working with bibliographic citations, especially how to build parsers, and the role that "stylesheets" could play in helping to understand how people cite specimens. Obviously, a lot of this would be unnecessary if there was a culture of using and citing persistent identifiers for specimens, but we seem to be a long way from that just yet.

Maximum entropy summary trees to display higher classifications

A challenge in working with large taxonomic classifications is how you display them to the user, especially if the user probably doesn't want all the gory details. For example, the Field Guide app to Victorian Fauna has a nice menu of major animal groups:

This includes both taxonomic and ecological categories (e.g., terrestrial, freshwater, etc.) and greatly simplifies the animal tree of life, but it is a user-friendly way to start browsing a larger database of facts about animals. It would be nice if we could automate constructing such lists, especially for animal groups where the choices of what to display might not seem obvious (everyone wants to see birds, but which insect groups would you prioritise?).

One way to help automate these sorts of lists is to use summary trees (see also Karloff, H., & Shirley, K. E. (2013). Maximum Entropy Summary Trees. Computer Graphics Forum, 32(3pt1), 71–80. doi:10.1111/cgf.12094). A summary tree takes a large tree and produces a small summary for k nodes, where k is a number that you supply. In other words, if you want your summary to have 10 nodes then k = 10. The diagram below summarises an organisation chart for 43,134 employees.

Summary trees show only a subset of the nodes in the complete tree. All the nodes with a given parent that aren't displayed get aggregated into a newly created "others" node that is attached to that parent. Hence the summary tree alerts the user that there are nodes which exist in the full tree but which aren't shown.

Code for maximum entropy summary trees is available in C and R from https://github.com/kshirley/summarytrees, so I've been playing with it a little (I don't normally use R but there was little choice here). As an example I created a simple tree for animals, based on the Catalogue of Life. I took a few phyla and classes and built a tree as a CSV file (see the gist). The file lists each node (uniquely numbered), its parent node (the parent of the root of the tree is "0"), a label, and a weight. For an internal node the weight is always 0, for a leaf the weight can be assigned in various ways. By default you could assign each leaf a weight of 1, but if the "leaf" node represents more than one thing (for example, the class Mammalia) then you can give it the number of species in that class (e.g., 5939). You could also assign weights based on some other measure, such as "popularity". In the gist I got bored and only added species counts for a few taxa, everything else was set to 1.
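
For illustration, the first few lines of such a file might look like this (all the counts below except Mammalia's are invented):

node,parent,label,weight
1,0,Animalia,0
2,1,Chordata,0
3,2,Mammalia,5939
4,2,Aves,10000
5,1,Arthropoda,0
6,5,Insecta,0
7,6,Coleoptera,350000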

I then loaded the tree into R and found a summary tree for k=30 (the script is in the gist):

This doesn't look too bad (note, as I said above, I didn't fill in all the actual species counts because reasons). If I wanted to convert this into a menu such as the one the Victorian Fauna app uses, I would simply list the leaf nodes in order, skipping over those labelled "n others", which would give me:

  • Mammalia
  • Amphibia
  • Reptilia
  • Aves
  • Actinopterygii
  • Hemiptera
  • Hymenoptera
  • Lepidoptera
  • Diptera
  • Coleoptera
  • Arachnida
  • Acanthocephala
  • Nemertea
  • Rotifera
  • Porifera
  • Platyhelminthes
  • Nematoda
  • Mollusca

These 18 taxa are not a bad starting point for a menu, especially if we added pictures from PhyloPic to liven it up. There are probably a couple of animal groups that could be added to make it a bit more inclusive.

Because the technique is automated and fast, it would be straightforward to create submenus for major taxa, with the added advantage that you don't need to make decisions based on whether you know anything about that taxonomic group; it can be driven entirely by species counts (for example). We could also use other measures for weights, such as number of Google search hits, or size of pages on Wikipedia. So far I've barely scratched the surface of what could be done with this tool.

P.S. The R code is:

library(devtools)
install_github("kshirley/summarytrees", build_vignettes = TRUE)

library(summarytrees)

# Read the tree: one row per node, giving its parent, label, and weight
data = read.table('/Users/rpage/Development/summarytrees/animals.csv', header = TRUE, sep = ",")

# Compute the greedy summary trees for k = 1..30
g <- greedy(node = data[, "node"],
            parent = data[, "parent"],
            weight = data[, "weight"],
            label = data[, "label"],
            K = 30)

# Write out the k = 30 summary tree
write.csv(g$summary.trees[[30]], '/Users/rpage/Development/summarytrees/summary.csv')

The gist has the data file, and a simple PHP program to convert the output into a dot file to be viewed with GraphViz.

Tuesday, May 18, 2021

Preprint on Wikidata and the bibliography of life

Last week I submitted a manuscript entitled "Wikidata and the bibliography of life". I've been thinking about the "bibliography of life" (AKA a database of every taxonomic publication ever published) for a while, and this paper explores the idea that Wikidata is the place to create this database. The preprint version is on bioRxiv (doi:10.1101/2021.05.04.442638). Here's the abstract:

Biological taxonomy rests on a long tail of publications spanning nearly three centuries. Not only is this literature vital to resolving disputes about taxonomy and nomenclature, for many species it represents a key source - indeed sometimes the only source - of information about that species. Unlike other disciplines such as biomedicine, the taxonomic community lacks a centralised, curated literature database (the “bibliography of life”). This paper argues that Wikidata can be that database as it has flexible and sophisticated models of bibliographic information, and an active community of people and programs (“bots”) adding, editing, and curating that information. The paper also describes a tool to visualise and explore bibliography information in Wikidata and how it links to both taxa and taxonomists.

The manuscript summarises some work I've been doing to populate Wikidata with taxonomic publications (building on a huge amount of work already done), and also describes ALEC, which I use to visualise this content. I've made various (unreleased) knowledge graphs of taxonomic information (and one that I have actually released, Ozymandias), and I'm still torn between whether the future is to invest more effort in Wikidata, or to construct lighter, faster, domain-specific knowledge graphs for taxonomy. I think the answer is likely to be "yes".

Meantime, one chart I quite like from the submitted version of this paper is shown below.

It's a chart that is a bit tricky to interpret. My goal was to get a sense of whether bibliographic items added to Wikidata (e.g., taxonomic papers) were actually being edited by the Wikidata community, or whether they just sat there unchanged since they were added. If people are editing these publications, for example by adding missing author names, linking papers to items for their authors, or adding additional identifiers (such as DOIs, ZooBank identifiers, etc.), then there is clear value in using Wikidata as a repository of bibliographic data.

So I grabbed a sample of 1000 publications, retrieved their edit history from Wikidata, and plotted the creation timestamp of each item against the timestamps for each edit made to that item. If items were never edited then every point would fall along the diagonal line. If edits are made, they appear to the right of the diagonal. I could have just counted edits made, but I wanted to visualise those edits. As the chart shows, there is quite a lot of editing activity, so there is a community of people (and bots) curating this content. In many ways this is the strongest argument for using Wikidata for a "bibliography of life". Any database needs curation, which means people, and this is what Wikidata offers: a community of people who care about often esoteric details, and get pleasure from improving structured data.
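
Retrieving the edit history is straightforward using the MediaWiki API that Wikidata is built on; in outline (this fetches up to 500 revision timestamps for a single item):

<?php
// Fetch the revision timestamps for one Wikidata item
$item = 'Q98715368';
$url = 'https://www.wikidata.org/w/api.php?action=query&prop=revisions'
     . '&titles=' . $item
     . '&rvprop=timestamp&rvlimit=500&format=json';
$data = json_decode(file_get_contents($url), true);
foreach ($data['query']['pages'] as $page) {
    foreach ($page['revisions'] as $revision) {
        echo $revision['timestamp'] . "\n"; // one point on the chart
    }
}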

There are still huge gaps in Wikidata's coverage of the taxonomic literature. Once you move beyond the "low hanging fruit" of publications with CrossRef DOIs the task of adding literature to Wikidata gets a bit more complicated. Then there is the reconciliation problem: given an existing taxonomic database with a list of references, how do we match those references to the corresponding items in Wikidata? There is still a lot to do.

Tuesday, April 06, 2021

It's been a while...

It's been a while since I've blogged here. The last few months have been, um, interesting for so many reasons. Meanwhile in my little corner of the world there's been the constant challenge of rethinking how to teach online-only, whilst also messing about with a bunch of things in a rather unfocused way (and spending way too much time populating Wikidata). So here I'll touch on a few rather random topics that have come up in the last few months, and elsewhere on this blog I'll try and focus on some of the things that I'm working on. In many ways this blog post really serves as a series of bookmarks for things I'd like to think about a bit more.

Taxonomic precision and the "one tree"

One thing that had been bugging me for a while was my inability to find the source of a quote about taxonomic precision that I remembered as a grad student. I was pretty sure that David Penny and Mike Hendy had said it, but where? Found it at last:

Biologists seem to seek “The One Tree” and appear not to be satisfied by a range of options. However, there is no logical difficulty with having a range of trees. There are 34,459,425 possible trees for 11 taxa (Penny et al. 1982), and to reduce this to the order of 10-50 trees is analogous to an accuracy of measurement of approximately one part in 10⁶.

Many measurements in biology are only accurate to one or two significant figures and pale when compared to physical measurements that may be accurate to 10 significant figures. To be able to estimate an accuracy of one tree in 10⁶ reflects the increasing sophistication of tree reconstruction methods. (Note that, on this argument, to identify an organism to a species is also analogous to a measurement with an accuracy of approximately one in 10⁶.) — "Estimating the reliability of evolutionary trees" p. 414 doi:10.1093/oxfordjournals.molbev.a040407

I think this quote helps put taxonomy and phylogeny in the broader context of quantitative biology. Building trees that accurately place taxa is a computationally challenging task that yields some of the most precise measurements in biology.

Barcodes for everyone

This is yet another exciting paper from Rudolf Meier's lab (see earlier blog post Signals from Singapore: NGS barcoding, generous interfaces, the return of faunas, and taxonomic burden). The preprint doi:10.1101/2021.03.09.434692 is on bioRxiv. It feels like we are getting ever-closer to the biodiversity tricorder.

Barcodes for Australia

Donald Hobern (@dhobern) has been blogging about insects collected in malaise traps in Aranda, Australian Capital Territory (ACT). The insects are being photographed (see stream on Flickr) and will be barcoded.

No barcodes please we're taxonomists!

A paper with a title like "Minimalist revision and description of 403 new species in 11 subfamilies of Costa Rican braconid parasitoid wasps, including host records for 219 species" (Sharkey et al. doi:10.3897/zookeys.1013.55600) was always likely to cause problems, and sure enough some taxonomists had a meltdown. A lot of the arguments centered around whether DNA sequences counted as words, which seems surreal. DNA sequences are strings of characters, just like natural language. Unlike English, not all languages have word breaks. Consider Chinese, for example, where search engines can't break text up into words for indexing, but instead use n-grams. I mention this simply because n-grams are a useful way to index DNA sequences and to compute sequence similarity without performing a costly sequence alignment. I used this technique in my DNA barcode browser.

If we move beyond arguments about whether a picture and a DNA sequence are enough to describe a species (if all species ever discovered were described this way we'd arguably be much better off than we are now), I think there is a core issue here: the relative size of the intersection between taxa that have been described classically (i.e., with words) and those described almost entirely by DNA (e.g., barcodes) will likely drop as more and more barcoding is done, and this has implications for how we do biology (see Dark taxa: GenBank in a post-taxonomic world).
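
To make the n-gram point concrete, here is a minimal sketch that compares two DNA sequences by the overlap of their 5-mers, with no alignment step:

<?php
// Split a sequence into overlapping n-grams ("k-mers"), stored as set keys
function ngrams($sequence, $n = 5) {
    $grams = array();
    for ($i = 0; $i <= strlen($sequence) - $n; $i++) {
        $grams[substr($sequence, $i, $n)] = true;
    }
    return $grams;
}

// Jaccard similarity of the two n-gram sets
function similarity($seq1, $seq2, $n = 5) {
    $a = ngrams($seq1, $n);
    $b = ngrams($seq2, $n);
    $intersection = count(array_intersect_key($a, $b));
    $union = count($a) + count($b) - $intersection;
    return $union ? $intersection / $union : 0;
}

echo similarity('ACGTACGTTAGC', 'ACGTACGTTAGG'); // ≈ 0.78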

Bioschemas

The dream of linked data rumbles on. Schema.org is having a big impact on standardising the basic metadata embedded in web sites, so much so that anyone building a web site now needs to be familiar with schema.org if they want their site to do well in search engine rankings. I made extensive use of schema.org to model bibliographic data on Australian animals for my Ozymandias project.

Bioschemas aims to provide a biology-specific extension to schema.org, and is starting to take off. For example, GBIF pages for species now have schema.org embedded as JSON-LD, e.g. the page for Chrysochloris visagiei Broom, 1950 has this JSON-LD embedded in a <script type="application/ld+json"> tag:

{ "@context": [ "https://schema.org/", { "dwc": "http://rs.tdwg.org/dwc/terms/", "dwc:vernacularName": { "@container": "@language" } } ], "@type": "Taxon", "additionalType": [ "dwc:Taxon", "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept" ], "identifier": [ { "@type": "PropertyValue", "name": "GBIF taxonKey", "propertyID": "http://www.wikidata.org/prop/direct/P846", "value": 2432181 }, { "@type": "PropertyValue", "name": "dwc:taxonID", "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID", "value": 2432181 } ], "name": "Chrysochloris visagiei Broom, 1950", "scientificName": { "@type": "TaxonName", "name": "Chrysochloris visagiei", "author": "Broom, 1950", "taxonRank": "SPECIES", "isBasedOn": { "@type": "ScholarlyArticle", "name": "Ann. Transvaal Mus. vol.21 p.238" } }, "taxonRank": [ "http://rs.gbif.org/vocabulary/gbif/rank/species", "species" ], "dwc:vernacularName": [ { "@language": "eng", "@value": "Visagie s Golden Mole" }, { "@language": "eng", "@value": "Visagie's Golden Mole" }, { "@language": "eng", "@value": "Visagie's Golden Mole" }, { "@language": "eng", "@value": "Visagie's Golden Mole" }, { "@language": "", "@value": "Visagie's golden mole" }, { "@language": "eng", "@value": "Visagie's Golden Mole" }, { "@language": "deu", "@value": "Visagie-Goldmull" } ], "parentTaxon": { "@type": "Taxon", "name": "Chrysochloris Lacépède, 1799", "scientificName": { "@type": "TaxonName", "name": "Chrysochloris", "author": "Lacépède, 1799", "taxonRank": "GENUS", "isBasedOn": { "@type": "ScholarlyArticle", "name": "Tabl. Mamm. p.7" } }, "identifier": [ { "@type": "PropertyValue", "name": "GBIF taxonKey", "propertyID": "http://www.wikidata.org/prop/direct/P846", "value": 2432177 }, { "@type": "PropertyValue", "name": "dwc:taxonID", "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID", "value": 2432177 } ], "taxonRank": [ "http://rs.gbif.org/vocabulary/gbif/rank/genus", "genus" ] } }

For more details on the potential of Bioschemas see Franck Michel's TDWG Webinar.

OCR correction

Just a placeholder to remind me to revisit OCR correction and the dream of a workflow to correct text for BHL. I came across hOCR-Proofreader (which has a GitHub repo). The Internet Archive now provides hOCR files as one of its default outputs, so we're getting closer to a semi-automated workflow for OCR correction. For example, imagine having all this set up on GitHub so that people can correct text and push those corrections back. So close...

Roger Hyam keeps being awesome

Roger just keeps doing cool things that I keep learning from. In the last few months he's been working on a nice interface to the World Flora Online (WFO) which, let's face it, is horrifically ugly and does unspeakable things to the data. Roger is developing a nicer interface and is doing some cool things under the hood with identifiers that inspired me to revisit LSIDs (see below).

But the other thing Roger has been doing is using GraphQL to provide a clean API for the designer working with him to use. I have avoided GraphQL because I couldn't see what problem it solved. It's not a graph query language (despite the name), and it breaks HTTP caching; it just seemed that it was the SOAP of today. But if Roger's using it, I figured there must be something good here (and yes, I'm aware that GraphQL has a huge chunk of developer mindshare). As I was playing with yet another knowledge graph project I kept running into the challenge of converting a bunch of SPARQL queries into something that could be easily rendered in a web page, which is when the utility of GraphQL dawned on me. The "graph" in this case is really a structured set of results that correspond to the information you want to render on a web page. This may be the result of quite a complex series of queries (in my case using SPARQL on a triple store) that nobody wants to actually see. The other motivator was seeing DataCite's use of GraphQL to query the "PID Graph". So, I think I get it now, in the sense that I see why it is useful.

LSIDs back from the dead

In part inspired by Roger Hyam's work on WFO I released a Life Science Identifier (LSID) Resolver to make LSIDs resolvable. I'll spare you the gory details, but you can think of LSIDs as DOIs for taxonomic names. They came complete with a decentralised resolution mechanism (based on Internet domain names) and standards for what information they return (RDF as XML), and millions were minted for animal, fungi, and plant names. For various reasons they didn't really take off (they were technically tricky to use and didn't return information in a form people could read, so what were the chances?). Still, they contain a lot of valuable information for those of us interested in having lists of names linked to the primary literature. Over the years I have been collecting them and wanted a way to make them available. I've chosen a super-simple approach based on storing them in compressed form in GitHub and wrapping that repo in a simple web site. There are lots of limitations, but I like the idea that LSIDs actually, you know, persist.

DOIs for Biodiversity Heritage Library

In between everything else I've been working with BHL to add DOIs to the literature that they have scanned. Some of this literature is old and of limited scientific value (but sure looks pretty - Nicole Kearney is going to take me to task for saying that), but a lot of it is recent, rich, and scientifically valuable. I'm hoping that the coming months will see a lot of this literature emerge from relative obscurity and become a first class citizen of the taxonomic and biodiversity literature.

Summary

I guess something meaningful and deep should go here... nope, I'm done.