Thursday, July 09, 2020

Lists of species don't matter: thoughts on "Principles for creating a single authoritative list of the world’s species"

Garnett et al. recently published a paper in PLoS Biology that starts with the sentence "Lists of species matter":

Garnett, S. T., Christidis, L., Conix, S., Costello, M. J., Zachos, F. E., Bánki, O. S., … Thiele, K. R. (2020). Principles for creating a single authoritative list of the world’s species. PLOS Biology, 18(7), e3000736. doi:10.1371/journal.pbio.3000736

This paper (one of a forthcoming series) is pretty much the kind of paper I try and avoid reading. It has lots of authors so it is a paper by committee, those authors all have a stake in particular projects, and it is an acronym soup of organisations the paper is pitched at. It's a well-worn strategy: write one or more papers outlining making the case that there is a problem, then get funding based on the notion that clearly there's a problem (you've published papers saying so) and that you and your co-applicants are best placed to solve it (clearly, because you wrote the papers identifying the problem in the first place). I'm not criticising the strategy, it's how you get things done in science. It just makes for a rather uninspiring read.

From my perspective focussing on "lists" is a mistake. Lists don't really matter, it is what is on the list that counts. And I think this is where the real prize is. As I play with Wikidata I'm becoming increasingly aware of the clusterfuck mess the taxonomic database community has created by conflating taxonomic names with taxa, and by having multiple identifiers for the same things. We survive this mess by relying on taxonomic names as somewhat fuzzy identifiers, and the hope that we can communicate successfully with other people despite this ambiguity (I guess this is pretty much the basis of all communication). As Roger Hyam notes:

These taxon names we are dealing with are really just social tags that need to be organised in a central place.

Having lots of names (tags) is fine, and Wikidata is busy harvesting all these taxonomic names and their identifiers (ITIS, IPNI, NCBI, EOL, iNaturalist, eBird, etc., etc., etc.). For most of these names all we have is a mapping to other identifiers for the same name, a link to a parent taxon, and sometimes a link to a reference for the name. But what happens if we want to attach data to a taxon? Take, for example, the African Piculet Verreauxia africana. This bird has at least two scientific names, each with a separate entry in Wikidata: Verreauxia africana Q28123873 and Sasia africana Q1266812. These are the same species yet it has two entries in Wikidata. If I want to add, say, body weight, or population size, or longevity, which Wikidata item do I add that data too?

What we need is an identifier for the species, an identifier that remains unchanged even if the name changes, or if that species moves in the taxonomic hierarchy. Some databases do this already. For example the eBird identifier for Verreauxia africana/Sasia africana is afrpic1. Because the identifier remains unchanged we can do things such as "diffs" between successive classifications showing how the species has moved between different genera (see Taxonomic publications as patch files and the notion of taxonomic concepts):

45759416 c9c5ed80 bc1f 11e8 98ca 5f4554ddca42

Ironically it seems that for birds the common name (in this case "African Piculet") is a more stable identifier than the scientific name (although that may well change). By having stable taxon identifiers we can then decide what entity to attach biological data to. Taxonomic names have failed to do this, but are still vital as well known tags. The actual taxon identifiers should be opaque identifiers (like "afrpic1" - not really opaque but close enough - or Avibase's C4DFB5E31495AE94). Make each opaque identifier a DOI, use existing taxonomic names as formalised tags so we aren't disconnected from the literature, use timestamped versions to track changes in species classification over time, and we have something useful.

This, I think, is the real prize. Rather than frame the task as making a list of species so that organisations can have a checklist they can all share, why not frame it as providing a framework that we can hang trait data on? We have vast quantities of data residing in siloed databases, spreadsheets, and centuries of biological literature. The argument shouldn't be about what is on a list, it should be how we group that information together and enable people to do their science. By providing stable identifiers that are resistant to name changes we can confidently associate trait data with taxa. Taxonomy could then actually be what it should be, the organisational framework for biological information (see Taxonomy as Information Science).