Friday, July 17, 2020

Taxonomic concepts for dummies

[Work in progress]

The "dummy" in this case is me. I'm trying to make sense of how to model taxa, especially in the context of linked data, and projects such as Wikidata where there is uncertainty over just what a taxon in Wikidata actually represents. There is also ongoing work by the TDWG Taxon Names and Concepts Interest Group. This is all very rough and I'm still working on this, but here goes.

I'm going to assume that names and publications are fairly unproblematic. We have a "controlled thesaurus of scientific names" to use Franck Michel et al.'s phrase, provided by nomenclators, and we have publications. Both names and publications have identifiers.




Then we have "usages" or "instances", which are (name, publication) pairs. These can be of different level of precision, such as "this name appears in this book", or it "ppears on page 5", or in this paragraph, etc. I tend to view "instances" as like annotations, as if you highlighted a name in some text. Indeed, indexing the taxonomic literature and recording the location of each taxonomic name is essentially generating the set of all instances. We can have identifiers for each instance.



Then we can have relationships between instances. For example, in taxonomic publications you will often see a researcher list previous mentions of a name in other works, synonyms, etc. So we can have links between instances (which is one reason why they need identifiers).





OK, now comes the tricky bit. Up to now things are pretty simple, we have strings (names), we have sets of strings (publications), the location of strings in publications (usages/instances/annotations), and relationships between those instances (e.g., there may be a list of them on the same page in a publication). What about taxa?

It seems to me there are at least two different ways people model taxa in databases. The first is the classic Darwin Core spreadsheet style model of one unique name per row. If a name is "accepted" it has a parent, if it isn't "accepted" then it doesn't have a pointer to a parent, but does have a pointer to an accepted name:




If we treat each row as an instance (e.g., name plus original publication of that name), then the identifier for the instance is the same as the identifier for the taxon. A practical consequence of this is that if the name of a taxon changes, so does the identifier, therefore any data attached to that identifier can be orphaned (or, at least, needs to be reconnected) when there is taxonomic change.

In reality, the rows aren't really usages, they are simply names (mostly) without associated publications. We could be generous and model these as instances where we lack the publication, but basically we have rows of names and pointers between them. So taxa = accepted names. This is the model used by ITIS and GBIF.




Now, a different model is to wrap one or more instances into a set and call that a taxon, and the taxon points to an instance that is contains the currently accepted name. The taxon has a parent that is itself a taxon. This separates taxa and names, and has the practical consequence that we can have a taxon identifier that persists even if the taxonomic name changes. Hence data linked to the taxon remains linked even if the name changes. And every name applied to a taxon has the same taxon identifier. This makes it possible to track changes in classifications using the diff approach I've discussed earlier. This is the model used by (as far as I can make out) the Australian NSL (site seems perpetually offline), eBird, and Avibase, for example.



Wikidata has something very like GBIF, there is essentially no distinction between taxa and names. hence the identifiers associated with a Wikidata "taxon" can be identifiers for names (e.g., from IPNI) or for taxa (e.g., Avibase), with no clear distinction made between these rather different things.

So, what to do, and does this matter? Maybe part of the problem is the notion that identifiers for taxa should be unique, that is, only one Wikidata item can have an Avibase ID. This would work if a Wikidata taxon was, in fact, a taxon (and that it matched the Avibase classification). Perhaps we could treat Wikidata "taxa" as something like an instance, and label each Wikidata item that belongs in the same taxon (according to particular identifier being used) with the same taxon identifier?

More to come...