Monday, August 17, 2020

Taxonomic concepts continued: iNaturalist

Following on from my earlier post ("Taxonomic concepts for dummies"), Beckett Sterner commented:

Maybe one productive use case would be to look at what it would take for wikidata to handle taxa (=name+concept) in a way that didn't lose relevant taxonomic information when receiving content from platforms like iNaturalist that has a fairly sophisticated strategy https://www.inaturalist.org/pages/taxon_frameworks

iNaturalist is interesting, but I'm not convinced that it is internally consistent. As a quick rule of thumb, I look for patterns in how name changes relate to taxon identifier changes. In the first pattern, a database retains the same taxon identifier (the columns) even if the names (rows) change, as in eBird or Avibase. For example, if we move a species from one genus to another, the name changes but (arguably) the taxon itself doesn't (it's still the same set of organisms, just moved to a different part of our classification or, if you like, tagged differently).

[Figure: grid of taxa (columns) against names (rows), showing one taxon with two names]

Then we can have cases like this, where the name (row) is the same but the taxon changes. This might be where we split a taxon in two, and one remaining part retains the original name. So you could argue that the taxon has changed (i.e., in composition) even if the name hasn't.

[Figure: grid of taxa (columns) against names (rows), showing one name shared by two taxa]

Now, many taxonomic databases seem to do something different: every time the name changes we get a new identifier, even if the taxon bearing that name hasn't changed (i.e., it has the same set of organisms as before). So we get a name change and a taxon identifier change:

[Figure: grid of taxa (columns) against names (rows), showing two names, each with its own taxon identifier]

Because databases take different approaches to name and taxon identifiers, life can get complicated, especially for projects such as Wikidata that try to synthesise information across all of these databases.

I haven't looked in detail at iNaturalist, but I have found cases of both

[Figure: grid of taxa (columns) against names (rows)]

and

[Figure: grid of taxa (columns) against names (rows)]

In some cases iNaturalist will change a taxon identifier even if the name remains the same. For example, the "Thrush-like Schiffornis" Schiffornis turdina https://www.inaturalist.org/taxa/8793 has been split into five taxa, one of which bears the same scientific name (Schiffornis turdina https://www.inaturalist.org/taxa/513975). Given that the composition of Schiffornis turdina has changed, there is an argument to be made that its taxon identifier should change, which is what iNaturalist does.

So it looks like iNaturalist is using its "taxa" identifiers to identify taxa. But then we have cases such as the transfer of the African piculet Sasia africana https://www.inaturalist.org/taxa/18393 to Verreauxia africana https://www.inaturalist.org/taxa/792894, or the transfer of Heraclides rumiko https://www.inaturalist.org/taxa/428606 to Papilio rumiko https://www.inaturalist.org/taxa/509627. In both cases nothing has changed about those species, yet the identifiers have changed (for an example of a "true" taxon identifier, note that NCBI has the same identifier for Sasia africana/Verreauxia africana and for Heraclides rumiko/Papilio rumiko).
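To make the rule of thumb concrete, here is a minimal Python sketch that compares a (name, identifier) pair before and after a change and reports which pattern it matches. The `Record` type and `classify_change` function are hypothetical names of my own, not part of any iNaturalist API; the identifiers are the iNaturalist taxon ids from the URLs above.

```python
from typing import NamedTuple


class Record(NamedTuple):
    name: str      # scientific name
    taxon_id: int  # identifier assigned by the database


def classify_change(before: Record, after: Record) -> str:
    """Report which of the name/identifier patterns a change matches."""
    name_changed = before.name != after.name
    id_changed = before.taxon_id != after.taxon_id
    if name_changed and id_changed:
        return "name changed, identifier changed"
    if name_changed:
        return "name changed, identifier kept"
    if id_changed:
        return "name kept, identifier changed"
    return "no change"


# The split of Schiffornis turdina: same name, new identifier.
split = classify_change(
    Record("Schiffornis turdina", 8793),
    Record("Schiffornis turdina", 513975),
)

# The transfer of the African piculet: new name, new identifier.
transfer = classify_change(
    Record("Sasia africana", 18393),
    Record("Verreauxia africana", 792894),
)

print(split)
print(transfer)
```

Run over a whole database export, a tally of these labels would give a quick impression of which identifier policy a database actually follows, and whether it follows it consistently.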

My sense is that part of the problem is that we are trying to overload identifiers, which in and of themselves don't tell us much. For example, some might argue that any change in an entity requires a change in the entity's identifier, because the underlying thing has changed. Others might argue that such a change risks making things harder to find (for example, how do we now connect the earlier version of a thing with the newer version, given that the identifier has changed?). In the case of taxonomy, I think we could possibly avoid some of this grief if we acknowledge that names and identifiers have their limits, and that we should decouple them from trying to track changes in the things they point to. Rather, what we could really do with is a timestamped versioning system where we can ask "OK, in 1960, what did this genus look like?". Likewise, when looking at a system such as Wikidata, we shouldn't expect to have a complete view of every taxonomic opinion ever held. But we could aim for a current "snapshot".
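Here is a minimal sketch, in Python, of what such a timestamped query might look like. Everything in it is made up for illustration: the `Placement` record, the `genus_members` function, the placeholder genera "Aus" and "Bus", the species keys, and the years. It assumes each species has a stable key that does not depend on its name, and that each placement in a genus records the years during which it was accepted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Placement:
    species: str                    # stable species key (hypothetical)
    genus: str                      # genus the species was placed in
    valid_from: int                 # first year the placement was accepted
    valid_to: Optional[int] = None  # last year (inclusive); None = still current


def genus_members(placements: List[Placement], genus: str, year: int) -> List[str]:
    """Return the species placed in `genus` in the given year."""
    return sorted(
        p.species
        for p in placements
        if p.genus == genus
        and p.valid_from <= year
        and (p.valid_to is None or year <= p.valid_to)
    )


# Made-up history: sp1 moves from Aus to Bus in 1980; sp2 stays in Aus.
PLACEMENTS = [
    Placement("sp1", "Aus", 1900, 1979),
    Placement("sp1", "Bus", 1980),
    Placement("sp2", "Aus", 1950),
]

print(genus_members(PLACEMENTS, "Aus", 1960))  # ['sp1', 'sp2']
print(genus_members(PLACEMENTS, "Aus", 2020))  # ['sp2']
```

The point of the sketch is that neither the name nor the identifier has to carry the history: the species key stays put, and the question "what did this genus look like in 1960?" is answered by the dated placements.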

Update

I asked a question about this on the iNaturalist forum, and it appears that iNaturalist treats taxon ids essentially as rows in a database: each time a name gets added you get a new integer id, and if you decide that a taxon has fundamentally changed (e.g., a species is split into two or more taxa) then you add that new taxon, generating a new integer id.
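In other words, the behaviour described is roughly that of an auto-incrementing primary key. A toy Python sketch of that reading, using a hypothetical `TaxonTable` class and small made-up integers rather than real iNaturalist ids:

```python
import itertools


class TaxonTable:
    """Toy table in which every added row gets the next integer id."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.rows = {}  # id -> scientific name

    def add(self, name):
        row_id = next(self._ids)
        self.rows[row_id] = name
        return row_id


table = TaxonTable()

# A renamed species: the new name is a new row, so it gets a new id.
old_id = table.add("Sasia africana")
new_id = table.add("Verreauxia africana")

# A split species: the name is reused, but the changed taxon is a new row.
broad_id = table.add("Schiffornis turdina")
narrow_id = table.add("Schiffornis turdina")
```

Under this model both kinds of change mint a new id, which would account for the two patterns seen above.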