Wednesday, August 19, 2020

Taxonomic concepts continued: All change at ALA and AFD

Continuing my struggles with taxa (see Taxonomic concepts continued: iNaturalist) I now turn to the Atlas of Living Australia (ALA) and the Australian Faunal Directory (AFD), which have perhaps the most fluid taxon identifiers ever. In 2018 I downloaded data from ALA and AFD and used it to create a knowledge graph ("Ozymandias", see GBIF Challenge Entry: Ozymandias and https://ozymandias-demo.herokuapp.com for the web interface to the knowledge graph).

One thing I discovered is that the taxon identifiers used by ALA change... a lot. It almost feels as if every time I revisit Ozymandias and compare it to the ALA, things have changed. For example, here is the fly species Acupalpa albimanis (Kröber, 1914) which you can see at https://ozymandias-demo.herokuapp.com/?uri=https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:02b0821a-abad-4996-aae0-741472c6ad06.





The "https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:02b0821a-abad-4996-aae0-741472c6ad06" part of the Ozymandias URL is the URL for this species in the Atlas of Living Australia. Well, it was at the time I built Ozymandias (2018). Now (19 August 2020), it is https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:69e1b774-9875-4ff1-ba20-5c4eeed866dc. If you put https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:02b0821a-abad-4996-aae0-741472c6ad06 into your web browser, you will get redirected to the new URL (under the hood you get an HTTP 302 response with the "Location" header value set to https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:69e1b774-9875-4ff1-ba20-5c4eeed866dc).
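
If you want to watch those redirects without a browser, something like the following Python sketch would do. It is only an illustration, not how Ozymandias or ALA does anything: the function names (http_head, redirect_chain) are made up for this example, and it assumes the server answers HEAD requests the same way it answers a browser's GET.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def http_head(url):
    """Return (status, Location header or None) for a single request.

    Assumption: the server handles HEAD requests; swap in GET if it does not.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=30) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def redirect_chain(url, fetch=http_head, max_hops=10):
    """Follow redirects one hop at a time and return every URL visited."""
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        chain.append(urljoin(chain[-1], location))
    return chain
```

Calling redirect_chain() on the 2018 ALA URL should return a list that starts with the old URL and ends with whatever ALA currently regards as the URL for the taxon. The fetch parameter exists so the logic can be exercised without touching the network.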

So, seemingly, our notion of what Acupalpa albimanis (Kröber, 1914) is has changed since 2018. In fact, ALA itself is out of date, because if you replace the "https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:" bit with "https://biodiversity.org.au/afd/taxa/" you get taken to the AFD site, which is the source of the data for ALA. But AFD says that "https://biodiversity.org.au/afd/taxa/69e1b774-9875-4ff1-ba20-5c4eeed866dc" is old news and is (as of 28 February 2020) "https://biodiversity.org.au/afd/taxa/6f91e39e-8d73-4133-901c-7f4ba1771e30".
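
The prefix swap described above is simple enough to write down. This is a minimal sketch in Python; the function name ala_to_afd is invented for the example, and it assumes that every ALA species URL for an AFD taxon starts with exactly the prefix quoted in the text.

```python
ALA_PREFIX = "https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:"
AFD_PREFIX = "https://biodiversity.org.au/afd/taxa/"


def ala_to_afd(ala_url):
    """Rewrite an ALA species URL into the corresponding AFD taxon URL."""
    if not ala_url.startswith(ALA_PREFIX):
        raise ValueError("not an ALA URL for an AFD taxon: " + ala_url)
    # Whatever follows the prefix is the UUID that both sites share.
    return AFD_PREFIX + ala_url[len(ALA_PREFIX):]
```

Note that this only rewrites the string. It says nothing about whether AFD still considers the resulting UUID current, which is the whole problem.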

So the identifier for this taxon keeps changing. For the life of me I can't figure out why. If I compare the CSV file I downloaded on 23 February 2018 with one I downloaded today (19 August 2020, CSV file is available from https://biodiversity.org.au/afd/taxa/THEREVIDAE/names/csv) the only differences are the time stamps for NAME_LAST_UPDATE and TAXON_LAST_UPDATE, and the UUIDs used for the NAME_GUID, TAXON_GUID, PARENT_TAXON_GUID, and CONCEPT_GUID fields. So, the administrivia has changed. Other than that, the data in the two files for this fly are identical, so why the change in identifiers? It seems bizarre to create identifiers that regularly change (and then have to maintain the associated redirects to try and keep the old identifiers functional) when the data itself seems unchanged.
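
A comparison of this kind can be scripted by dropping the six "administrivia" columns and checking whether anything else differs. The Python sketch below is one way to do it, not necessarily how the comparison above was made; the function names are invented, and it assumes both downloads are ordinary comma-separated files with a header row containing the field names listed above.

```python
import csv

# Columns named in the text as the only ones that differ between downloads.
VOLATILE = {
    "NAME_LAST_UPDATE",
    "TAXON_LAST_UPDATE",
    "NAME_GUID",
    "TAXON_GUID",
    "PARENT_TAXON_GUID",
    "CONCEPT_GUID",
}


def stable_rows(handle):
    """Read a CSV file object and return its rows minus the volatile columns.

    Rows are sorted so that a change in row order does not count as a change.
    """
    reader = csv.DictReader(handle, restval="")
    rows = []
    for row in reader:
        kept = sorted(
            (key, value)
            for key, value in row.items()
            if key is not None and key not in VOLATILE
        )
        rows.append(tuple(kept))
    return sorted(rows)


def same_apart_from_identifiers(old_handle, new_handle):
    """True if two downloads differ only in GUIDs and update time stamps."""
    return stable_rows(old_handle) == stable_rows(new_handle)
```

To use it, open the 2018 and 2020 downloads with open(path, newline="", encoding="utf-8") and pass both file objects to same_apart_from_identifiers(). The encoding is a guess, so adjust it if the files turn out to be something else.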

Now, AFD isn't the only project to regularly change identifiers; an older version of the Catalogue of Life also did this, although it wasn't always clear to me how they did that - see Catalogue of Life and LSIDs: a catalogue of fail and the paper "Identifying and relating biological concepts in the Catalogue of Life" https://doi.org/10.1186/2041-1480-2-7.

As I disappear further down this rabbit hole (why oh why did I start doing this?) I'm beginning to suspect part of the issue here is versioning (argh!). What we are seeing is the various ways people are trying to cope with that: the problem of which identifiers should change, how and when they should change, and what part of the information about a taxon a given taxon identifier should point to (the name, a use of that name, the underlying concept, the database record that tracks a taxon, the latest version of that record, etc.). Given that different databases tackle these issues differently (and not always consistently), the notion that we can easily map these identifiers to each other, and to third-party identifiers such as those in Wikidata, seems a bit, um, optimistic...