Wednesday, October 04, 2017

TDWG 2017: thoughts on day 3

Day three of TDWG 2017 highlighted some of the key obstacles facing biodiversity informatics.

After a fun series of "wild ideas" (nobody will easily forget David Bloom's "Kill your Darwin Core darlings") we had a wonderful keynote by Javier de la Torre (@jatorre) entitled "Everything happens somewhere, multiple times". Javier is CEO and founder of Carto, which provides tools for amazing geographic visualisations. Javier provided some pithy observations on standards, particularly the fate of official versus unofficial "community" standards (the community standards tend to be simpler, easier to use, and hence win out), and the potentially stifling effects standards can have on innovation, especially if conforming to standards becomes the goal rather than merely a feature.

The session "Using Big Data Techniques to Cross Dataset Boundaries - Integration and Analysis of Multiple Datasets" demonstrated the great range of things people want to do with data, but made little progress on integration. It still strikes me as bizarre that we haven't made much progress on minting and reusing identifiers for the same entities that we keep referring to. Channeling Steve Ballmer:

Identifiers, identifiers, identifiers, identifiers

It's also striking to contrast Javier de la Torre's work at Carto, which has a clear customer-driven focus (we need these tools to deliver this to users so that they can do what they want to do), with the much less focussed approach of our community. Many of the things we aspire to won't happen until we identify some clear benefits for actual users. There's a tendency to build stuff for our own purposes (e.g., pretty much everything I do) or build stuff that we think people might/should want, but very little building stuff that people actually need.

TDWG also has something of an institutional memory problem. Franck Michel gave an elegant talk entitled "A Reference Thesaurus for Biodiversity on the Web of Linked Data" which discussed how the Muséum national d'Histoire naturelle's taxonomic database could be modelled in RDF (see for example http://taxref.mnhn.fr/lod/taxon/60878/10.0). There's a more detailed description of this work in the PDF embedded in the original post.

What struck me was how similar this was to the now deprecated TDWG LSID vocabulary, still used by most of the major taxonomic name databases (the nomenclators). This is an instance where TDWG had a nice, workable solution that lapsed into oblivion, only to be subsequently reinvented. This isn't to take anything away from Franck's work, which has a thorough discussion of the issues and a nice way to handle the difference between two kinds of assertion: that two taxa are the same (owl:equivalentClass), and that a taxon and a taxon/name hybrid (which is what many databases serve up because they don't distinguish between names and taxa) might be the same (linking via the name they both share).
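To make that distinction concrete, here is a minimal sketch in Python that treats RDF triples as plain tuples. It is not taken from the talk or from any real database: the prefixed identifiers (dbA:, dbB:, dbC:, name:) and the ex:hasName predicate are hypothetical, chosen only to illustrate the difference between a strong equivalence assertion and a weaker link through a shared name.

```python
# All identifiers below are hypothetical and exist only for illustration.
OWL_EQUIVALENT_CLASS = "owl:equivalentClass"
HAS_NAME = "ex:hasName"  # assumed predicate, not from the talk or any vocabulary

TRIPLES = {
    # Two databases that both model taxa: equivalence is asserted directly.
    ("dbA:taxon/1", OWL_EQUIVALENT_CLASS, "dbB:taxon/9"),
    # A taxon and a taxon/name hybrid record each point at the name they carry.
    ("dbA:taxon/1", HAS_NAME, "name:Aus_bus"),
    ("dbC:record/42", HAS_NAME, "name:Aus_bus"),
}


def equivalent_taxa(subject, triples):
    """Return everything asserted equivalent to subject, in either direction."""
    found = set()
    for s, p, o in triples:
        if p != OWL_EQUIVALENT_CLASS:
            continue
        if s == subject:
            found.add(o)
        elif o == subject:
            found.add(s)
    return found


def possibly_same_via_name(subject, triples):
    """Return other records that share a name with subject (a weaker claim)."""
    names = {o for s, p, o in triples if s == subject and p == HAS_NAME}
    return {
        s
        for s, p, o in triples
        if p == HAS_NAME and o in names and s != subject
    }


strong = equivalent_taxa("dbA:taxon/1", TRIPLES)
weak = possibly_same_via_name("dbA:taxon/1", TRIPLES)
```

In this toy data the direct assertion connects dbA:taxon/1 to dbB:taxon/9, while the hybrid record dbC:record/42 is only reachable by going through the shared name, so a consumer can decide how much weight to give each kind of link.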

The fate of the RDF served by the nomenclators for the last decade illustrates a point I keep returning to (see also EOL Traitbank JSON-LD is broken). We tend to generate data and standards because it's the right thing to do, rather than because there's actually a demonstrable need for that data and those standards.