Wednesday, April 10, 2013

Time to put taxonomy into GitHub

Donald Hobern drew my attention to the nice way iNaturalist displays taxonomic splits:

[Image: iNaturalist]
In this example, observations identified as Rhipidura fuliginosa are being split into Rhipidura fuliginosa and Rhipidura albiscapa. This immediately reminds me of an idea that keeps circulating, namely using version control tools to manage taxonomic classifications. Some years ago David Shorthouse proposed exactly this (see "Taxonomic Consensus as Software Creation"). I discussed the idea in "Taxonomy on a hard disk", and Pierre Lindenbaum has an interesting post on treating the NCBI taxonomy as a file system, "A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy".

The idea is that a taxonomy, such as the GBIF backbone taxonomy, could be placed in GitHub where people could clone it, annotate, correct, edit, or otherwise mess with it, then GBIF could pull in those edits and release an updated, cleaner taxonomy. If software version control seems a bit esoteric, it's worth noting that use of GitHub is rapidly becoming much more mainstream in science, and not just for software development. People are using it to store versions of data analyses (e.g., https://github.com/dwinter/Fungal-Foray) and to write manuscripts collaboratively (e.g., https://github.com/weecology/data-sharing-paper). The journal eLife is depositing articles there (e.g., https://github.com/elifesciences/elife-articles). In addition to all the infrastructure GitHub provides (the ability to identify who did what and when, to roll back changes, to fork classifications, etc.) there is also the attraction of not creating yet more software, but simply editing a classification by moving folders around on your local filesystem. The idea seems irresistible…
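To make the "moving folders around" idea a little more concrete, here is a minimal Python sketch of a classification stored as nested directories, where reclassifying a taxon is just a folder move. The function names (`build_taxonomy`, `move_taxon`, `list_taxa`) and the "Unplaced" holding folder are my own hypothetical choices for illustration, not anything GBIF or iNaturalist actually uses.

```python
import shutil
import tempfile
from pathlib import Path


def build_taxonomy(root: Path, lineages: list[list[str]]) -> None:
    """Create one nested folder per lineage, e.g. ["Rhipidura", "Rhipidura fuliginosa"]."""
    for lineage in lineages:
        root.joinpath(*lineage).mkdir(parents=True, exist_ok=True)


def move_taxon(root: Path, taxon: list[str], new_parent: list[str]) -> Path:
    """Reclassify a taxon (and everything below it) by moving its folder."""
    src = root.joinpath(*taxon)
    dst_parent = root.joinpath(*new_parent)
    dst_parent.mkdir(parents=True, exist_ok=True)
    dst = dst_parent / src.name
    if dst.exists():
        # Refuse to merge silently; a real workflow would need a merge policy.
        raise FileExistsError(f"{dst} already exists")
    shutil.move(str(src), str(dst))
    return dst


def list_taxa(root: Path) -> list[str]:
    """Return every taxon folder as a sorted, slash-separated path."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_taxonomy(root, [
            ["Rhipidura", "Rhipidura fuliginosa"],
            ["Unplaced", "Rhipidura albiscapa"],  # hypothetical holding folder
        ])
        move_taxon(root, ["Unplaced", "Rhipidura albiscapa"], ["Rhipidura"])
        for taxon in list_taxa(root):
            print(taxon)
```

One caveat if you try this inside a real repository: git tracks files rather than empty directories, so each taxon folder would need at least one file in it (a small metadata file, say) for the move to show up in the history.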