Thursday, September 08, 2022

Local global identifiers for decentralised wikis

I've been thinking a bit about how one could use a Markdown wiki-like tool such as Obsidian to work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).

One "gotcha" would be how to name pages. If we treat the database as entirely local, then the page names don't matter, but what if we envisage sharing the database, or merging it with others (for example, if we divided a taxon up into chunks, and different people worked on those different chunks)?

This is the attraction of globally unique identifiers. You and I can independently work on the same thing, such as data linked to scientific paper, safe in the knowledge that if we both use the DOI for that paper we can easily combine what we've done. But global identifiers can also be a pain, especially if we need to use a service to look them up ("is there a DOI for this paper?", "what is the LSID for this taxonomic name?").

Life would be easier if we could generate identifiers "locally", but had some assurance that they would be globally unique, and that anyone else generating an identifier for the same thing would arrive at the same identifier (this eliminates things such as UUIDs which are intentionally designed to prvent people genrrating the same identifier). One approach is "content addressing" (see, e.g. Principles of Content Addressing - dead link but in the Wayabck Machine, see also btrask/stronglink). For example, we can generate a cryptographic hash of a file (such as a PDF) and use that as the identifier.

Now the problem is that we have globally unique, but ugly and unfriendly identifiers (such as "6c98136eba9084ea9a5fc0b7693fed8648014505"). What we need are nice, easy to use identifiers we can use as page names. Wikispecies serves as a possible role model, where taxon names serve as page names, as do simplified citations (e.g., authors and years). This model runs into the problem that taxon names aren't unique, nor are author + year combinations. In Wikispecies this is resolved by having a centralised database where it's first come, first served. If there is a name clash you have to create a new name for your page. This works, but what if you have multiple databases un by different people? How do we ensure the identifiers are the same?

Then I remembered Roger Hyam's flight of fantasy over a decade ago: SpeciesIndex.org – an impractical, practical solution. He proposed the following rules to generate a unique URI for a taxonomic name:

  • The URI must start with "http://speciesindex.org" followed by one or more of the following separated by slashes.
  • First word of name. Must only contain letters. Must not be the same as one of the names of the nomenclatural codes (icbn or iczn). Optional but highly recommended.
  • Second word of name. Must only contain letters and not be a nomenclatural code name. Optional.
  • Third word of name. Must only contain letters and not be a nomenclatural code name. Optional.
  • Year of publication. Must be an integer greater than 1650 and equal to or less than the current year. If this is an ICZN name then this should be the year the species (epithet) was published as is commonly cited after the name. If this is an ICBN name at species or below then it is the date of the combination. Optional. Recommended for zoological names if known. Not recommended for botanical names unless there is a known problem with homonyms in use by non-taxonomists.
  • Nomenclatural code governing the name of the taxon. Currently this must be either 'icbn' or 'iczn'. This may be omitted if the code is unknown or not relevant. Other codes may be added to this list.
  • Qualifier This must be a Version 4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for a taxon for which the conventional name is unknown or does not exist or to indicate a particular taxon concept that bears the embedded name.
  • The whole speciesindex.org URI string should be considered case sensitive. Everything should be lower case apart from the first letter of words that are specified as having upper case in their relevant codes e.g. names at and above the rank of genus.

Roger is basically arging that while names aren't unique (i.e., we have homonyms such as Abronia) they are pretty close to being so, and with a few tweaks we can come up with a unique representation. Another way to think about this if we had a database of all taxonomics, we could construct a trie and for each name find the shortest set of name parts (genus, species, etc), year, and code that gave us a unique string for that name. In many cases the species name may be all we need, in other cases we may need to add year and/or nomenclatural code to arrive at a unique string.

What about bibliographic references? Well many of us will have databases (e.g., Endnote, Mendeley, Zotero, etc.) which generate "cite keys". These are typically short, memorable identifiers for a reference that are unique within that database. There is an interesting discussion on the JabRef forum regarding a "Universal Citekey Generator", and source code is available cparnot/universal-citekey-js. I've yet to explore this in detail, but it looks a promising way to generate unique identifiers from basic metadata (echos of more elaborate schemes such as SICIs). For example,

Senna AR, Guedes UN, Andrade LF, Pereira-Filho GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda: Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.
becomes "Senna:2021ck". So if two people have the same, core, metadata for a paper they can generate the same key.

Hence it seems with a few conventions (and maybe some simple tools to support them) we could have decentralised wiki-like tools that used the same identifiers for the same things, and yet those identfiiers were short and human-friendly.