Thursday, February 09, 2006

Globally Unique Identifiers

I attended the TDWG-GUID workshop on Globally Unique Identifiers (GUIDs) held at NESCent, which has issued a report. Essentially, the aim of this work is to deploy globally unique identifiers for digital objects in biodiversity informatics, such as taxon names, specimen records, and images. The workshop settled on LSIDs (Life Science Identifiers), which is a sensible choice.
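For readers who haven't met one, an LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision. The sketch below just splits one into its parts. The parse_lsid helper and the example.org identifier are made up for illustration and are not taken from any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The parts of an LSID: urn:lsid:authority:namespace:object[:revision]."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts (hypothetical helper, not a library call)."""
    parts = text.strip().split(":")
    # Expect "urn", "lsid", then three mandatory parts and one optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The "urn:lsid" prefix is matched case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    return LSID(*parts[2:])


# Made-up identifier, for illustration only.
example = parse_lsid("urn:lsid:example.org:names:12345")
```

The authority part is what a resolver would use to find the service that hands back the data and metadata for the object.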

LSIDs have been around for a while, and there is considerable software support from IBM (see their project on SourceForge). I've used them in my Taxonomic Search Engine. Not everybody is thrilled by LSIDs (see Anyone using LSID? on NodalPoint).

DOIs and Handles were also considered. I have flirted with Handles (see my comments on the iSpecies blog). DOIs have some useful properties, especially stable infrastructure, management tools, and immediate utility to the publishing industry, although they are not cheap. George Garrity uses them in his NamesforLife© project (doi:10.1601/tx.0). In the long term, the biodiversity community might benefit from thinking seriously about DOIs. The German Science Foundation has invested in providing free DOIs to the German scientific community (see Publication and Citation of Scientific Primary Data). There's also a certain irony in a blog posting talking about GUIDs and rejecting DOIs, when every reference to an external publication is made using, you guessed it, a DOI.

Regarding the workshop itself, at times I wanted to gnaw off parts of my body to retain sanity. As a result I was pretty obnoxious. My frustration stemmed partly from a feeling that the TDWG community seems determined to make life hard for itself by placing obstacles in its own path whenever possible. They also have a lot invested in XML schema, which I regard as misguided (that's being polite). Anybody who thinks XML schema are the answer to our problems should read "From XML to RDF: how semantic web technologies will change the design of 'omic' standards" (doi:10.1038/nbt1139). I nearly lost it when there was discussion of adopting LSIDs but serving the metadata in XML schema. This defeats the whole point of LSIDs. By serving RDF, we can do inference; in particular, we can easily aggregate RDF into triple stores. Populating a database becomes as easy as resolving the LSID and sucking down the metadata. Consequently, data integration suddenly looks a lot more tractable. Indeed, from the perspective of RDF, LSIDs are just another Uniform Resource Identifier (URI), albeit one which consistently resolves to RDF.
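To make the aggregation point concrete, here is a minimal sketch of a toy in-memory triple store. It assumes the metadata for two LSIDs has already been resolved and parsed into (subject, predicate, object) tuples. The LSIDs, the "ex:" predicates, the "Aus bus" name, and the TripleStore class are all invented for illustration; a real system would use a proper RDF library and a real vocabulary.

```python
class TripleStore:
    """A toy triple store: just a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self.triples = set()

    def add_all(self, triples):
        # Merging is a set union, so triples repeated across sources collapse.
        self.triples.update(triples)

    def match(self, s=None, p=None, o=None):
        # None acts as a wildcard for that position.
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


# Made-up LSIDs standing in for a taxon name and a specimen record.
NAME = "urn:lsid:example.org:names:1"
SPECIMEN = "urn:lsid:example.net:specimens:42"

# Hypothetical metadata from two independent providers.
FROM_NAME_PROVIDER = [
    (NAME, "dc:title", "Aus bus"),
    (NAME, "ex:rank", "species"),
]
FROM_MUSEUM = [
    (SPECIMEN, "ex:identifiedAs", NAME),
    (NAME, "dc:title", "Aus bus"),  # same statement as the name provider's
]

store = TripleStore()
for batch in (FROM_NAME_PROVIDER, FROM_MUSEUM):
    store.add_all(batch)

# Which specimens have been identified as this name?
specimens = [s for s, _, _ in store.match(p="ex:identifiedAs", o=NAME)]
```

Neither provider needed to know about the other's schema. Because both use the same identifier for the name, their statements join up as soon as they land in the same store.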

As the workshop drew to a close, I began to feel that one reason people just didn't "get" LSIDs and RDF was that there were no really cool examples of what can be done with the technology. If you just look at RDF serialised as XML, then it's not obvious what the big deal is. So we serve a different form of XML, so what? This is a little like my first impression of XML -- it just seemed like a more fussy version of HTML, so what was all the hype about? Once you see the power of the tools associated with XML (such as the parsers, XSLT and XPath), then you see the point. It can make exchanging and processing data a lot easier, and XSLT style sheets are just way kewl. The difference between XML and RDF is of this order. So, what we need are some cool applications combining LSIDs, metadata, and triple stores to show people just why this is so much more powerful than the XML schema that have obsessed the TDWG community for so long.