Saturday, June 09, 2007

Rethinking LSIDs versus HTTP URI

The TDWG-GUID mailing list for this month has a discussion of whether TDWG should commit to LSIDs as the GUID of choice. Since the first GUID workshop TDWG has pretty much been going down this route, despite a growing chorus of voices (including mine) that LSIDs are not first class citizens of the Web, and don't play well with the Semantic Web.

Leaving aside political considerations (this stuff needs to be implemented as soon as possible, concerns that if TDWG advocates HTTP URIs people will just treat them as URLs and miss the significance of persistence and RDF, worries that biodiversity will be ghettoised if it doesn't conform what is going on elsewhere), I think there is a way to resolve this that may keep most people happy (or at least, they could live with it). My perspective is driven by trying to separate needs of primary data providers from application developers, and issues of digital preservation.

I'll try and spell out the argument below, but to cut to the chase, I will argue

  1. A GUID system needs to provide a globally unique identifier for an object, and a means of retrieving information about that object.

  2. Any of the current technologies we've discussed (LSIDs, DOIs, Handles) do this (to varying degrees), hence any would do as a GUID.

  3. Most applications that use these GUIDs will use Semantic Web tools, and hence will use HTTP URIs.

  4. These HTTP URIs will be unique to the application, the GUIDs however will be shared

  5. No third party application can serve an HTTP URI that doesn't belong to its domain.

  6. Digital preservation will rely on widely distributed copies of data, these cannot have the same HTTP URI.


From this I think that both parties to this debate are right, and we will end up using both LSIDs and HTTP URIs, and that's OK. Application developers will use HTTP URIs, but will use clients that can handle the various kinds of GUIDs. Data providers will use the GUID technology that is easiest for them to get up and running (for specimen this is likely to be LSIDs, for literature some providers may use Handles via DSpace, some may use URLs).

Individual objects get GUIDs

If individual objects get GUIDs, then this has implications for HTTP URIs. If the HTTP URI is the GUID, an object can only be served from one place. It may be cached elsewhere, but that cached copy can't have the same HTTP URI. Any database that makes use of the HTTP URI cannot serve that HTTP URI itself, it needs to refer to it in some way. This being the case, whether the GUID is a HTTP URI or not starts to look a lot less important, because there is only one place we can get the original data from -- the original data provider. Any application that builds on this data will need it's own identifier if people are going to make use of that application's output.

Connotea as an example

As a concrete example, consider Connotea. This application uses deferenceable GUIDs such as DOIs and Pubmed ids to retrieve publications. DOIs and Pubmed ids are not HTTP URIs, and hence aren't first class citizens of the Web. But Connotea serves its own records as HTTP URIs, and URIs with the prefix "rss" return RDF (like this) and hence can be used "as is" by Semantic Web tools such as Sparql.

If we look at some Connotea RDF, we see that it contains the original DOIs and Pubmed ids.


This means that if two Connotea users bookmark the same paper, we could deduce that they are the same paper by comparing the embedded GUIDs. In the same way, we could combine RDF from Connotea and another application (such as bioGUID) that has information on the same paper. Why not use the original GUIDs? Well, for starters there are two of them (info:pmid/17079492 and info:doi/10.1073/pnas.0605858103) so which to use? Secondly, they aren't HTTP URIs, and if they were we'd go straight to CrossRef or NCBI, not Connotea. Lastly, we loose the important information that the bookmarks are different -- they were made by two different people (or agents).

Applications will use HTTP URIs

We want to go to Connotea (and Connotea wants us to go to it) because it gives us additional information, such as the tags added by users. Likewise, bioGUID adds links to sequences referred to in the paper. Web applications that build on GUIDs want to add value, and need to add value partly because the quality of the original data may suck. For example, metadata provided by CrossRef is limited, DiGIR providers manage to mangle even basic things like dates, and in my experience many records provided by DiGIR sources that lack geocoordinates have, in fact, been georeferenced (based on reading papers about those specimens). The metadata associated with Handles is often appallingly bad, and don't get me started on what utter gibberish GenBank has in its specimen voucher fields.

Hence, applications will want to edit much of this data to correct and improve it, and to make that edited version available they will need their own identifiers, i.e. HTTP URIs. This ranges from social bookmarking tools like Connotea, to massive databases like FreeBase.

Digital durability

Digital preservation is also relevant. How do we ensure our digital records are durable? Well, we can't ensure this (see Clay Shirky's talk at LongNow), but one way to make them more durable is massive redundancy -- multiple copies in many places. Indeed, given the limited functionality of the current GBIF portal, I would argue that GBIFs main role at present is to make specimen data more durable. DiGIR providers are not online 24/7, but if their data are in GBIF those data are still available. Of course, GBIF could not use the same GUID as the URI for that data, like Connotea it would have to store the original GUID in the GBIF copy of the record.

In the same way, the taxonomic literature of ants is unlikely to disappear anytime soon, because a single paper can be in multiple places. For example, Engel et al.'s paper on ants in Cretaceous Amber is available in at least four places:

Which of the four HTTP URIs you can click on should be the GUID for this paper? -- none of them.

LSIDs and the Semantic Web

LSIDs don't play well with the Semantic Web. My feeling is that we should just accept this and move on. I suspect that most users will not interact directly with LSID servers, they will use applications and portals, and these will serve HTTP URIs which are ideal for Semantic Web applications. Efforts to make LSIDs compliant by inserting owl:sameAs statements and rewriting rdf:resource attributes using a HTTP proxy seem to me to be misguided, if for no other reason than one of the strengths of the LSID protocol (no single point of failure, other than the DNS) is massively compromised because if the HTTP proxy goes down (or if the domain name tdwg.org is sold) links between the LSID metadata records will break.

Having a service such as a HTTP proxy that can resolve LSIDs on the fly and rewrite the metadata to become HTTP-resolvable is fine, but to impose an ugly (and possibly short term) hack on the data providers strikes me as unwise. The only reason for attempting this is if we think the original LSID record will be used directly by Semantic web applications. I would argue that in reality, such applications may harvest these records, but they will make them available to others as part of a record with a HTTP URI (see Connotea example).

Conclusions


I think my concerns about LSIDs (and I was an early advocate of LSIDs, see doi:10.1186/1471-2105-6-48) stem from trying to marry them to the Semantic web, which seems the obvious technology for constructing applications to query lots of distributed metadata. But I wonder if the mantra of "dereferenceable identifiers" can sometime get in the way. ISBNs given to books are not, of themselves, dereferenceable, but serve very well as identifiers of books (same ISBN, same book), and there are tools that can retrieve metadata given an ISBN (e.g., LibraryThing).

In a world of multiple GUIDs for the same thing, and multiple applications wanting to talk about the same thing, I think clearly separating identifiers from HTTP URIs is useful. For an application such as Connotea, a data aggregator such GBIF, a database like FreeBase, or a repository like the Internet Archive, HTTP URIs are the obvious choice (If I use a Connotea HTTP URI I want Connotea's data on a particular paper). For GUID providers, there may be other issues to consider.

Note that I'm not saying that we can't use HTTP URIs as GUIDs. In some, perhaps many cases they may well be the best option as they are easy to set up. It's just that I accept that not all GUIDs need be HTTP URIs. Given the arguments above, I think the key thing is to have stable identifiers for which we can retrieve associated metadata. Data providers can focus on providing those, application developers can focus on linking them and their associated metadata together, and repackaging the results for consumption by the cloud.

2 comments:

Anonymous said...

"This means that if two Connotea users bookmark the same paper, we could deduce that they are the same paper by comparing the embedded GUIDs."

Hi Rod, I've been asking connotea to do this, I call it buggotea. They don't seem to be too keen though...

Roger Hyam said...

Hi Rod,

As usually I agree with nearly everything you say but...

I think you say that it is OK for there to be a URL and a sw-unresolvable URN like a DOI or LSID for an object. I agree fully with this.

Now as all these identifiers are opaque to the client the fact that the URL may embed the LSID/DOI doesn't matter. The client should not be able to differentiate between a URL that is a proxied LSID and a URL that isn't. It is just an http URL.

So I don't see how you can logically criticize the use of LSID proxies and simultaneously support multiple GUIDs. They are functionally indistinguishable from the client side.

[BTW: As we (TDWG) also advocate a minimum HTTP GET binding for LSIDs there is an interesting 'feature'. Any LSID will also have an associated, impermanent URL type GUID for its metadata i.e. the URL the getMetadata() method is bound to. There would be nothing to stop people using this in their metadata to differentiate between assertions made about the metadata itself and the object in question c.f. W3C HttpRange 14 debate. I think this is actually something to be encouraged]

All the best,

Roger