Tuesday, April 21, 2009

GBIF and Handles: admitting that "distributed" begets "centralized"

The problem with this ... is that my personal and unfashionable observation is that “distributed” begets “centralized.” For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.).
--Geoffrey Bilder interviewed by Martin Fenner

Thinking about the GUID mess in biodiversity informatics, stumbling across some documents about the PILIN (Persistent Identifier Linking INfrastructure) project, and still smarting from problems getting hold of specimen data, I thought I'd try and articulate one solution.

Firstly, I think biodiversity informatics has made the same mistake as digital librarians in thinking that people care where they get information from. We don't, in the sense that I don't care whether I get the information from Google or my local library, I just want the information. In this context local is irrelevant. Nor do I care about individual collections. I care about particular taxa, or particular areas, but not collections (likewise, I may care about philosophy, but not philosophy books at Glasgow University Library). I think the concern for local has led to an emphasis on providing complex software to each data provider that supports operations (such as search) that don't scale (live federated search simply doesn't work), at the expense of focussing on simple solutions that are easy to use.

In a (no doubt unsuccessful) attempt to think beyond what I want, let's imagine we have several people/organisations with interests in this area. For example:

Imagine I am an occasional user. I see a specimen referred to, say a holotype, and I want to learn more about that specimen. Is there some identifier I can use to find out more? I'm used to using DOIs to retrieve papers, so what about specimens? So, I want:
  1. identifiers for specimens so I can retrieve more information

Imagine I am a publisher (which can be anything from a major commercial publisher to a blogger). I want to make my content more useful to my readers, and I've noticed that others are doing this so I'd better get on board. But I don't want to clutter my content with fragile links -- and if a link breaks I want it fixed, or I want a cached copy (hence the use of WebCite by some publishers). If I want a link fixed I don't want to have to chase up individual providers, I want one place to go (as I do for references if a DOI breaks). So, I want:
  1. stable links with some guarantee of persistence
  2. somebody who will take responsibility to fix the broken ones

Imagine I am a data provider. I want to make my data available, but I want something simple to put in place (I have better things to do with my time, and my IT department keep a tight grip on the servers). I would also like to be able to show my masters that this is a good thing to do, for example by being able to present statistics on how many times my data has been accessed. I'd like identifiers that are meaningful to me (maybe carry some local "branding"). I might not be so keen on some central agency serving all my data as if it was theirs. So, I want:
  1. simplicity
  2. option to serve my own data with my own identifiers

Imagine I am a power user. I want lots of data, maybe grouped in ways that the data providers hadn't anticipated. I'm in a hurry, so I want to get this stuff quickly. So I want:
  1. convenient, fast APIs to fetch data
  2. flexible search interfaces would be nice, but I may just download it myself because it's probably quicker if I do it myself

Imagine I am an aggregator. I want data providers to have a simple harvesting interface so that I can grab the data. I don't need a search interface to their data because I can do it much faster if I have the data locally (federated search sucks). So I want:
  1. the ability to harvest all the data ("all your data are belong to me")
  2. a simple way to update my copy of a provider's data when it changes


It's too late in the evening for me to do this justice, but I think a reasonable solution is this:
  1. Individual data providers serve their data via URLs, ideally serving a combination of HTML and RDF (i.e., linked data), but XML would be OK
  2. Each record (e.g., specimen) has an identifier that is locally unique, and the identifier is resolvable (for example, by simply appending it to a URL)
  3. Each data provider is encouraged to reuse existing GUIDs wherever possible, (e.g., for literature (DOIs) and taxonomic names) to make their data "meshable"
  4. Data providers can be harvested, either completely, or for records modified after a given date
  5. A central aggregator (e.g., GBIF) aggregates all specimen/observation data. It uses Handles (or DOIs) to create GUIDs, comprising a naming authority (one for each data provider), and an identifier (supplied by the data provider, may carry branding, e.g. "antweb:casent0100367"), so an example would be "hdl:1234567/antweb:casent0100367" or "doi:10.1234/antweb:casent0100367". Note that this avoids labeling these GUIDs as, say, http://gbif.org/1234567/antweb:casent0100367
  6. Handles resolve to data provider URL, but cached aggregator copy of metadata may be used if data provider is offline
  7. Publishers use "hdl:1234567/antweb:casent0100367" (i.e., authors use this when writing manuscripts), as they can harass central aggregator if the links break
  8. Central aggregator is responsible for generating reports to providers of how their data has been used, e.g. how many times "cited" in literature
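
To make points 4 to 6 a bit more concrete, here is a rough Python sketch of how the pieces might fit together. It is only an illustration under some assumptions of mine: the names Record, make_handle and Aggregator are invented for this example, the provider URL is a placeholder, and the fetch function is passed in so the sketch runs without touching the network or a real Handle server.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Iterable, Optional


@dataclass
class Record:
    """One provider record (hypothetical shape, for illustration only)."""
    local_id: str    # provider's own identifier, e.g. "antweb:casent0100367"
    url: str         # provider URL that serves the record
    modified: date   # last modification date, used for incremental harvesting
    metadata: dict   # whatever the provider exposes (HTML/RDF/XML, parsed)


def make_handle(naming_authority: str, local_id: str) -> str:
    """Compose a GUID from a naming authority and a provider-supplied id."""
    return f"hdl:{naming_authority}/{local_id}"


class Aggregator:
    """Toy central aggregator: harvests records and resolves handles."""

    def __init__(self, fetch: Callable[[str], dict]):
        # fetch is injected (an assumption of this sketch) so that the
        # example can simulate a provider being online or offline.
        self.fetch = fetch
        self.cache: Dict[str, Record] = {}

    def harvest(self, naming_authority: str, records: Iterable[Record],
                since: Optional[date] = None) -> int:
        """Cache all records, or only those modified after `since`."""
        count = 0
        for rec in records:
            if since is not None and rec.modified <= since:
                continue
            self.cache[make_handle(naming_authority, rec.local_id)] = rec
            count += 1
        return count

    def resolve(self, handle: str) -> dict:
        """Try the provider first; fall back to the cached metadata."""
        rec = self.cache[handle]
        try:
            return self.fetch(rec.url)
        except ConnectionError:
            return rec.metadata


def provider_offline(url: str) -> dict:
    """Stand-in fetcher that always fails, simulating a provider outage."""
    raise ConnectionError(url)


if __name__ == "__main__":
    rec = Record("antweb:casent0100367",
                 "http://example.org/specimen/casent0100367",  # placeholder URL
                 date(2009, 4, 21), {"type": "holotype"})
    agg = Aggregator(fetch=provider_offline)
    agg.harvest("1234567", [rec])
    print(agg.resolve(make_handle("1234567", rec.local_id)))
```

The point of the fallback in resolve is that whoever follows the identifier never needs to know whether the answer came from the provider or from the aggregator's copy, which is really all a publisher cares about.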

So, GBIF (or whoever steps up to the plate) would use Handles (or DOIs). This gives them the tools to manage the identifiers, plus tells the world that we are serious about this. Publishers can trust that the links to millions of specimen records won't disappear. Providers don't have complex software to install, removing one barrier to making more data available.

I think it's time we made a serious effort to address these issues.

4 comments:

Anonymous said...

I don't think it is so much about providers thinking that consumers care about the quality of the sources. I think it is about providers who want to believe they have some special insight / ability that promotes them above the wisdom of the crowd.
The providers see themselves as special and they assume that everyone else will recognize their specialness. One example was a group of librarians that thought that everyone would recognize that their organizing of web sites based on the Library of Congress subject heading was inherently superior to what Google/Yahoo could do. This was pre-Google and they completely wasted this window of opportunity.

Roger Hyam said...

I am so fed up of these discussions. I can't believe I spend my time on them still.

We should just advocate everyone follows Linked Data best practice and be done with it. If people (GBIF or whoever) want to provide caching and other services on top of this that is great. You could call the URI and if it fails try calling an aggregator to see if they have a cached copy. It works with google. If a site is down I see if they have a cached copy. Publishers could say they will only support data that is cached in a trusted aggregator. Easy! Most importantly we don't have to mandate that anyone uses a particular service. All we have to do is help people use the technology they have now correctly. We allow initiative to compete. Sigh...

Rod Page said...

Roger,

I wrote the post after hardly any sleep, and nearly fell asleep writing it, so it's probably not very coherent.

I think we sort of agree. For providers, yes, just stick it up online, follow linked data principles, and be done with it. Of course, to make any sense of this we'll need some recommendations about how to do it (or, perhaps even better, some simple tools/examples). Otherwise, we will have to deal with a mess (albeit a linked data mess).

But I think there is scope for a central aggregation service. I'm not sure everything will be in Google's cache (what about stuff that hasn't been linked to, that isn't crawlable, etc.?). But a central aggregation service could offer services that would help link the stuff together (there's no incentive for individual providers to do this). Such services could include validation/cleaning of metadata, providing additional GUIDs to improve providers' metadata, tools to monitor service, and some traceability when things go offline (i.e., who do we call to fix this).

The Handles/DOI thing may be a bit of a distraction, but I can't quite give them up (until next week, at least).

Pete DeVries said...

This seems to be a big cultural shift from when I talked about the GeoSpecies Knowledge Base at ESA 2009. See http://about.geospecies.org/ At the time it seemed that I was a heretic.