Wednesday, May 09, 2007

Catalogue of Life design flaw

A bit more browsing of the Catalogue of Life annual checklist for 2007 reveals a rather annoying feature that, I think, cripples the Catalogue's utility. With each release the checklist grows in size. From their web site:
The Species 2000 & ITIS Catalogue of Life is planned to become a comprehensive catalogue of all known species of organisms on Earth by the year 2011. Rapid progress has been made recently and this, the seventh edition of the Annual Checklist, contains 1,008,965 species.

However, with each release the identifiers for each taxon change. For example, if I were to link to the record for the pea crab Pinnotheres pisum this year (2007), I would link to record 3803555, but last year I would have linked to 872170. Record 872170 no longer exists in the 2007 edition.

So, what would a user who based their taxonomic database on the Catalogue of Life do? All their links would break, not only because the URL interface has changed but because the underlying identifiers have changed as well. It's as if the authors of the catalogue have been oblivious to the discussion on globally unique identifiers (GUIDs) and the need for stable, persistent identifiers.

Anybody building a database that gets updated, and possibly rebuilt, needs to think about how their identifiers will change. If identifiers are simply the primary keys in a table, then they will likely be unstable unless great care is taken. Alternatively, databases that are essentially aggregations of data available elsewhere could use GUIDs as the primary keys. This means that even if the database is restructured, the keys (and hence the identifiers) don't change. For the user, everything still works.
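
To make the difference concrete, here is a minimal Python sketch. It is not how the Catalogue of Life is actually built; the two "releases", the helper functions, and the `urn:lsid:example.org:...` identifiers are all made up for illustration. It rebuilds a tiny table twice, once numbering rows afresh on each rebuild and once keying rows by a GUID that travels with the source record.

```python
# Hypothetical GUIDs, invented for this example only.
PEA_CRAB = "urn:lsid:example.org:taxon:pinnotheres-pisum"
SHORE_CRAB = "urn:lsid:example.org:taxon:carcinus-maenas"
EDIBLE_CRAB = "urn:lsid:example.org:taxon:cancer-pagurus"

# Two hypothetical releases; the second adds one species.
release_2006 = [
    (PEA_CRAB, "Pinnotheres pisum"),
    (SHORE_CRAB, "Carcinus maenas"),
]
release_2007 = release_2006 + [(EDIBLE_CRAB, "Cancer pagurus")]


def build_with_serial_keys(records):
    """Rebuild from scratch, numbering rows 1..n in alphabetical order.

    The number a name receives depends on what else is in the release.
    """
    names = sorted(name for _, name in records)
    return {name: key for key, name in enumerate(names, start=1)}


def build_with_guid_keys(records):
    """Rebuild from scratch, keyed by the GUID carried by each record."""
    return {guid: name for guid, name in records}


serial_2006 = build_with_serial_keys(release_2006)
serial_2007 = build_with_serial_keys(release_2007)
guid_2006 = build_with_guid_keys(release_2006)
guid_2007 = build_with_guid_keys(release_2007)

# The serial key for the pea crab shifts when a species is added.
print(serial_2006["Pinnotheres pisum"], serial_2007["Pinnotheres pisum"])

# The GUID resolves to the same record in both releases.
print(guid_2006[PEA_CRAB] == guid_2007[PEA_CRAB])
```

A link stored against the serial key from the first release points at a different species (or nothing) after the rebuild, whereas a link stored against the GUID survives it.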

Despite the favourable press about its progress (e.g., doi:10.1038/news050314-6, Environmental Research Web, and CNN), I think the catalogue needs some serious rethinking if it is to be genuinely useful. For more on this, see my earlier posting on how the catalogue handles literature.

Image of Pinnotheres pisum by Hans Hillewaert obtained from Wikimedia Commons.


David said...

This has long been a problem with Species 2000 and ITIS. This is why it is far better to use uBio's web services to pull classifications and their associated LSIDs, which are stable. However, what's not clear to me is how uBio will deal with this issue in their ClassificationBank, which is supposed to act as an aggregation of synonymies, etc. Since the Catalogue of Life will necessarily be the scaffolding upon which the Encyclopedia of Life will be built, this will be really interesting.
