Wednesday, February 29, 2012

Making biodiversity data sticky: it's all about links

Who invented velcro?

Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is that I want data to be "sticky", like the burrs shown in the photo above (Who invented velcro? by A-dep). Shared identifiers are like the hooks on the burrs: if two pieces of data have the same identifier, they will stick together. Given enough identifiers and enough data, we could rapidly assemble a "ball" of interconnected data. The diagram below, which I published as part of my Elsevier Challenge entry (preprint, published version), summarises some of the links between diverse kinds of biological data:
While in principle many of these links should be trivial to create, in practice they aren't. One major obstacle is the lack of globally unique identifiers, or, where such identifiers exist, the fact that they aren't being used. As a result, our data is anything but sticky. In the absence of identifiers, creating links between different data sets can be a significant undertaking. One way to tackle this is to focus on just one kind of link at a time and create a database of those links. The diagram below shows some of the links I've been working on:
For example, the iPhylo Linkout project creates links between taxon concepts in NCBI and Wikipedia. The iTaxon project is a mapping between taxonomic names and publications. I've briefly explored mapping host-parasite relationships using GenBank, and I'm currently exploring the links between publications and specimens. This list certainly doesn't exhaust the set of possible links, but it's a start. The challenge is to create sufficient links for biodiversity data to finally coalesce and for us to be able to ask questions that span multiple sources and types of data.
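To make the "sticky" idea concrete, here is a minimal sketch in Python of how records that share identifiers could coalesce into balls of connected data. It assumes each record is just a name with a set of identifier strings. The record names and identifiers are made up for illustration, and this is not the code behind any of the projects mentioned above; it simply merges records that share an identifier, directly or through a chain of other records.

```python
from collections import defaultdict


def cluster_by_shared_identifiers(records):
    """Group records that share an identifier, directly or transitively.

    `records` maps a record name to the set of identifiers it carries.
    Returns a sorted list of clusters, each a sorted list of record names.
    """
    # Union-find: every record starts out in its own cluster.
    parent = {name: name for name in records}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # The first record seen with an identifier acts as the "hook":
    # any later record carrying the same identifier sticks to it.
    first_seen = {}
    for name, identifiers in records.items():
        for ident in identifiers:
            if ident in first_seen:
                union(first_seen[ident], name)
            else:
                first_seen[ident] = name

    clusters = defaultdict(set)
    for name in records:
        clusters[find(name)].add(name)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical records: all names and identifiers are invented for this example.
records = {
    "sequence-1": {"taxon:T1", "pub:P1"},
    "wikipedia-page-1": {"taxon:T1"},
    "specimen-1": {"pub:P1"},
    "specimen-2": {"locality:X"},  # shares nothing, so it stays on its own
}

balls = cluster_by_shared_identifiers(records)
for ball in balls:
    print(ball)
```

In this toy data the sequence, the Wikipedia page and the first specimen end up in one cluster because the sequence shares a taxon identifier with one and a publication identifier with the other, while the second specimen, which has no shared identifier, is left unconnected. That is the situation much biodiversity data is in today.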