Thursday, August 01, 2013

A use case for RDF in taxonomy

Readers of this blog will know that I'm sceptical about the current value of linked data and RDF in biodiversity informatics. But I came across an interesting paper on RDF and biocuration that suggests a good "use case" for RDF in constructing and curating taxonomic databases.

The paper is "Catching inconsistencies with the semantic web: a biocuration case study" (PDF here) by Jerven Bolleman and Sebastien Gehant. The basic idea is that errors in a database (in this case, UniProt) can be flagged by writing SPARQL queries that return results only if there is a problem (for example, if a sequence annotation is contradictory).

In recent posts I've been complaining about errors in the GBIF taxonomy, notably duplicate taxa that are synonyms. One way to tackle this would be to develop a set of SPARQL queries that we could use to flag potential problems. For example, if two names are objective synonyms then only one of them should be a node in the GBIF classification. If both exist then we have a problem. Likewise, if we know a name is a homonym of an older name, but the younger name still exists in the GBIF classification, then we could flag that as an issue. We could also construct queries that flag possible problems, even if we don't have precise information on synonymy. For example, in this post I noted that several frog species appear twice in the GBIF classification because GBIF has aggregated classifications that put these frogs in different genera. We could catch such cases by constructing a query to check whether the same species name (specific epithet) appears in different genera within the same family.
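To make that last check concrete, here is a minimal sketch in plain Python rather than SPARQL. It assumes the classification has already been flattened into (family, genus, specific epithet) records; the record layout, the function name `flag_shared_epithets`, and the placeholder names ("Aus bus" and friends) are all made up for illustration and are not GBIF data.

```python
from collections import defaultdict

# Hypothetical records: (family, genus, specific epithet).
# The names are nomenclatural placeholders, not real taxa.
records = [
    ("Xidae", "Aus", "bus"),
    ("Xidae", "Cus", "bus"),  # same epithet, different genus, same family
    ("Xidae", "Aus", "dus"),
    ("Yidae", "Eus", "bus"),  # different family, so not flagged
]


def flag_shared_epithets(records):
    """Map (family, epithet) to the sorted genera that share that epithet.

    Only epithets found in more than one genus of the same family are kept.
    """
    genera = defaultdict(set)
    for family, genus, epithet in records:
        genera[(family, epithet)].add(genus)
    return {key: sorted(found) for key, found in genera.items() if len(found) > 1}


flagged = flag_shared_epithets(records)
for (family, epithet), found in sorted(flagged.items()):
    print(f"{family}: epithet '{epithet}' appears in genera {', '.join(found)}")
```

A hit is only a candidate for inspection, not proof of a duplicate, since unrelated species in one family can legitimately share an epithet. The sketch also compares epithets exactly, so it would miss cases where the ending changed to agree with the gender of the new genus.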

The advantage of using RDF and SPARQL in this context is that the queries are portable. Assuming everyone uses the same vocabulary (e.g., the TDWG LSID vocabularies), queries can be constructed by one person (e.g., me) and then used by anyone who has their data in a triple store. We could develop a set of "taxonomy tests" that anyone could apply to their database.
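As a rough illustration of what a shared set of "taxonomy tests" might look like, the sketch below again uses plain Python as a stand-in for SPARQL over a triple store. Triples are simple (subject, predicate, object) tuples, and the predicates `ex:objectiveSynonymOf` and `ex:inClassification` are hypothetical placeholders, not terms from the TDWG vocabularies. The point is only that a test written against an agreed vocabulary runs unchanged on anyone's data.

```python
# Hypothetical predicates standing in for terms from a shared vocabulary.
IS_OBJECTIVE_SYNONYM_OF = "ex:objectiveSynonymOf"
IN_CLASSIFICATION = "ex:inClassification"


def objective_synonyms_both_present(triples):
    """Return pairs of objective synonyms that are both nodes in the classification."""
    present = {s for s, p, o in triples if p == IN_CLASSIFICATION}
    return sorted(
        (s, o)
        for s, p, o in triples
        if p == IS_OBJECTIVE_SYNONYM_OF and s in present and o in present
    )


# A registry of named tests; each takes a set of triples and returns the offenders.
TAXONOMY_TESTS = {
    "objective synonyms both in classification": objective_synonyms_both_present,
}


def run_tests(triples):
    """Apply every registered test to one dataset and collect the results."""
    return {name: test(triples) for name, test in TAXONOMY_TESTS.items()}


# Made-up dataset: name:1 and name:2 are objective synonyms and both are present.
triples = {
    ("name:1", IS_OBJECTIVE_SYNONYM_OF, "name:2"),
    ("name:1", IN_CLASSIFICATION, "classification:demo"),
    ("name:2", IN_CLASSIFICATION, "classification:demo"),
    ("name:3", IN_CLASSIFICATION, "classification:demo"),
}

report = run_tests(triples)
for name, offenders in report.items():
    print(name, "->", offenders)
```

An empty list for a test means nothing was flagged, which mirrors the convention in the paper of queries that return results only when something is wrong.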

This idea needs some more work, but it would be fun to play with some data and see how many kinds of errors or issues we can catch in this way.