Friday, July 10, 2009

NCBI taxonomy, TDWG vocabularies, and RDF


Lately I've been returning to playing with RDF and triple stores. This is a serious case of déjà vu, as two blogs I've now abandoned will testify (bioGUID and SemAnt). Basically, a combination of frustration with the tools, data cleaning, and the lack of identifiers got in the way of making much progress. I gave up on triple stores for a while, rolling my own Entity–Attribute–Value (EAV) database, which I used for the Elsevier Challenge (EAV databases are essentially key-value databases, CouchDB being a well-known example).

Now, I'm revisiting triple stores and SPARQL, partly because Linked Data is gaining momentum, and partly because we now have a few LSID providers and some decent vocabularies from TDWG. Having created an LSID resolver that plays nicely with Linked Data (it also does the same thing for DOIs), it's time to dust off SPARQL and see what can be done.

One reason there's interest in having GUIDs and standard vocabularies is so that we can link different sources of information together. But more than just linking, we should be able to compute across these links and learn new things, or at least add annotations from one database to another.

To make this concrete, take the NCBI taxon 101855, Lulworthia uniseptata. If we visit the NCBI page we see links to other resources, such as Index Fungorum record 105488, which tells us that Lulworthia uniseptata was published in Trans. Mycol. Soc. Japan 25(4): 382 (1984), and that the current name is Lulwoana uniseptata, which was published in Mycol. Res. 109(5): 562 (2005).

Wouldn't it be nice to be able to automatically link these things together? And wouldn't it be nice to have identifiers for the literature, rather than only human-readable text strings? Using bioGUID, we can discover that Mycol. Res. 109(5): 562 (2005) has the DOI doi:10.1017/S0953756205002716 -- I haven't found Trans. Mycol. Soc. Japan 25(4): 382 (1984) online anywhere.

Now, given that we have LSIDs for Index Fungorum, I can resolve urn:lsid:indexfungorum.org:names:369395 and discover that

urn:lsid:indexfungorum.org:names:369395 tname:hasBasionym urn:lsid:indexfungorum.org:names:105488

and, I can add the statement

urn:lsid:indexfungorum.org:names:369395 tcommon:publishedInCitation doi:10.1017/S0953756205002716
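
To show how these two statements can be joined, here is a minimal sketch in plain Python. It is not a real triple store: the triples are just tuples, the predicates are kept as the prefixed names used above rather than full URIs, and the helper names (match, later_names_with_citations) are made up for illustration. It assumes both statements are about the same subject, urn:lsid:indexfungorum.org:names:369395.

```python
# Each statement is a (subject, predicate, object) tuple. Predicates use the
# prefixed names from the post (tname:, tcommon:), not expanded URIs.
CURRENT_NAME = "urn:lsid:indexfungorum.org:names:369395"  # Lulwoana uniseptata
BASIONYM = "urn:lsid:indexfungorum.org:names:105488"      # Lulworthia uniseptata

triples = [
    (CURRENT_NAME, "tname:hasBasionym", BASIONYM),
    (CURRENT_NAME, "tcommon:publishedInCitation", "doi:10.1017/S0953756205002716"),
]


def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def later_names_with_citations(store, basionym):
    """For a basionym, find names based on it and where each was published."""
    results = []
    for name, _, _ in match(store, p="tname:hasBasionym", o=basionym):
        citations = [
            c for _, _, c in match(store, s=name, p="tcommon:publishedInCitation")
        ]
        results.append((name, citations))
    return results


for name, citations in later_names_with_citations(triples, BASIONYM):
    print(name, citations)
```

Starting from the name NCBI links to (105488), the lookup walks back along hasBasionym to the newer name and picks up its DOI, which is the kind of hop a SPARQL query would do over a real store.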

What I'd like to do is link this to the NCBI taxon, so that I can display this additional knowledge in one place (i.e., there is an additional name for this fungus, and where it is published). To do this, I need the NCBI taxonomy in RDF. Turns out that everyone and their dog has been generating RDF versions of the NCBI taxonomy, including UniProt (source of the diagram above). The problem is that each effort creates its own project-specific vocabulary. For example, here is the record for NCBI taxon 101855 in UniProt RDF (http://www.uniprot.org/taxonomy/101855):


<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
    <rank rdf:resource="http://purl.uniprot.org/core/Species"/>
    <scientificName>Lulworthia uniseptata</scientificName>
    <otherName>Zalerion maritimum</otherName>
    <rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
    <partOfLineage>false</partOfLineage>
  </rdf:Description>
</rdf:RDF>


UniProt has its own vocabulary, http://purl.uniprot.org/core/. So, what I'd like to do is create a version of the NCBI taxonomy using TDWG's TaxonConcept vocabulary, so that it becomes straightforward to link NCBI to name databases such as Index Fungorum, IPNI, ZooBank, and ION that are serving taxon names.
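
As a rough sketch of what such a conversion might look like, the Python below reads a trimmed copy of the UniProt record for taxon 101855 and re-expresses it as triples in a TDWG-style vocabulary. Several things here are assumptions rather than anything settled: the TaxonConcept namespace URI, the choice of nameString and rankString as target predicates, keeping rdfs:subClassOf for the parent link, and the function name uniprot_to_tdwg. It uses only the standard library, so the output is a list of tuples rather than serialized RDF.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the UniProt record (XML declaration and unused
# namespaces omitted to keep the example short).
UNIPROT_RDF = """<rdf:RDF xmlns="http://purl.uniprot.org/core/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
    <rank rdf:resource="http://purl.uniprot.org/core/Species"/>
    <scientificName>Lulworthia uniseptata</scientificName>
    <otherName>Zalerion maritimum</otherName>
    <rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
  </rdf:Description>
</rdf:RDF>"""

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
UP = "http://purl.uniprot.org/core/"
# ASSUMPTION: namespace and predicate names for TDWG's TaxonConcept vocabulary.
TC = "http://rs.tdwg.org/ontology/voc/TaxonConcept#"


def uniprot_to_tdwg(xml_text):
    """Map UniProt taxon descriptions to (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for node in root.findall(f"{{{RDF}}}Description"):
        subject = node.get(f"{{{RDF}}}about")
        triples.append((subject, RDF + "type", TC + "TaxonConcept"))

        name = node.findtext(f"{{{UP}}}scientificName")
        if name:
            triples.append((subject, TC + "nameString", name))

        rank = node.find(f"{{{UP}}}rank")
        if rank is not None:
            # Use the last path segment of the UniProt rank URI, lowercased.
            rank_label = rank.get(f"{{{RDF}}}resource").rsplit("/", 1)[-1].lower()
            triples.append((subject, TC + "rankString", rank_label))

        parent = node.find(f"{{{RDFS}}}subClassOf")
        if parent is not None:
            triples.append(
                (subject, RDFS + "subClassOf", parent.get(f"{{{RDF}}}resource"))
            )
    return triples


for triple in uniprot_to_tdwg(UNIPROT_RDF):
    print(triple)
```

The sketch deliberately ignores otherName; deciding how synonyms should map onto name records is exactly the part that needs links to Index Fungorum and friends.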