Wednesday, September 21, 2011

Linked data that isn't: the failings of RDF

OK, a bit of hyperbole in the morning. One of the goals of RDF is to create the Semantic Web, an interwoven network of data seamlessly linked by shared identifiers and shared vocabularies. Everyone uses the same identifiers for the same things, and when they describe these things they use the same terms. Simples.

Of course, the reality is somewhat different. Typically people don't reuse identifiers, and there are usually several competing vocabularies we can chose from. To give a concrete example, consider two RDF documents describing the same article, one provided by CiNii, the other by CrossRef. The article is:

Astuti, D., Azuma, N., Suzuki, H., & Higashi, S. (2006). Phylogenetic Relationships Within Parrots (Psittacidae) Inferred from Mitochondrial Cytochrome-b Gene Sequences(Phylogeny). Zoological science, 23(2), 191-198. doi:10.2108/zsj.23.191

You can get RDF for a CiNii record by appending ".rdf" to the URL for the article, in this case http://ci.nii.ac.jp/naid/130000017049. For CrossRef you need a Linked Data compliant client, or you can do something like this:


curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.2108/zsj.23.191"

You can view the RDF from these two sources here and here.

No shared identifiers
The two RDF documents have no shared identifiers, or at least, any identifiers they do share aren't described in a way that is easily discovered. The CrossRef record knows nothing about the CiNii record, but the CiNii document includes this statement:


<rdfs:seeAlso rdf:resource="http://ci.nii.ac.jp/lognavi?name=crossref
&amp;id=info:doi/10.2108/zsj.23.191" dc:title="CrossRef" />

So, CiNii knows about the DOI, but this doesn't help much as the CrossRef document has the URI "http://dx.doi.org/10.2108/zsj.23.191", so we don't have an explicit statement that the two documents refer to the same article.

The other shared identifier the documents could share is the ISSN for the journal (0289-0003), but CiNii writes this without the "-", and uses the PRISM term "prism:issn", so we have:


<prism:issn>02890003</prism:issn>


whereas CrossRef writes the ISSN like this:


<ns0:issn xmlns:ns0="http://prismstandard.org/namespaces/basic/2.1/">
0289-0003</ns0:issn>


Unless we have a linked data client that normalises ISSNs before it does a SPARQL query we will miss the fact that these two articles are in the same journal.

Inconsistent vocabularies
Both CiNii use the PRISM vocabulary to describe the article, but they use different versions. CrossRef uses "http://prismstandard.org/namespaces/basic/2.1/" whereas CiNii uses "http://prismstandard.org/namespaces/basic/2.0/". Version 2.1 versus version 2.0 is a minor difference, but the URIs are different and hence they are different vocabularies (having version numbers in vocabulary URIs is asking for trouble). Hence, even if CiNii and CrossRef wrote ISSNs in the same way, we'd still not be able to assert that the articles come from the same journal.
Inconsistent use of vocabularies
Both CiNii use FOAF for author names, but they write the names differently:


<foaf:name xml:lang="en">Suzuki Hitoshi</foaf:name>


<ns0:name xmlns:ns0="http://xmlns.com/foaf/0.1/">Hitoshi Suzuki</ns0:name>


So, another missed opportunity to link the documents. One could argue this would be solved if we had consistent identifiers for authors, but we don't. In this case CiNii have their own local identifiers (e.g. http://ci.nii.ac.jp/nrid/1000040179239), and CrossRef has a rather hideous looking Skolemisation: http://id.crossref.org/contributor/hitoshi-suzuki-2gypi8bnqk7yy.

In summary, it's a mess. Both CiNii and CrossRef organisations are whose core business is bibliographic metadata. It's great that both are serving RDF, but if we think this is anything more than providing metadata in a useful format I think we may be deceiving ourselves.