Wednesday, July 22, 2015

Steve Baskauf on RDF and the "Rod Page Challenge"

Rdf w3c icon 128Steve Baskauf has concluded a thoughtful series of blog posts on RDF and biodiversity informatics with http://baskauf.blogspot.co.uk/2015/07/confessions-of-rdf-agnostic-part-7.html. In this post he discussed the "Rod Page Challenge", which was a series of grumpy posts I wrote (starting with this one) where I claimed RDF basically sucked, and to illustrate this I issued a challenge for people to do something interesting with some RDF I provided. Since this RDF didn't have a stable home I've put it on GitHub and it has a DOI http://dx.doi.org/10.5281/zenodo.20990 courtesy of GitHub's integration with Zenodo.

I argued that the RDF typically available was basically useless because it wasn't adequately linked (see Reflections on the TDWG RDF "Challenge"). Two of the RDF files I provided were created specifically created to tackle this problem (derived from my projects iPhylo Linkout http://dx.doi.org/10.1371/currents.RRN1228 and the precursor to BioNames http://dx.doi.org/10.7717/peerj.190). This marked pretty much the end of any interest I had in pursuing RDF.

Towards the end of Steve's post he writes:

At the close of my previous blog post, in addition to revisiting the Rod Page Challenge, I also promised to talk about what it would take to turn me from an RDF Agnostic into an RDF Believer. I will recap the main points about what I think it will take in order for the Rod Page Challenge to REALLY be met (i.e. for machines to make interesting inferences and provide humans with information about biodiversity that would not be obvious otherwise):

  1. Resource descriptions in RDF need to be rich in triples containing object properties that link to other IRI-identified resources.
  2. "Discovery" of IRI-identified resources is more likely to lead to interesting information when the linked IRIs are from Internet domains controlled by different providers.
  3. Materialized entailed triples do not necessarily lead to "learning" useful things. Materialized entailed triples are useful if they allow the construction of more clever or meaningful queries, or if they state relationships that would not be obvious to humans.

Steve's point 1 is essentially the point I was making with the challenge. At the time of the challenge, RDF from major biodiversity informatics projects was in silos, with few (if any) links to external resources (the kinds of things Steve refers to in his point 2). As a result, the promised benefits from RDF simply haven't materialised. The lesson I took from this is that we need rich, dense cross-links between data sources (the "biodiversity knowledge graph"), and that's one reason I've been obsessed with populating BioNames, which links animal names to the primary literature (I'm planning to extend this to plants as well). Turns out , creating lots of cross links is really hard work, much harder than simply pumping out a bunch of RDF and waiting for it to automagically coalesce into an all-connected knowledge graph.

I posed the challenge back in 2011, and since then I think the landscape has changed to the extent that I wonder if trying to "fix" RDF is really the way forward.

XML is dead

Anyone (sane) developing for the web and wanting to move data around is using JSON, XML is hideous and best avoided. Much of the early work on RDF used XML, which only made things even harder than they already were. JSON beats XML, to the extent that RDF itself now has a JSON serialisation, JSON-LD. But JSON-LD is about more than the semantic web (see JSON-LD and Why I Hate the Semantic Web), and has the great advantage that you can actually ignore all the RDF cruft (i.e., the namespaces) and simply treat the data as key-value pairs (yay!). Once you do that, then you can have fun with the data, especially with databases such as CouchDB ("fun" and "database" in the same sentence, I know!).

Key-value pairs, document stores, and graph databases

The NoSQL "movement" has thrown up all sorts of new ways to handle data and to think about databases. We can think of RDF as describing a graph, but it carries the burden of all the namespaces, vocabularies, and ontologies that come with it. Compare that with the fun (there's that word again) of graph databases such as Neo4J with its graph gists. The Neo4J folks have made a great job of publicising their approach, and making it easy and attractive to play with.

So, we're in a interesting time when there are a bunch of technologies available, and I think maybe it's time to ask whether the community's allegiance to RDF and the Semantic Web has been somewhat misplaced...