Monday, April 20, 2009

Semantic Publishing: towards real integration by linking

PLoS Computational Biology has recently published "Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article" (doi:10.1371/journal.pcbi.1000361) by David Shotton and colleagues. As a proof of concept, they took Reis et al. (doi:10.1371/journal.pntd.0000228) and "semantically enhanced" it:
These semantic enhancements include provision of live DOIs and hyperlinks; semantic markup of textual terms, with links to relevant third-party information resources; interactive figures; a re-orderable reference list; a document summary containing a study summary, a tag cloud, and a citation analysis; and two novel types of semantic enrichment: the first, a Supporting Claims Tooltip to permit “Citations in Context”, and the second, Tag Trees that bring together semantically related terms. In addition, we have published downloadable spreadsheets containing data from within tables and figures, have enriched these with provenance information, and have demonstrated various types of data fusion (mashups) with results from other research articles and with Google Maps.
The enhanced article is here: doi:10.1371/journal.pntd.0000228.x001. For background on these enhancements, see also David's companion article "Semantic publishing: the coming revolution in scientific journal publishing" (doi:10.1087/2009202, PDF preprint available here). The process is summarised in the figure below (Fig. 10 from Shotton et al., doi:10.1371/journal.pcbi.1000361.g010).



While there is lots of cool stuff here (see also Elsevier's Article 2.0 Contest, and the Grand Challenge, for which David is one of the judges), I have a couple of reservations.

The unique role of the journal article?

Shotton et al. argue for a clear distinction between journal article and database, in contrast to the view articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article and that the two are converging. I tend to favour the latter viewpoint. Indeed, as I argued in my Elsevier Challenge entry (doi:10.1038/npre.2008.2579.1), I think we should publish articles (and indeed data) as wikis, so that we can fix the inevitable errors. We can always roll back to the original version if we want to see the author's original paper.

Real linking

But my real concern is that the example presented is essentially "integration by linking", that is, the semantically enhanced version gives us lots of links to other information, but these are regular hyperlinks to web pages. So, essentially we've gone from pre-web documents with no links, to documents where the bibliography is hyperlinked (most online journals), to documents where both the bibliography and some terms in the text are hyperlinked (a few journals, plus the Shotton et al. example). I'm a tad underwhelmed.
What bothers me about this is:
  1. The links are to web pages, so it will be hard to do computation on these (unless the web page has easily retrievable metadata)
  2. There is no reciprocal linking -- the resource being linked to doesn't know it is the target of the link


Web pages are for humans

The first concern is that the marked-up article is largely intended for human readers. Yes, there are associated metadata files in RDF N3, but the core "added value" is really only of use to humans. For it to be of use to a computer, the links would have to go to resources that the computer can understand. A human clicking on many of these links will get a web page they can interpret, but computers are thick and they need a little help. For example, one hyperlinked term is Leptospira spirochete, linked to the uBio namebank record (click on the link to see it). The link resolves to a web page, so it's not much use to a computer (unless it has a scraper for uBio HTML). Ironically, uBio serves LSIDs, so we could retrieve RDF metadata for this name (urn:lsid:ubio.org:namebank:255659), but there's nothing in the uBio web page that tells the computer that.
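To make this concrete, here's a minimal Python sketch of what a crawler actually gets when it follows such a link. The uBio URL pattern is my assumption (any HTML landing page makes the same point); the namebank id is the one from the LSID above.

```python
# Follow a link target and inspect what the machine gets back.
import urllib.request

# Assumed uBio namebank URL pattern -- illustrative only
url = "http://www.ubio.org/browser/details.php?namebankID=255659"

with urllib.request.urlopen(url) as response:
    content_type = response.headers.get("Content-Type", "")
    body = response.read()

if "html" in content_type:
    # text/html is fine for a human, but nothing here tells the crawler that
    # this page is "about" urn:lsid:ubio.org:namebank:255659, let alone that
    # RDF metadata exists for that name
    print("Got %d bytes of HTML -- a dead end for the machine" % len(body))
else:
    print("Got something potentially machine-readable:", content_type)
```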

Of course, Shotton et al. aren't responsible for the fact that most web pages aren't easily interpreted by computers, but simply embedding links to web pages isn't a big leap forward. What could they have done instead? One approach is to link to resources that are computer-readable. For example, instead of linking the term "Oswaldo Cruz Foundation" to that organisation's home page (http://www.fiocruz.br/cgi/cgilua.exe/sys/start.htm?tpl=home), why not use the DBpedia URI http://dbpedia.org/page/Instituto_Oswaldo_Cruz? Now we get both a human-readable page, and extensive RDF that a computer can use. In other words, if we crawl the semantically enhanced PLoS article with a program, I want that crawler to be able to follow the links and still get useful information, not the dead end of an HTML web page. Quite a few of the institutions listed in the enhanced paper have DBpedia URIs.
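As a sketch of the difference this makes, a program can ask a DBpedia URI for RDF via ordinary HTTP content negotiation (the /page/ URL above is the human-readable view; the corresponding /resource/ URI redirects machines to the data):

```python
# Request RDF from a DBpedia URI using content negotiation.
import urllib.request

uri = "http://dbpedia.org/resource/Instituto_Oswaldo_Cruz"
request = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

# urllib follows DBpedia's 303 redirect to the data document
with urllib.request.urlopen(request) as response:
    print(response.geturl())                     # the RDF document we ended up at
    print(response.headers.get("Content-Type"))  # should be an RDF media type
    rdf = response.read()                        # triples a program can parse

print("%d bytes of RDF" % len(rdf))
```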


Why does this matter? Well, if you use DBpedia URIs you get RDF, plus you get connections with the Linked Data crowd, who are rapidly linking diverse data sets together.
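For example, once a term in the article is a DBpedia URI, a program can query the public SPARQL endpoint (http://dbpedia.org/sparql) for everything asserted about it, including owl:sameAs links out into other Linked Data sets. A small sketch:

```python
# Query DBpedia's SPARQL endpoint for triples about one institution.
import json
import urllib.parse
import urllib.request

query = """
SELECT ?p ?o WHERE {
  <http://dbpedia.org/resource/Instituto_Oswaldo_Cruz> ?p ?o .
} LIMIT 20
"""

url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"}
)

with urllib.request.urlopen(url) as response:
    results = json.load(response)

# Standard SPARQL JSON results: each binding is one property/value pair
for binding in results["results"]["bindings"]:
    print(binding["p"]["value"], "->", binding["o"]["value"])
```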


I think this is where we need to be headed, and with a little extra effort we can get there, once we move on from thinking solely about human readers.

An alternative approach (and one that I played with in my Challenge entry, as well as my ongoing wiki efforts) is to create what Vandervalk et al. term a "semantic warehouse" (doi:10.1093/bib/bbn051). Information about each object of interest is stored locally, so that clicking on a link doesn't take you off-site into the world wide wilderness, but to information about that object. For example, the page for the paper Mitochondrial paraphyly in a polymorphic poison frog species (Dendrobatidae; D. pumilio) lists the papers cited, clicking on one takes you to the page about that paper. There are limitations to this approach as well, but the key thing is that one could imagine doing computations over this (e.g., computing citation counts for DNA sequences, or geospatial queries across papers) that simple HTML hyperlinking won't get you.
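As a sketch of why the warehouse approach pays off, imagine storing typed links between objects in a local database; queries like "citation counts for DNA sequences" then become one-liners. All identifiers below are placeholders (10.1000 is the DOI example prefix), not real citation data:

```python
# A toy "semantic warehouse": typed links stored locally, queried directly.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE links (source TEXT, relation TEXT, target TEXT)")
db.executemany(
    "INSERT INTO links VALUES (?, ?, ?)",
    [
        # placeholder DOIs and accession numbers, for illustration only
        ("doi:10.1000/paper1", "cites_sequence", "AY000001"),
        ("doi:10.1000/paper2", "cites_sequence", "AY000001"),
        ("doi:10.1000/paper2", "cites_sequence", "AY000002"),
    ],
)

# Citation counts per sequence: a computation that plain HTML hyperlinks
# scattered across publishers' sites won't give you
for sequence, count in db.execute(
    "SELECT target, COUNT(*) FROM links WHERE relation = 'cites_sequence' "
    "GROUP BY target ORDER BY COUNT(*) DESC"
):
    print(sequence, count)
```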

Reciprocal links

The other big issue I have with the Shotton et al. "integration by linking" is that it is one-way. The semantically enhanced paper "knows" that it links to, say, the uBio record for Leptospira, but uBio doesn't know this. It would enhance the uBio record if it knew that doi:10.1371/journal.pntd.0000228.x001 linked to it.

Links are inherently reciprocal, in the sense that if paper 1 cites paper 2, then paper 2 is cited by paper 1.
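In other words, the "cited by" view is just the inverse of the "cites" view; the catch is that computing the inverse requires someone to hold both ends of every link in one place. A trivial sketch with placeholder identifiers:

```python
# Derive "cited by" by inverting one-way "cites" links.
from collections import defaultdict

cites = {
    "paper1": ["paper2", "paper3"],  # placeholder identifiers
    "paper2": ["paper3"],
}

cited_by = defaultdict(list)
for source, targets in cites.items():
    for target in targets:
        cited_by[target].append(source)

print(dict(cited_by))  # {'paper2': ['paper1'], 'paper3': ['paper1', 'paper2']}
```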

Publishers understand this, and the web page of an article will often show lists of papers that cite the paper being displayed. How do we do this for data and other objects of interest? If we database everything, then it's straightforward. CrossRef is storing citation metadata and offers a "forward linking" service; some publishers (e.g., Elsevier and HighWire) offer their own versions of this. In the same way, this record for GenBank sequence AY322281 "knows" that it is cited by (at least) two papers because I've stored those links in a database. Knowing that you're being linked to dramatically enhances discoverability. If I'm browsing uBio I gain more from the experience if I know that the PLoS paper cites Leptospira.

Knowing when you're being linked to

If we database everything locally then reciprocal linking is easy. But, realistically, we can't database everything (OK, maybe that's not strictly true; we can think of Google as a database of everything). The enhanced PLoS paper "knows" that it cites the uBio record, but how can the uBio record "know" that it has been cited by the PLoS paper? What if the act of linking were reciprocal? How can we achieve this in a distributed world? Some possibilities:
  • we have an explicit API embedded in the link so that uBio can extract the source of the link (could be spoofed, need authentication?)
  • we use OpenURL-style links that embed the PLoS DOI, so that uBio knows the source of the link (OpenURL is a mess, but potentially very powerful)
  • uBio uses the HTTP referrer header to get the source of the link, then parses the PLoS HTML to extract metadata and the DOI (ugly screen scraping, but no work for PLoS; see the sketch below)
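Here's a rough sketch of that third option, from the point of view of the linked-to site. Everything here is hypothetical (uBio does nothing of the sort, and record_incoming_link and store_reciprocal_link are made-up names):

```python
# On an incoming click, use the Referer header to find and fetch the
# referring page, then scrape a DOI out of its HTML.
import re
import urllib.request

# Crude DOI pattern -- good enough for a sketch
DOI_PATTERN = re.compile(r"10\.\d{4,}/[^\s\"'<>]+")

def record_incoming_link(referer_url):
    """Fetch the page that linked to us and try to extract its DOI."""
    if not referer_url:
        return None  # browser sent no Referer; the link's source stays unknown
    with urllib.request.urlopen(referer_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    match = DOI_PATTERN.search(html)
    return match.group(0) if match else None

# Called from the web application's request handler, e.g.:
#   doi = record_incoming_link(request.headers.get("Referer"))
#   if doi:
#       store_reciprocal_link(doi, this_record)  # hypothetical storage call
```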

Obviously this needs a little more thought, but I think that real integration by linking requires that the resources being linked are both computer and human readable, and that both resources know about the link. This would create much more powerful "semantically enhanced" publications.

3 comments:

Ed said...

The doi:10.1371/journal.pntd.000022 you mentioned should be doi:10.1371/journal.pntd.0000228. Is that an argument for making DOIs more memorable?

Roderic Page said...

Nope, it's an argument for me to check my links before I press Publish -- it was a cut and paste error. Thanks for catching it.

Eric said...

Are DBpedia URIs workable as institution identifiers? My current affiliation has no DBpedia entry, and at present I would be prevented from creating one by the "notability" criterion.