iPhylo: PLoS Hubs

Roderic D. M. Page

Wednesday, July 10, 2013

The challenge of semantically marking up articles (more thoughts on PLoS Hubs)

Here's a sketch of how to realise something like the original PLoS Hubs vision (see The demise of the @PLoS Biodiversity Hub: what lessons can we learn? for background). In the blog post explaining the vision behind PLoS Hubs (Aggregating, tagging and connecting biodiversity studies), David Mindell wrote:

The Hub is a work in progress and its potential benefits are many. These include enhancing information exchange and databasing efforts by adding links to existing publications and adding semantic tagging of key terms such as species names, making that information easier to find in web searches. Increasing linkages and synthesizing of biodiversity data will allow better analyses of the causes and consequences of large scale biodiversity change, as well as better understanding of the ways in which humans can adapt to a changing world.

So, first up, some basic principles. The goal (only slightly tongue in cheek) is for the user to never want to leave the site. Put another way, I want something as engaging as Wikipedia, where almost all the links are to other Wikipedia pages (more on this below). This means every object (publication, taxon name, DNA sequence, specimen, person, place) gets a "page". Everything is a first class citizen.

It seems to me that the fundamental problem faced by journals that "semantically enhance" their content by adding links is that those links go somewhere else (e.g., to a paper on another journal's site, to an external database). So, why would you do this? Why actively send people away? Google can do this because you will always come back. Other sites don't have this luxury. In the case of citations (i.e., adding DOIs to the list of literature cited) I guess the tradeoff is that because all journals are in this together, you will receive traffic from your "rivals" if they publish papers that cite your content. So you are building a network that will (potentially) drive traffic to you (as well as send people away). You are building a network across which traffic can flow (i.e., the "citation graph").

If we add other sorts of links (say to GenBank, taxon databases, specimen databases like GBIF, locality databases such as Protected Planet, etc.) then that traffic has no obvious way of coming back. This also has a deeper consequence: those databases don't "know" about these links, so we lose the reverse citation information. For example, a semantically marked-up paper may know that it cites a sequence, but that sequence doesn't know that it has been cited. We can't build the equivalent citation graphs for data.
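To make the missing "reverse citation" concrete, here is a minimal sketch of what holding both directions of a link would look like if the links lived in one place. The identifiers and the `build_reverse_index` helper are hypothetical, invented purely for illustration; nothing here reflects how GenBank or any publisher actually stores links.

```python
from collections import defaultdict

# Hypothetical forward links: article -> data records it cites.
# All identifiers are illustrative placeholders, not real records.
article_cites = {
    "article:A": ["genbank:SEQ1", "specimen:SPEC1"],
    "article:B": ["genbank:SEQ1"],
}


def build_reverse_index(forward):
    """Invert article -> data links into data -> citing-articles links."""
    cited_by = defaultdict(set)
    for article, targets in forward.items():
        for target in targets:
            cited_by[target].add(article)
    return cited_by


cited_by = build_reverse_index(article_cites)

# The sequence now "knows" which articles cite it.
print(sorted(cited_by["genbank:SEQ1"]))
```

The inversion itself is trivial; the hard part is that it is only possible once the forward links from many articles sit in the same store.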

One way to tackle this is to bring the data "in house", and this is what Pensoft are doing with their markup of ZooKeys. But they are basically creating mashups of external data sources (a bit like iSpecies). We need to do better.

Prototyping

One way to make this more concrete is to think how a hub could be prototyped. Let's imagine we start with a platform like Semantic MediaWiki. I've an on-again/off-again relationship with Semantic MediaWiki: it's powerful but basically a huge, ugly hack on top of a huge piece of software that wasn't designed for this purpose, but it's a way to conceptualise what can be done. So, I'm not arguing that Semantic MediaWiki is how to do this for real (trust me, I'm really, really not), but that it's a way to explore the idea of a hub.

OK, let's say we start with a paper, say an Open Access paper from PLoS. We parse the XML, extract every entity that matters (authors, cited articles, GenBank accessions, taxon names, georeferenced localities, specimen codes, chemical compounds, etc.) and then create pages for each entity (including the article itself). Each page has a URL that uses an external id (where available) as a "slug" (e.g., the URL for the page for a GenBank sequence includes the accession number).
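As a rough sketch of the "explode the article" step, the snippet below pulls candidate GenBank accession numbers out of an XML document and turns each into a page slug. Everything in it is an assumption made for illustration: the sample XML is a toy fragment rather than real PLoS markup, the accession numbers are made up, the `/genbank/...` URL layout is hypothetical, and the regular expression covers only the common nucleotide accession shapes (one letter plus five digits, or two letters plus six digits), so it will produce false positives on real text.

```python
import re
import xml.etree.ElementTree as ET

# Simplified pattern for GenBank nucleotide accessions: one letter + five
# digits, or two letters + six digits. Deliberately naive.
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

# Toy XML fragment; the accession numbers are invented for this example.
SAMPLE_XML = """<article>
  <body>
    <p>Prey sequences were deposited in GenBank (accessions EF000001 and
    EF000002) and compared with U12345. See also EF000001.</p>
  </body>
</article>"""


def extract_accessions(xml_text):
    """Return unique accession-like strings in document order."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(ACCESSION.findall(text)))


def page_slug(entity_type, external_id):
    """Build a hub page path that uses the external id as the slug."""
    return f"/{entity_type}/{external_id}"


accessions = extract_accessions(SAMPLE_XML)
pages = [page_slug("genbank", acc) for acc in accessions]
print(pages)
```

The same pattern (find an identifier, mint a page keyed on it) would apply to taxon names, specimen codes and localities, although those are much harder to recognise reliably than accession numbers.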

Now we have exploded the article into a set of pages, which are linked to the source article (we can use Semantic MediaWiki to specify the type of those links), and each entity "knows" that it has a relationship with that article.

Then we set about populating each page. The article page is trivial, just reformat the text. For other entities we construct pages using data from external databases wherever possible (e.g., metadata about a sequence from GenBank).

So far we have just one article. But that article is linked to a bunch of other articles (those that it cites), and there may be other less direct links (e.g., GenBank sequences are linked to the article that publishes them, taxonomic names may be linked to articles that publish the name, etc.). We could add all these articles to a queue and process each article in the same way. In a sense we are now crawling a graph that includes the citation graph, but it includes links the citation graph misses (such as articles that cite the same data, see Enhanced display of scientific articles using extended metadata for more on this).
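The queue-and-process idea is essentially a breadth-first crawl. Here is a minimal sketch under stated assumptions: the `LINKS` graph is a hypothetical stand-in for whatever the article parser and external databases would return, and the `limit` parameter is an invented safety valve rather than part of any real design.

```python
from collections import deque

# Hypothetical link graph. Note that article:3 is reachable only through a
# shared sequence, i.e. a link the plain citation graph would miss.
LINKS = {
    "article:1": ["article:2", "genbank:X"],
    "article:2": ["article:1"],
    "genbank:X": ["article:3"],
    "article:3": [],
}


def crawl(seeds, get_links, limit=100):
    """Breadth-first crawl from the seeds, visiting each node at most once."""
    queue = deque(seeds)
    seen = set(seeds)
    order = []
    while queue and len(order) < limit:
        node = queue.popleft()
        order.append(node)  # in a real hub: build or refresh this node's page
        for neighbour in get_links(node):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


visited = crawl(["article:1"], lambda node: LINKS.get(node, []))
print(visited)
```

Because data records are nodes in the same graph as articles, the crawler reaches article:3 via the sequence even though no article cites it directly.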

The first hurdle we hit will be that many articles are not open access, in which case they can't be exploded into full text and associated entities. But that's OK, we can create article pages that simply display the article metadata, and link(s) to articles in the citation graph. Furthermore, we can get some data citation links for closed access articles, e.g. from GenBank.

So now we let the crawler loose. We could feed it a bunch of articles to start with (e.g., those in the original Hub, if that list still exists, and those from various RSS feeds such as PLoS, ZooKeys, BMC, etc.).

Entry points

Users can enter the hub in various ways. Via an article would be the traditional way (e.g., via a link from the article itself on the PLoS site). But let's imagine we are interested in a particular organism, such as Macaroni Penguins. PLoS has an interesting article on their feeding [Studying Seabird Diet through Genetic Analysis of Faeces: A Case Study on Macaroni Penguins (Eudyptes chrysolophus) doi:10.1371/journal.pone.0000831]. This article includes sequence data for prey items. If we enhance this article we build connections between the bird and its prey. At the simplest level, the hub pages for the crustacea and fish it feeds on would include citations to this article, enabling people to follow that connection (with more sophistication the nature of the relationship could also be specified). This article refers to a specific locality and penguin colony, which is in a marine reserve (Heard Island and McDonald Islands Marine Reserve). Extract these entities, and other papers relevant to this area would be linked as they are added (e.g., using a geospatial query). Hence people interested in what we know about the biology of organisms in specific localities would dive in via the name (or URL) of the locality.
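The simplest form of the geospatial query mentioned above is a bounding-box test. The sketch below assumes each article has already been reduced to one or more georeferenced points; the box coordinates are rough placeholders rather than the reserve's actual boundary, and the article identifiers and points are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, lat, lon):
        """True if the point lies inside the box (edges included)."""
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Placeholder box in the southern Indian Ocean; NOT a real reserve boundary.
RESERVE = BoundingBox(min_lat=-54.0, min_lon=72.0, max_lat=-52.0, max_lon=75.0)

# Hypothetical (article, latitude, longitude) records.
LOCALITIES = [
    ("article:A", -53.1, 73.5),
    ("article:B", -46.9, 37.7),
    ("article:C", -52.9, 73.3),
]


def articles_within(box, records):
    """Return the articles whose locality falls inside the box."""
    return [article for article, lat, lon in records if box.contains(lat, lon)]


nearby = articles_within(RESERVE, LOCALITIES)
print(nearby)
```

A real hub would want proper polygons and a spatial index, and this naive test does not handle boxes that cross the antimeridian, but the principle is the same: once localities are entities with coordinates, "what else do we know about this place?" becomes a query.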

Summary

The core idea here is taking an article, exploding it, and treating every element as a first class citizen of the resulting database. It is emphatically not about a nice way to display a bunch of articles, it's not a "collection", and articles aren't privileged. Nor are articles from any particular publisher privileged. One consequence of this is that it may not appeal to an individual publisher because it's not about making a particular publisher's content look better than another's (although having Open Access content in XML makes it much easier to play this game).

The real goal of an approach like this is to end up with a database (or "knowledgebase") that is much richer than simply a database of articles (or sequences, or specimens), and which moves from being a platform for repurposing article text to a platform for facilitating discovery.

Tuesday, July 09, 2013

The demise of the @PLoS Biodiversity Hub: what lessons can we learn?

Jonathan Eisen recently wrote that the PLOS Hub for Biodiversity is soon to be retired, and sure enough it's vanished from the web (the original URL hubs.plos.org/web/biodiversity/ now bounces you straight to http://www.plosone.org/, although you can still see what it looked like in the Wayback Machine).

Like Jonathan, I was involved in the hub, which was described in the following paper:

Mindell, D. P., Fisher, B. L., Roopnarine, P., Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating, Tagging and Integrating Biodiversity Research. (S. A. Rands, Ed.) PLoS ONE, 6(8), e19491. doi:10.1371/journal.pone.0019491

In retrospect PLoS's decision to pull the hub is not surprising. The original proposal imagined a web site looking like this, with the goal of building a "dynamic community".

From my perspective the PLoS Hub failed for two reasons. The first is that PLoS weren't nearly as ambitious as they could have been. The second is that the biodiversity informatics community simply couldn't (and arguably still can't) provide the kind of services that PLoS would have needed to make the Hubs something worth nurturing.

After a meeting at the California Academy of Sciences in April 2010 to discuss the hub idea I wrote a ranty blog post (Biodiversity informatics = #fail (and what to do about it)) where I expressed my frustration that we had a group of people (i.e., PLoS) rock up and express serious interest in doing something with biodiversity data, and biodiversity informatics collectively failed them. We could have been aiming for a cool database of "semantically enhanced" publications that we could query taxonomically, geographically, phylogenetically, etc. (at least, that's what I was hoping PLoS were aiming for). Instead it became clear that most of the basic services were simply not available (we didn't have simple code to extract GenBank accession numbers, specimen codes, etc., we couldn't link specimen codes to anything online, and woe betide you if you asked what a taxon name was).

In fairness, it also became pretty clear that PLoS weren't going to go too far down the line of an all-singing portal to biodiversity data. They were really looking at a shiny web site that housed a collection of Open Access papers on biodiversity. But my point is it could have been so much more than that. We had a chance to build a platform, a knowledge base for biodiversity data that had an accessible front end (e.g., the traditional publication) but exploded that into its component parts so we could spin the data around and ask other questions.

Inspired by the possibilities I spent the next couple of months playing with some linked data demos (see here and here, the links in these demos have long since died). The idea was to explore how much of what I imagined the PLoS Hub could be it was possible to build using RDF and SPARQL. It was fun, but RDF and SPARQL are awful things to "play" with, and the vast bulk of the data had to be wrapped in custom scripts I wrote because the original data providers didn't supply RDF. As I've written elsewhere, I think the cost of getting to a place where RDF enables you to do meaningful stuff is just too high. Our data are too messy, we lack agreed identifiers, and we either have too many or too few vocabularies (and those we do have invariably spark lengthy, philosophical debates - vocabularies are taxonomies of data, need I say more). The RDF approach is also doomed to fail because it assumes multiple decentralised data repositories are the way forward. In my experience, these cannot deliver the kinds of things we need. The data need to be brought together, cleaned, aligned, augmented, and finally linked together. This is much easier to do if all the data are in one place.

So where does this leave us? In many ways I'd like to attempt something like PLoS Hubs again, or perhaps more precisely, think about building a platform so that if a publisher came along and wanted to do something similar (but more ambitious) we would have the tools in place that could make it happen. What I'd like is a way more sophisticated version of this, where you could explore data in various dimensions (geography, taxonomy, phylogeny), track citation and provenance information (what papers cite this specimen, what sequences is it a voucher for, what trees are built on those sequences). If we had a platform that supported these sorts of queries, not only could we provide a great environment upon which we could embed scientific publications, we could also support the kinds of queries we can't do at the moment (e.g., give me all the molecular phylogenies for species in Madagascar, locate all the data - publications, taxonomic identifications, sequences - about a specimen, etc.).
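To illustrate the provenance questions in the previous paragraph (what papers cite this specimen, what sequences is it a voucher for, what trees are built on those sequences), here is a toy sketch that answers them over typed links held in a single list. The relation names, identifiers and the `specimen_report` helper are all hypothetical; this is not a proposal for a schema, just a demonstration that the queries become simple joins once the links are in one place.

```python
# Hypothetical typed links as (subject, relation, object) tuples.
LINKS = [
    ("sequence:S1", "voucher", "specimen:X"),
    ("sequence:S2", "voucher", "specimen:X"),
    ("tree:T1", "built_from", "sequence:S1"),
    ("paper:P1", "cites", "specimen:X"),
    ("paper:P2", "cites", "specimen:Y"),
]


def subjects(relation, obj):
    """All subjects that point at obj via the given relation, sorted."""
    return sorted(s for s, r, o in LINKS if r == relation and o == obj)


def specimen_report(specimen):
    """Gather papers, sequences and trees connected to one specimen."""
    sequences = subjects("voucher", specimen)
    trees = sorted({t for seq in sequences for t in subjects("built_from", seq)})
    return {
        "papers": subjects("cites", specimen),
        "sequences": sequences,
        "trees": trees,
    }


report = specimen_report("specimen:X")
print(report)
```

The linear scans would obviously not scale, but the shape of the query (follow typed links outward from one entity) is the point.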

I'll leave you with a great rant about platforms. It's long but it's fun, and I think it speaks to where we are now in biodiversity informatics (hint, we aren't Amazon).

Tuesday, October 05, 2010

PLoS Biodiversity Hub launches

The PLoS Biodiversity Hub has launched today. There's a PLoS blog post explaining the background to the project, as well as a summary on the Hub itself:

The vision behind the creation of PLoS Hubs is to show how open-access literature can be reused and reorganized, filtered, and assessed to enable the exchange of research, opinion, and data between community members.

PLoS Hubs: Biodiversity provides two main functions to connect researchers with relevant content. First, open-access articles on the broad theme of biodiversity are selected and imported into the Hub. In time, the content will also be enhanced so that the articles are connected with data, and we will provide features to make the articles easier for people to use. These two functions - aggregation and adding value - build on the concept of open access, which removes all the barriers to access and reuse of journal article content.

Readers of iPhylo may recall my account of one of the meetings involved in setting up this hub, in which I began to despair about the lack of readiness of biodiversity informatics to provide much of the information needed for projects such as hubs. Despite this (or perhaps, because of it), I've become a member of the steering committee for the Biodiversity Hub. There's clearly a lot of interest in repurposing the content found in scientific articles, and I think we're going to see an increasing number of similar projects from the major players in science publishing, Open Access or otherwise. One of the challenges is going to be moving beyond the obvious things (such as making taxon names clickable) to enable new kinds of ways of reading, navigating, and querying the literature, and exploring ways to track the use that is made of the information in these articles. Biodiversity studies are ideally placed to explore this as the subject is data rich and much of that data, such as specimens and DNA sequences, persist over time and hence get reused (data citation gets very boring if the data is used just once). We also have obvious ways to enrich navigation, such as spatially and taxonomically.

For now the PLoS Biodiversity Hub is very pretty, but it's more a statement of intent than a real demonstration of what can be done. Let's hope our field gets its act together and seizes the opportunity that initiatives like the Hub represent. Publishers are desperate to differentiate themselves from their competitors by providing added value as part of the publication process, and they provide a real use case for all the data that the biodiversity projects have been accumulating over the last couple of decades.