Wednesday, July 17, 2013

Augmenting ZooKeys bibliographic data to flesh out the citation graph

In a previous post (Learning from eLife: GitHub as an article repository) I discussed the advantages of an Open Access journal putting its article XML in a version-controlled repository like GitHub. In response to that post Pensoft (the publisher of ZooKeys) did exactly that, and the XML is available at https://github.com/pensoft/ZooKeys-xml.

OK, "now what?" I hear you ask. Originally I'd used the example of incorrect bibliographic data for citations as the motivation, but there are other things we can do as well. For example, when reading a ZooKeys article (say, using my eLife Lens-inspired viewer) I notice references that should have a DOI but which don't. With the XML available I could add this. This adds another link in the citation graph (in this case connecting the ZooKeys paper with the article it cites). If Pensoft were to use that XML to regenerate the HTML version of the article on their web site then the reader will be able to click on the DOI and read the cited article (instead of the "cut-and-paste-and-Google-it" dance). Furthermore, Pensoft could update the metadata they've submitted to CrossRef, so that CrossRef knows that the reference with the newly added DOI has been cited by the ZooKeys paper.

To experiment with this I've written some scripts that take ZooKeys XML, extract each citation from the list of literature cited, and look up a DOI for each reference that lacks one (using the CrossRef metadata search API). If a DOI is found I insert it into the original XML. I then push this XML to my fork of Pensoft's repository (https://github.com/rdmpage/ZooKeys-xml). I can then ask Pensoft to update their repository (by issuing a "pull request"), and if Pensoft like what they see, they can accept my edits.
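To give a flavour of what these scripts do, here's a minimal sketch in Python. It's illustrative rather than the actual script: the modern CrossRef API at https://api.crossref.org/works stands in for the metadata search API, the file names are hypothetical, and the match threshold is a crude heuristic.

import requests
from lxml import etree

def find_doi(citation_text):
    """Query CrossRef with the raw citation string; return a DOI or None."""
    r = requests.get("https://api.crossref.org/works",
                     params={"query.bibliographic": citation_text, "rows": 1})
    r.raise_for_status()
    items = r.json()["message"]["items"]
    # Crude threshold: only accept a reasonably confident match
    if items and items[0].get("score", 0) > 80:
        return items[0]["DOI"]
    return None

tree = etree.parse("zookeys-316-5132.xml")  # hypothetical file name
for ref in tree.findall(".//ref"):
    citation = ref.find("mixed-citation")
    # Skip refs with no citation element, or that already carry a DOI
    if citation is None or citation.find("pub-id[@pub-id-type='doi']") is not None:
        continue
    doi = find_doi(" ".join(citation.itertext()))
    if doi:
        pub_id = etree.SubElement(citation, "pub-id")
        pub_id.set("pub-id-type", "doi")
        pub_id.text = doi

tree.write("zookeys-316-5132-doi.xml", xml_declaration=True, encoding="UTF-8")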

Automating the process makes this much more scalable, although manual editing will still be useful in some cases, especially where the original references haven't been correctly atomised into title, journal, etc.

So that the output is visible regardless of whether Pensoft decides to accept it, I've updated my ZooKeys article viewer to fetch the XML not from the ZooKeys web site, but from my GitHub repository. This means you get the latest version of the XML, complete with additional DOIs (if any have been added).

Initial experiments are encouraging, but it's also apparent that lots of citations lack DOIs. However, this doesn't mean that they aren't online. Indeed, a growing number of articles are available through my BioStor repository, and through BioNames. Both of these sites have an API, so the next step is to add them to the script that augments the XML. This brings us a little closer to the ultimate goal of having every taxonomic paper online and linked to every paper that either cites, or is cited by, that paper.

Monday, July 15, 2013

Using @IFTTT to create a Twitter stream for @EvolDir

Since 2009 I've been running a service that takes posts to the EvolDir mailing list and sends them to a Twitter stream at @EvolDir. This service was running on a local machine which has died. Rather than wait until I rebuild that server (again), I looked around for other ways to recreate this service. I decided to use IFTTT ("if this then that"), which is a wonderful way to chain together web services.

IFTTT uses "recipes" to chain together two services (if something happens here, do this). For example, given a RSS feed of recent messages to EvolDir, I can create a recipe to send the links to those posts to Twitter:
[Screenshot: IFTTT recipe sending an RSS feed to Twitter]
Great, except that EvolDir is an old-fashioned mailing list, distributed by email only. There is no web site for the list with URLs for each post, and no RSS feed. IFTTT to the rescue again. I created a recipe that reads emails for a Gmail account that is subscribed to EvolDir, and each time it gets an email from EvolDir it sends the contents of that email to a blog on Tumblr:
[Screenshot: IFTTT recipe sending Gmail messages to Tumblr]
Now each post has a web presence, and Tumblr generates an RSS feed for the blog, so the first recipe can take that feed and, every time it is updated, send the new post to Twitter. Simples.

Friday, July 12, 2013

Learning from eLife: GitHub as an article repository

Playing with my eLife Lens-inspired article viewer and some recent articles from ZooKeys, I regularly come across articles that are incorrectly marked up. As a quick reminder, my viewer takes the DOI for a ZooKeys article (just append it to http://bionames.org/labs/zookeys-viewer/?doi=, e.g. http://bionames.org/labs/zookeys-viewer/?doi=10.3897/zookeys.316.5132), fetches the corresponding XML and displays the article.

Taking the article above as an example, I was browsing the list of literature cited and trying to find those articles in BioNames or BioStor. Sometimes an article that should have been found wasn't, and on closer investigation the problem was that the ZooKeys XML had mangled the citation. To illustrate, take the following XML:

<ref id="B112"><mixed-citation xlink:type="simple"><person-group><name name-style="western"><surname>Tschorsnig</surname> <given-names>HP</given-names></name><name name-style="western"><surname>Herting</surname> <given-names>B</given-names></name></person-group> (<year>1994</year>) <article-title>Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: Bestimmungstabellen und Angaben zur Verbreitung und Ökologie der einzelnen Arten. Stuttgarter Beiträge zur Naturkunde.</article-title> <source>Serie A (Biologie)</source> <volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.</mixed-citation></ref>

Look at the contents of the article-title (title) and source (journal) tags: most of the journal name ("Stuttgarter Beiträge zur Naturkunde") has ended up inside the article title, leaving just "Serie A (Biologie)" as the source. The title and journal should actually look like this:

<ref id="B112"><mixed-citation xlink:type="simple"><person-group><name name-style="western"><surname>Tschorsnig</surname> <given-names>HP</given-names></name><name name-style="western"><surname>Herting</surname> <given-names>B</given-names></name></person-group> (<year>1994</year>) <article-title>Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: Bestimmungstabellen und Angaben zur Verbreitung und Ökologie der einzelnen Arten. Stuttgarter Beiträge zur Naturkunde.</article-title> <source>Serie A (Biologie)</source> <volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.</mixed-citation></ref>

Tools that rely on accurately parsed metadata to find articles, such as OpenURL resolvers, will fail in cases like this. Of course, we could use tools that don't have this requirement, but we could also fix the XML so that OpenURL resolution succeeds.

This is where the example of the journal eLife comes in. They deposit article XML in GitHub where anyone can grab it and mess with it. Let's imagine we did the same for ZooKeys, created a GitHub repository for the XML, and then edited it in cases where the article metadata is clearly broken. A viewer like mine could then fetch the XML, not from ZooKeys, but from GitHub, and thus take advantage of any corrections made.
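Switching a viewer over is a small change: instead of asking the ZooKeys site for the XML, fetch the raw file from GitHub. A minimal sketch, assuming a flat repository layout and an illustrative file name:

import requests

def fetch_article_xml(filename, repo="rdmpage/ZooKeys-xml", branch="master"):
    # Raw files on GitHub are served from raw.githubusercontent.com
    url = f"https://raw.githubusercontent.com/{repo}/{branch}/{filename}"
    r = requests.get(url)
    r.raise_for_status()
    return r.text  # the latest (possibly corrected) XML

xml = fetch_article_xml("zookeys-316-5132.xml")  # hypothetical file name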

We could imagine this as part of a broader workflow that would also incorporate other sources of articles, such as BHL. We could envisage workflows that take BHL scans, convert them to editable XML, then repurpose that content (see BHL to PDF workflow for a sketch). I like the idea that there is considerable overlap between the most recent publishing ventures (such as eLife and ZooKeys) and the goal of bringing biodiversity legacy literature to life.

Thursday, July 11, 2013

Barcode Index Number (BIN) System in DNA barcoding explained

Quick note to highlight the following publication:
Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.) PLoS ONE, 8(7), e66213. doi:10.1371/journal.pone.0066213
This paper outlines the methods used by the BOLD project to cluster sequences into "BINs", and touches on the issue of dark taxa (taxa that are in GenBank but which lack formal scientific names). It might be time to revisit the dark taxa idea, especially now that I've got a better handle on the taxonomic literature (see BioNames), where the names of at least some dark taxa may lurk.

Wednesday, July 10, 2013

The challenge of semantically marking up articles (more thoughts on PLoS Hubs)

Here's a sketch of how we might build something like the original PLoS Hubs vision (see The demise of the @PLoS Biodiversity Hub: what lessons can we learn? for background). In the blog post explaining the vision behind PLoS Hubs (Aggregating, tagging and connecting biodiversity studies), David Mindell wrote:

The Hub is a work in progress and its potential benefits are many. These include enhancing information exchange and databasing efforts by adding links to existing publications and adding semantic tagging of key terms such as species names, making that information easier to find in web searches. Increasing linkages and synthesizing of biodiversity data will allow better analyses of the causes and consequences of large scale biodiversity change, as well as better understanding of the ways in which humans can adapt to a changing world.

So, first up, some basic principles. The goal (only slightly tongue in cheek) is for the user to never want to leave the site. Put another way, I want something as engaging as Wikipedia, where almost all the links are to other Wikipedia pages (more on this below). This means every object (publication, taxon name, DNA sequence, specimen, person, place) gets a "page". Everything is a first class citizen.

It seems to me that the fundamental problem faced by journals that "semantically enhance" their content by adding links is that those links go somewhere else (e.g., to a paper on another journal's site, or to an external database). So, why would you do this? Why actively send people away? Google can do this because you will always come back. Other sites don't have this luxury. In the case of citations (i.e., adding DOIs to the list of literature cited) I guess the tradeoff is that because all journals are in this together, you will receive traffic from your "rivals" if they publish papers that cite your content. So you are building a network across which traffic can flow (i.e., the "citation graph"), one that will (potentially) drive traffic to you as well as send people away.

If we add other sorts of links (say, to GenBank, taxon databases, specimen databases like GBIF, locality databases such as Protected Planet, etc.) then that traffic has no obvious way of coming back. This also has a deeper consequence: those databases don't "know" about these links, so we lose the reverse citation information. For example, a semantically marked-up paper may know that it cites a sequence, but that sequence doesn't know that it has been cited. We can't build the equivalent citation graphs for data.

One way to tackle this is to bring the data "in house", and this is what Pensoft are doing with their markup of ZooKeys. But they are basically creating mashups of external data sources (a bit like iSpecies). We need to do better.

Prototyping

One way to make this more concrete is to think about how a hub could be prototyped. Let's imagine we start with a platform like Semantic MediaWiki. I have an on-again/off-again relationship with Semantic MediaWiki: it's powerful, but basically a huge, ugly hack on top of a huge piece of software that wasn't designed for this purpose. Still, it's a way to conceptualise what can be done. So, I'm not arguing that Semantic MediaWiki is how to do this for real (trust me, I'm really, really not), but that it's a way to explore the idea of a hub.

OK, let's say we start with a paper, say an Open Access paper from PLoS. We parse the XML, extract every entity that matters (authors, cited articles, GenBank accessions, taxon names, georeferenced localities, specimen codes, chemical compounds, etc.) and then create pages for each entity (including the article itself). Each page has a URL that uses an external id (where available) as a "slug" (e.g., the URL for the page for a GenBank sequence includes the accession number).
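As a sketch of this exploding step (the regular expressions below are crude stand-ins for real entity extraction, which would also exploit the XML markup itself; page_url shows the "slug" idea):

import re
from lxml import etree

# Crude, illustrative patterns; a real pipeline would also use the markup
GENBANK = re.compile(r"\b[A-Z]{1,2}\d{5,6}\b")      # e.g. AY123456
DOI = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")         # cited articles

def extract_entities(xml_path):
    """Return the identifiers found in the article's full text."""
    text = " ".join(etree.parse(xml_path).getroot().itertext())
    return {
        "sequence": set(GENBANK.findall(text)),
        "article": set(DOI.findall(text)),
    }

def page_url(entity_type, external_id):
    # The external identifier doubles as the URL slug
    return f"/{entity_type}/{external_id}"

# e.g. page_url("sequence", "AY123456") -> "/sequence/AY123456"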

Now we have exploded the article into a set of pages, which are linked to the source article (we can use Semantic MediaWiki to specify the type of those links), and each entity "knows" that it has a relationship with that article.

Then we set about populating each page. The article page is trivial, just reformat the text. For other entities we construct pages using data from external databases wherever possible (e.g., metadata about a sequence from GenBank).
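For example, a sequence page could be filled in from GenBank using NCBI's EUtils (a real, public API; the particular summary fields pulled out below are simply ones such a page might display):

import requests

def genbank_summary(accession):
    """Fetch basic metadata for a sequence from NCBI EUtils."""
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
        params={"db": "nucleotide", "id": accession, "retmode": "json"})
    r.raise_for_status()
    result = r.json()["result"]
    doc = result[result["uids"][0]]
    return {"title": doc.get("title"),
            "organism": doc.get("organism"),
            "length": doc.get("slen")}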

So far we have just one article. But that article is linked to a bunch of other articles (those that it cites), and there may be other less direct links (e.g., GenBank sequences are linked to the article that publishes them, taxonomic names may be linked to articles that publish the name, etc.). We could add all these articles to a queue and process each article in the same way. In a sense we are now crawling a graph that includes the citation graph, but it includes links the citation graph misses (such as articles that cite the same data, see Enhanced display of scientific articles using extended metadata for more on this).
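The crawler itself can be very simple: a breadth-first walk over whatever links we can extract. Here fetch_article and extract_links are placeholders for the machinery sketched above:

from collections import deque

def crawl(seed_dois, fetch_article, extract_links, max_articles=1000):
    seen = set(seed_dois)
    queue = deque(seed_dois)
    while queue and len(seen) <= max_articles:
        doi = queue.popleft()
        article = fetch_article(doi)  # may return None (e.g. closed access)
        if article is None:
            continue
        # Follow citations *and* data links (sequences, names, specimens...)
        for linked in extract_links(article):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return seen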

The first hurdle we hit will be that many articles are not open access, in which case they can't be exploded into full text and associated entities. But that's OK, we can create article pages that simply display the article metadata, and link(s) to articles in the citation graph. Furthermore, we can get some data citation links for closed access articles, e.g. from GenBank.

So now we let the crawler loose. We could feed it a bunch of articles to start with (e.g., those in the original Hub, if that list still exists, or those from various RSS feeds from PLoS, ZooKeys, BMC, etc.).

Entry points

Users can enter the hub in various ways. Via an article would be the traditional way (e.g., via a link from the article itself on the PLoS site). But let's imagine we are interested in a particular organism, such as Macaroni Penguins. PLoS has an interesting article on their feeding [Studying Seabird Diet through Genetic Analysis of Faeces: A Case Study on Macaroni Penguins (Eudyptes chrysolophus) doi:10.1371/journal.pone.0000831]. This article includes sequence data for prey items. If we enhance this article we build connections between the bird and its prey. At the simplest level, the hub pages for the crustacea and fish it feeds on would include citations to this article, enabling people to follow that connection (with more sophistication the nature of the relationship could also be specified). The article also refers to a specific locality and penguin colony, which is in a marine reserve (Heard Island and McDonald Islands Marine Reserve). Extract these entities, and other papers relevant to this area would be linked as they are added (e.g., using a geospatial query, sketched below). Hence people interested in what we know about the biology of organisms in specific localities could dive in via the name (or URL) of the locality.
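A minimal sketch of that geospatial linking, with an illustrative bounding box for the reserve standing in for a real spatial index:

# Illustrative coordinates; a real hub would use a proper spatial index
localities = {
    "Heard Island and McDonald Islands Marine Reserve":
        {"min_lat": -53.5, "max_lat": -52.9, "min_lon": 72.5, "max_lon": 74.0},
}

def localities_containing(lat, lon):
    """Return the locality pages whose bounding box contains this point."""
    return [name for name, bb in localities.items()
            if bb["min_lat"] <= lat <= bb["max_lat"]
            and bb["min_lon"] <= lon <= bb["max_lon"]]

# A georeferenced point extracted from the penguin paper links the article
# to the reserve's page (and vice versa)
print(localities_containing(-53.1, 73.5))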

Summary

The core idea here is taking an article, exploding it, and treating every element as a first class citizen of the resulting database. It is emphatically not about a nice way to display a bunch of articles, it's not a "collection", and articles aren't privileged. Nor are articles from any particular publisher privileged. One consequence of this is that it may not appeal to an individual publisher because it's not about making a particular publisher's content look better than another's (although having Open Access content in XML makes it much easier to play this game).

The real goal of an approach like this is to end up with a database (or "knowledgebase") that is much richer than simply a database of articles (or sequences, or specimens), one that moves from being a platform for repurposing article text to a platform for facilitating discovery.

Tuesday, July 09, 2013

The problem with apps for journals

I'm a big fan of the work Nature Publishing Group is doing in experimenting with new methods of publishing, such as their iOS apps (which inspired me to "clone" their iPhone app) and the Encode app with its concept of "threads". But there's an aspect of the iPad app that puzzles me: Nature's app doesn't know that a linked article elsewhere in Nature is also available using the app. To illustrate, consider this piece on horse evolution (doi:10.1038/nature.2013.13261), which is available in the iPad app (with a subscription):

[Screenshot: the horse evolution piece displayed in the Nature iPad app]

This News & Comment piece links to the main article (doi:10.1038/nature12323), which is also available in the app:

[Screenshot: the cited article, also available in the app]

But if I click on the link in the app, I get the article displayed in a web browser view (just as if I was in a browser on the iPad, or any other device):

[Screenshot: the link opens in an embedded web browser view instead]

This is odd, to say the least. If the article is available in the app (and much more attractively formatted than on the web) why doesn't the app "know" this and take me to that view?

In a sense this is part of a larger problem with publisher-specific apps. If the article isn't part of the publisher's catalogue then you are off to the web (or, indeed, another publisher's app), which makes for a jarring reading experience: each web site (or app) has its own way of doing things. Part of this is because different publishers represent different silos, mostly locked behind paywalls.

We can extend this argument to links to other cited entities, such as sequences, specimens, chemicals, etc. In most cases these aren't linked, and if they are, we are whisked off to a web site somewhere else (say, GenBank), a web site, furthermore, that typically knows nothing about the article we came from (e.g., it doesn't know about the citation relationship we've just traversed). I think we can do better than this, but it will need us to treat links as more than simply jump-off points to the wider web. For example, if the Nature app not only knew about all the Nature articles that were available to it, but also stored information about DNA sequences, chemical compounds, taxa, and other entities, then we could have a richer reading experience with potentially denser cross links (e.g., the app could display a genome and list all the Nature articles that cite that genome). Of course, this is still limited by publisher, and ultimately we want to break that silo as well. This is the attraction of starting with Open Access publications: we can link articles across publishers, but still navigate through their content in a consistent way.

The demise of the @PLoS Biodiversity Hub: what lessons can we learn?

Jonathan Eisen recently wrote that the PLOS Hub for Biodiversity is soon to be retired, and sure enough it's vanished from the web (the original URL hubs.plos.org/web/biodiversity/ now bounces you straight to http://www.plosone.org/, though you can still see what it looked like in the Wayback Machine).

Like Jonathan, I was involved in the hub, which was described in the following paper:
Mindell, D. P., Fisher, B. L., Roopnarine, P., Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating, Tagging and Integrating Biodiversity Research. (S. A. Rands, Ed.) PLoS ONE, 6(8), e19491. doi:10.1371/journal.pone.0019491

In retrospect PLoS's decision to pull the hub is not surprising. The original proposal imagined a web site looking like this, with the goal of building a "dynamic community".

[Mockup from the original proposal of what the hub web site might look like]

From my perspective the PLoS Hub failed for two reasons. The first is that PLoS weren't nearly as ambitious as they could have been. The second is that the biodiversity informatics community simply couldn't (and arguably still can't) provide the kind of services that PLoS would have needed to make the Hubs something worth nurturing.

After a meeting at the California Academy of Sciences in April 2010 to discuss the hub idea I wrote a ranty blog post (Biodiversity informatics = #fail (and what to do about it)) where I expressed my frustration that a group of people (i.e., PLoS) had rocked up and expressed serious interest in doing something with biodiversity data, and biodiversity informatics had collectively failed them. We could have been aiming for a cool database of "semantically enhanced" publications that we could query taxonomically, geographically, phylogenetically, etc. (at least, that's what I was hoping PLoS were aiming for). Instead it became clear that most of the basic services were simply not available: we didn't have simple tools to extract GenBank accession numbers, specimen codes, etc., we couldn't link specimen codes to anything online, and woe betide you if you asked what a taxon name was.

In fairness, it also became pretty clear that PLoS weren't going to go too far down the line of an all-singing portal to biodiversity data. They were really looking at a shiny web site that housed a collection of Open Access papers on biodiversity. But my point is it could have been so much more than that. We had a chance to build a platform, a knowledge base for biodiversity data with an accessible front end (e.g., the traditional publication) that exploded each publication into its component parts so we could spin the data around and ask other questions.

Inspired by the possibilities I spent the next couple of months playing with some linked data demos (see here and here, the links in these demos have long since died). The idea was to explore how much of what I imagined the PLoS Hub could be it was possible to build using RDF and SPARQL. It was fun, but RDF and SPARQL are awful things to "play" with, and the vast bulk of the data had to be wrapped in custom scripts I wrote because the original data providers didn't supply RDF. As I've written elsewhere, I think the cost of getting to a place where RDF enables you to do meaningful stuff is just too high. Our data are too messy, we lack agreed identifiers, and we either have too many or too few vocabularies (and those we do have invariably spark lengthy, philosophical debates - vocabularies are taxonomies of data, need I say more). The RDF approach is also doomed to fail because it assumes multiple decentralised data repositories are the way forward. In my experience, these cannot deliver the kinds of things we need. The data need to be brought together, cleaned, aligned, augmented, and finally linked together. This is much easier to do if all the data are in one place.

So where does this leave us? In many ways I'd like to attempt something like PLoS Hubs again, or, more precisely, to build a platform so that if a publisher came along and wanted to do something similar (but more ambitious) we would have the tools in place to make it happen. What I'd like is a way more sophisticated version of this, where you could explore data in various dimensions (geography, taxonomy, phylogeny) and track citation and provenance information (what papers cite this specimen, what sequences is it a voucher for, what trees are built on those sequences). If we had a platform that supported these sorts of queries, not only could we provide a great environment in which to embed scientific publications, we could also support the kinds of queries we can't do at the moment (e.g., give me all the molecular phylogenies for species in Madagascar, or locate all the data - publications, taxonomic identifications, sequences - about a specimen).

I'll leave you with a great rant about platforms. It's long but it's fun, and I think it speaks to where we are now in biodiversity informatics (hint, we aren't Amazon).