Tuesday, July 09, 2013

The demise of the @PLoS Biodiversity Hub: what lessons can we learn?

Jonathan Eisen recently wrote that the PLOS Hub for Biodiversity is soon to be retired, and sure enough it's vanished from the web (the original URL hubs.plos.org/web/biodiversity/ now bounces you straight to http://www.plosone.org/, although you can still see what it looked like in the Wayback Machine).

Like Jonathan, I was involved in the hub, which was described in the following paper:
Mindell, D. P., Fisher, B. L., Roopnarine, P., Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating, Tagging and Integrating Biodiversity Research. (S. A. Rands, Ed.) PLoS ONE, 6(8), e19491. doi:10.1371/journal.pone.0019491

In retrospect PLoS's decision to pull the hub is not surprising. The original proposal imagined a web site looking like this, with the goal of building a "dynamic community".


From my perspective the PLoS Hub failed for two reasons. The first is that PLoS weren't nearly as ambitious as they could have been. The second is that the biodiversity informatics community simply couldn't (and arguably still can't) provide the kind of services that PLoS would have needed to make the Hubs something worth nurturing.

After a meeting at the California Academy of Sciences in April 2010 to discuss the hub idea I wrote a ranty blog post (Biodiversity informatics = #fail (and what to do about it)) where I expressed my frustration that we had a group of people (i.e., PLoS) rock up and express serious interest in doing something with biodiversity data, and biodiversity informatics collectively failed them. We could have been aiming for a cool database of "semantically enhanced" publications that we could query taxonomically, geographically, phylogenetically, etc. (at least, that's what I was hoping PLoS were aiming for). Instead it became clear that most of the basic services were simply not available (we didn't have simple code to extract GenBank accession numbers, specimen codes, etc., we couldn't link specimen codes to anything online, and woe betide you if you asked what a taxon name was).
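As an illustration of the kind of basic service in question, here is a minimal Python sketch of accession-number extraction. It is not code from the hub project: the function name `extract_accessions` is hypothetical, and the pattern assumes only the classic GenBank nucleotide formats (one letter plus five digits, or two letters plus six digits). It will miss other formats and can produce false positives on anything that happens to look the same, such as some specimen codes.

```python
import re

# Assumption: only the classic GenBank nucleotide accession formats are
# covered here (1 letter + 5 digits, or 2 letters + 6 digits).
ACCESSION_PATTERN = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")


def extract_accessions(text):
    """Return accession-like strings in order of appearance, without duplicates.

    Hypothetical helper for illustration; a real service would also need to
    confirm each candidate against GenBank itself.
    """
    seen = set()
    found = []
    for match in ACCESSION_PATTERN.finditer(text):
        candidate = match.group(0)
        if candidate not in seen:
            seen.add(candidate)
            found.append(candidate)
    return found


if __name__ == "__main__":
    # Made-up sentence, used only to exercise the pattern.
    sample = ("Sequences were deposited in GenBank "
              "(accession numbers AY123456, U12345 and EF654321).")
    print(extract_accessions(sample))
```

Pattern matching like this only finds candidates. Deciding whether a string really is an accession number, and linking it to the specimen it was sequenced from, is the part that needs an actual service behind it.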

In fairness, it also became pretty clear that PLoS weren't going to go too far down the line of an all-singing portal to biodiversity data. They were really looking at a shiny web site that housed a collection of Open Access papers on biodiversity. But my point is it could have been so much more than that. We had a chance to build a platform, a knowledge base for biodiversity data that had an accessible front end (e.g., the traditional publication) but exploded that into its component parts so we could spin the data around and ask other questions.

Inspired by the possibilities I spent the next couple of months playing with some linked data demos (see here and here; the links in these demos have long since died). The idea was to explore how much of what I imagined the PLoS Hub could be was actually possible to build using RDF and SPARQL. It was fun, but RDF and SPARQL are awful things to "play" with, and the vast bulk of the data had to be wrapped in custom scripts I wrote because the original data providers didn't supply RDF. As I've written elsewhere, I think the cost of getting to a place where RDF enables you to do meaningful stuff is just too high. Our data are too messy, we lack agreed identifiers, and we either have too many or too few vocabularies (and those we do have invariably spark lengthy, philosophical debates - vocabularies are taxonomies of data, need I say more). The RDF approach is also doomed to fail because it assumes multiple decentralised data repositories are the way forward. In my experience, these cannot deliver the kinds of things we need. The data need to be brought together, cleaned, aligned, augmented, and finally linked together. This is much easier to do if all the data are in one place.

So where does this leave us? In many ways I'd like to attempt something like PLoS Hubs again, or perhaps more precisely, think about building a platform so that if a publisher came along and wanted to do something similar (but more ambitious) we would have the tools in place that could make it happen. What I'd like is a way more sophisticated version of this, where you could explore data in various dimensions (geography, taxonomy, phylogeny) and track citation and provenance information (which papers cite this specimen, which sequences it is a voucher for, which trees are built on those sequences). If we had a platform that supported these sorts of queries, not only could we provide a great environment in which to embed scientific publications, we could also support the kinds of queries we can't do at the moment (e.g., give me all the molecular phylogenies for species in Madagascar, locate all the data - publications, taxonomic identifications, sequences - about a specimen, etc.).
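To make the "all the molecular phylogenies for species in Madagascar" example concrete, here is a rough sketch of what that query could look like once specimens, sequences and trees sit in one place and share identifiers. Everything in it is an assumption made for illustration: the three-table schema, the function names `build_demo_db` and `trees_for_country`, and the placeholder rows (`SPECIMEN-1`, `SEQ-1`, `TREE-1` and so on are invented, not real records). It uses an in-memory SQLite database purely for convenience.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE specimen (
            code    TEXT PRIMARY KEY,
            taxon   TEXT,
            country TEXT
        );
        CREATE TABLE sequence (
            accession TEXT PRIMARY KEY,
            voucher   TEXT REFERENCES specimen(code)
        );
        CREATE TABLE tree_sequence (
            tree_id   TEXT,
            accession TEXT REFERENCES sequence(accession)
        );
    """)
    # Placeholder rows: invented identifiers, not real data.
    conn.executemany("INSERT INTO specimen VALUES (?, ?, ?)", [
        ("SPECIMEN-1", "Taxon a", "Madagascar"),
        ("SPECIMEN-2", "Taxon b", "Kenya"),
    ])
    conn.executemany("INSERT INTO sequence VALUES (?, ?)", [
        ("SEQ-1", "SPECIMEN-1"),
        ("SEQ-2", "SPECIMEN-2"),
    ])
    conn.executemany("INSERT INTO tree_sequence VALUES (?, ?)", [
        ("TREE-1", "SEQ-1"),
        ("TREE-2", "SEQ-2"),
        ("TREE-3", "SEQ-1"),
        ("TREE-3", "SEQ-2"),
    ])
    return conn


def trees_for_country(conn, country):
    """Return IDs of trees built on sequences vouchered by specimens from a country."""
    rows = conn.execute("""
        SELECT DISTINCT ts.tree_id
        FROM specimen AS sp
        JOIN sequence AS sq ON sq.voucher = sp.code
        JOIN tree_sequence AS ts ON ts.accession = sq.accession
        WHERE sp.country = ?
        ORDER BY ts.tree_id
    """, (country,)).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(trees_for_country(conn, "Madagascar"))  # ['TREE-1', 'TREE-3']
```

The query itself is three joins. The hard part is everything the sketch takes for granted: that each sequence records its voucher, that the voucher code resolves to a single specimen record, and that the specimen has a clean locality.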

I'll leave you with a great rant about platforms. It's long but it's fun, and I think it speaks to where we are now in biodiversity informatics (hint, we aren't Amazon).