iPhylo: DSpace

Roderic D. M. Page

Showing posts with label DSpace. Show all posts

Friday, May 22, 2009

Dryad, DOIs, and why data matters more than journal articles

For the last two days I've been participating in a NESCent meeting on Dryad, a "repository of data underlying scientific publications, with an initial focus on evolutionary biology and related fields". The aim of Dryad is to provide a durable home for the kinds of data that don't get captured by existing databases such as GenBank and TreeBASE (for example, the Excel spreadsheets, Word files, and tarballs of data that, if they are lucky, make it on to a journal's web site as supplementary material (like this example). These data have an alarming tendency to disappear (see "Unavailability of online supplementary scientiﬁc information from articles published in major journals" doi:10.1096/fj.05-4784lsf).

Perhaps it was because I was participating virtually (via Adobe Connect, which worked very well), but at times I felt seriously out of step with many of the participants. I got the sense that they regard the scientific article as primary, data as secondary, and weren't entirely convinced that data needed to be treated in the same way as a publication. I was arguing that Dryad should assign DOIs to data sets, join CrossRef, and ensure data sets were cited in the same way as papers. For me this is a no brainer -- by making data equivalent to a publication, journals don't need to do anything special, publishers know how to handle DOIs, and will have fewer qualms than handling URLs, which have a nasty tendency to break (see "Going, Going, Gone: Lost Internet References" doi:10.1126/science.1088234).

Furthermore, being part of CrossRef would bring other benefits. Their cited-by linking service enables publishers to display lists of articles that cite a given paper -- imagine being able to do this for data sets. Dryad could display not just the paper associated with publication of the data set, but all subsequent citations. As an author, I'd love to see this. It would enable me to see what others had done with my data, and provide an incentive to submit my data to Dryad (providing incentives to authors to archive data is a big issue, see Mark Costello's recent paper doi:10.1525/bio.2009.59.5.9).

Not everyone saw things this way, and it's often a "reality check" to discover that things one takes for granted are not at all obvious to others (leading to mutual incomprehension). Many editors, understandably, think of the the journal article as primary, and data as something else (some even struggle to see why one would want to cite data). There's also (to my mind) a ridiculous level of concern about whether ISI would index the data. In the age of Google, who cares? Partly these concerns may reflect the diversity of the participants. Some subjects, such as phylogenetics, are built on reuse of previous data, and it's this reuse that makes data citation both important and potentially powerful (for more on this see my papers hdl:10101/npre.2009.3173.1 and doi:10.1093/bib/bbn022). In many ways, the data is more important than the publication. If I look at a phylogenetics paper published, say 5 or more years ago, the methods may be outmoded, the software obsolete (I might not be able to run it on a modern machine), and the results likely to be outdated (additional data and/or taxa changing the tree). So, the paper might be virtually useless, but the data continues to be of value. Furthermore, the great thing about data (especially sequence data) is that it can be used in all sorts of unexpected ways. In disciplines such as phylogenetics, data reuse is very common. In other areas in evolution and ecology, this might not be the case.

It will be clear from this that I buy the idea articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article and that the two are converging (I've argued for a long time that the best thing that could happen to phylogeneics would be if Molecular Phylogenetics and Evolution and TreeBASE were to merge and become one entity). Data submission would equal publication. In the age of Google where data is unreasonably effective (doi:10.1109/mis.2009.36, PDF here), privileging articles at the expense of data strikes me as archaic.

So, whither Dryad? I wish it every success, and I'm sure it will be a great start. There are some very clever people behind it, and it takes a lot of work to bring a community on board. However, I think Dryad's use of Handles is a mistake (they are the obvious choice of identifier given Dryad is based on DSpace), as this presents publishers with another identifier to deal with, and has none of the benefits of DOIs. Indeed, I would go further and say that the use of Handles + DSpace marks Dryad as being basically yet another digital library project, which is fine, but it puts it outside the mainstream of science publishing, and I think that is a strategic mistake. An example of how to do things better is Nature Precedings, which assigns DOIs to manuscripts, reports, and presentations. I think the use of DOIs in this context demonstrated that Nature was serious, and valued these sorts of resource. Personally, I'd argue that Dryad should be more ambitious, and see itself as a publisher, not a repository. In fact, it could think of itself as a journal publisher. Ironically, maybe the editors at the NESCent meeting were well advised to be wary, what they could be witnessing is the formation of a new kind of publication, where data is the article, and the article is data.

Wednesday, May 30, 2007

AMNH, DSpace, and OpenURL

Hate my tribe. Hate them for even asking why nobody uses library standards in the larger world, when “brain-dead inflexibility in practice” is one obvious and compelling reason, and “incomprehensibility” is another.

... $DEITY have mercy, OpenURL is a stupid spec. Great idea, and useful in spite of itself. But astoundingly stupid. Ranganathan preserve us from librarians writing specs! - Caveat Lector

OK, we're on a roll. After adding Journal of Arachnology and Pysche to my OpenURL resolver, I've no added the American Museum of Natural History's Bulletins and Novitates.

In an act of great generosity, the AMNH has placed its publications on a freely accessible DSpace server. This is a wonderful resource provided by one of the world's premier natural history museums (and one others should follow), and is especially valuable given that volumes of the Bulletins and Novitates post 1999 are also hosted by BioOne (and hence have DOIs), but these versions of the publications are not free.

As blogged earlier on SemAnt, getting metadata from DSpace in a actually usable form is a real pain. I ended up writing a script to pull everything off via the OAI interface, extract metadata from the resulting XML, do a DOI look-up for post 1999 material, then dump this into the MySQL server so my OpenURL service can find it.

Apart from the tedium of having to find the OAI interface (why oh why do people make this harder than it needs to be?), the metadata served up by the AMNH is, um, a little ropey. They use Dublin Core, which is great, but the AMNH makes a hash of using it. Dublin Core provides quite a rich set of terms for describing a reference, and guidelines on how to use it. The AMNH uses the same tag for different things. Take date, for example:


<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>1946</dc:date>

Now, one of these dates is the date of publication, the others are dates the metadata was uploaded (or so I suspect). So, why not use the appropriate terms? Like, for instance, <dcterms:created>. Why do I have to parse three fields, and intuit that the third one is the date of publication. Likewise, why have up to three <dc:title> fields, and why include an abbreviated citation in the title? And why for the love of God, format that citation differently for different articles!? Why have multiple <dc:description> fields, one of which is the abstract (and for which <dcterms:abstract> is available?). It's just a mess, and it's very annoying (as you can probably tell). I can see some hate library standards.

Anyway, after much use of Perl regular expressions, and some last minute finessing with Excel, I think we now have the AMNH journals available through OpenURL.

For a demo, go to David Shorthouse's list of references for spiders, say the letter P and click on the bioGUID symbol by a paper by Norm Platnick in the American Museum novitates.