Friday, April 30, 2010

Mendeley Open API and the Biodiversity Heritage Library

Mendeley have called for proposals to use their forthcoming API. The API will be publicly available soon, but in a clever move Mendeley will provide early access to developers with cool ideas.
Given that the major limitation of the Biodiversity Heritage Library (from my perspective) is the lack of article-level metadata, and Mendeley potentially has lots of such data, I wonder whether this is something that could be explored. My BioStor project takes article metadata and finds the corresponding articles in BHL, so an attractive workflow would be:
  1. People upload bibliographies to Mendeley (e.g., bibliographies for particular taxa, journals, etc.)

  2. BioStor uses Mendeley's API to find articles likely to be in BHL, then locates the actual article in BHL (a rough sketch of this step follows the list).

  3. The user could then grab a PDF of the article from BioStor that contains XMP metadata (which Mendeley, and other tools, can read).
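
As a thought experiment, step 2 might look something like the sketch below. The Mendeley side is a stub, because the API isn't public yet -- fetch_library() stands in for whatever call will return a user's bibliographic records -- and the BHL lookup goes through BioStor's OpenURL resolver, where the exact parameter names and the format=json switch are my assumptions rather than documented behaviour:

```python
# Hypothetical sketch: match records from a Mendeley library against BHL via
# BioStor's OpenURL resolver. fetch_library() is a stand-in for the (not yet
# public) Mendeley API; the OpenURL parameter names and the format=json switch
# are assumptions about BioStor's resolver, not documented behaviour.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def fetch_library():
    """Placeholder for a Mendeley API call returning bibliographic records."""
    return [{"title": "A new species of Eleutherodactylus",  # invented example
             "journal": "Herpetologica", "volume": "30",
             "spage": "1", "year": "1974"}]

def find_in_biostor(record):
    """Ask BioStor's OpenURL resolver whether BHL has this article."""
    params = {"genre": "article",
              "title": record["journal"],   # OpenURL 0.1 style: journal title
              "volume": record["volume"],
              "spage": record["spage"],
              "date": record["year"],
              "format": "json"}
    with urlopen("http://biostor.org/openurl?" + urlencode(params)) as response:
        return json.load(response)

for record in fetch_library():
    print(record["title"], "->", find_in_biostor(record))
```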

Users would gain a tool to manage their bibliographies (assuming that they prefer Mendeley to other tools, or are happy to sync with Mendeley), they would be contributing to a database of taxonomic literature (and of biological literature in general; BHL's content is pretty diverse), and they would also gain easy access to PDFs for BHL content (this last feature depends on whether Mendeley can automatically associate a PDF with an existing bibliographic record). In the same way, a tool such as BioStor (and, by implication, BHL) could gain usage statistics (i.e., who is reading these articles?).

Our community's efforts at assembling bibliographies haven't amounted to much yet. The tools we use tend to be poor: I find CiteBank underwhelming, and Drupal's bibliographic modules (used by CiteBank and ScratchPads) lack key features. We also seem reluctant to contribute to aggregated bibliographies. Perhaps encouraging people to use a nicer tool, while at the same time providing additional benefits (e.g., PDFs with XMP metadata), might help move things forward.

Anyway, food for thought. Perhaps other tools might make more sense, such as using the API to upload metadata and PDFs directly from BioStor to Mendeley and making the collection public. But, if I were Mendeley, what I'd be looking for are tools that enhance the Mendeley experience. There's some obvious scope for visualising the output and social networks of authors, such as the sparklines and coauthor graphs I've been playing with in BioStor (for example, for W E Duellman):

[Sparklines and coauthor graph for W E Duellman in BioStor]

Before this blog post starts to veer irretrievably off course, I'd be interested in the thoughts of anyone interested in matters BHL. There's nothing like a deadline (Friday, May 14th) to concentrate the mind...

Sunday, April 25, 2010

Time for some decent service

The BBC web site has an article entitled Giant deep sea jellyfish filmed in Gulf of Mexico, which has footage of Stygiomedusa gigantea and mentions an associated fish, Thalassobathia pelagica.

One thing that frustrates me beyond belief is how hard it is to get more information about these organisms. Put another way, the biodiversity informatics community is missing a huge opportunity here. There are a slew of services, such as Zemanta and OpenCalais.com, that can enrich the content of a document by identifying terms and adding links. Imagine a similar service that took taxonomic names and could provide information and links for each name, so that sites such as the BBC could enrich their pages. We've had various attempts at this1, but we are still far from creating something genuinely useful.
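
To make the idea concrete, here is a deliberately naive sketch of such a service. The binomial-spotting regex is crude (it will miss abbreviated genera like "S. gigantea" and produce false positives), and the resolver URL is hypothetical, standing in for whatever service actually answers the lookup:

```python
# A deliberately naive name-enrichment sketch: spot strings that look like
# Latin binomials and wrap each in a link to a resolver. The resolver URL is
# hypothetical; a real service would use a curated name-finding engine.
import re
from urllib.parse import quote

BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")

def enrich(text):
    def link(match):
        name = match.group(0)
        return f'<a href="http://example.org/resolve?name={quote(name)}">{name}</a>'
    return BINOMIAL.sub(link, text)

print(enrich("footage of Stygiomedusa gigantea, and an associated fish, "
             "Thalassobathia pelagica"))
```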

Part of the problem is that the plethora of taxonomic databases we have is often of little use. After fussing with Google I discover that Stygiomedusa gigantea (Browne, 1910) has the synonym Stygiomedusa fabulosa Russell, 1959 (see, e.g., the WoRMS database), but no database tells me that the genus Stygiomedusa was published by Russell in Nature in 1959 (doi:10.1038/1841527a0). Nor can I readily find the original reference for (Browne, 1910) in these databases2. Why is this so hard?
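
For contrast, this is roughly the single call I'd like to be able to make: one lookup that returns a name's status and its valid synonym. The sketch below is written against WoRMS's REST-style interface; the endpoint and field names are my reading of that service and should be treated as assumptions, and an ideal version would also return the original references as resolvable identifiers (e.g., DOIs):

```python
# A sketch of what a decent name lookup should be: one call returning the
# name's status and valid synonym. Endpoint and field names follow WoRMS's
# REST interface as I understand it; treat them as assumptions to verify.
import json
from urllib.parse import quote
from urllib.request import urlopen

name = "Stygiomedusa fabulosa"
url = "https://www.marinespecies.org/rest/AphiaRecordsByName/" + quote(name)
with urlopen(url) as response:
    for record in json.load(response):
        print(record["scientificname"], record["authority"],
              record["status"], "->", record["valid_name"])
```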

Then, when we do have information, we fail to make it digestible. For example, the EOL page for Thalassobathia pelagica links to BHL pages, but fails to point out that the pages it links to belong to a single article, and that this article (http://biostor.org/reference/4339) is the original description of the fish.

Publishers are increasingly interested in any tools that can embellish their content. The organisation that gets their act together and provides a decent service for publishers (including academic journals, and news services such as the BBC) is going to own this space. Any takers...?

  1. Such as uBio LinkIT and EOL NameLink.
  2. After finding another taxon with the author Browne 1910 in BHL, I found Diplulmaris (?) gigantea, which looked like a good candidate for the original name of the jellyfish (see http://biodiversitylibrary.org/page/1727009). This is confirmed by the Smithsonian's Antarctic Invertebrates site.

Thursday, April 22, 2010

What I want from a web phylogeny viewer - XML, SVG and Newick round tripping

Random half-formed idea time. Thinking about marking up an article (e.g., from PLoS) with a phylogeny (such as the image below, see doi:10.1371/journal.pone.0001109.g001), I keep hitting the fact that existing web-based tree viewers are, in general, crap.
[Phylogeny from doi:10.1371/journal.pone.0001109.g001]

Given that a PLoS article is an XML document, it would be great if the tree diagram were itself XML, in particular SVG. But, in one sense, we don't want just a diagram, we want access to the underlying tree (for example, so we can play with it in other software). The tree may or may not be available in TreeBASE, but what if the diagram itself were the tree? In other words, imagine a tree-viewing program that could output SVG, structured in such a way that with an XSLT stylesheet the underlying tree could be extracted (say in Newick or, gack, NeXML) from the SVG, but users could take the SVG and embellish it (in Adobe Illustrator or Inkscape). The nice illustration and the tree data structure would be one and the same thing! No getting tree and illustration out of sync, and no hoping authors have put the tree in a database somewhere -- the article contains the tree.
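
Here's a toy sketch of the round trip (namespaces and real layout omitted for brevity): clades map to nested <g> elements, so the Newick string can be recovered from the finished illustration by walking the XML, which is exactly the sort of extraction an XSLT stylesheet could do:

```python
# Toy sketch of "the SVG is the tree": clades map to nested <g> elements, so
# the Newick string can be recovered from the finished illustration. Layout is
# a crude placeholder; real SVG would need the xmlns and drawn branches.
import xml.etree.ElementTree as ET

def to_svg(clade, x, state):
    """clade is a leaf name (str) or a list of child clades."""
    g = ET.Element("g")
    if isinstance(clade, str):
        g.set("class", "leaf")
        label = ET.SubElement(g, "text", x=str(x), y=str(state["y"]))
        label.text = clade
        state["y"] += 15
    else:
        g.set("class", "clade")
        for child in clade:
            g.append(to_svg(child, x + 20, state))
    return g

def to_newick(g):
    """Invert the mapping: walk the nested <g> elements back to Newick."""
    if g.get("class") == "leaf":
        return g.find("text").text
    return "(" + ",".join(to_newick(child) for child in g) + ")"

tree = [["Homo", "Pan"], "Gorilla"]        # ((Homo,Pan),Gorilla)
root = to_svg(tree, 10, {"y": 20})
print(to_newick(root) + ";")               # prints ((Homo,Pan),Gorilla);
```

Note that to_newick() never looks at coordinates, so an author could restyle the drawing in Illustrator or Inkscape without breaking the tree, provided the <g> nesting is preserved.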

In order for this to happen, we need a tree viewer that exports SVG, and that ideally would allow annotation so that the author could do most of the work within that program (ensuring that the underlying tree object isn't broken by graphic editing). Then export the SVG, add extra bits in Illustrator/Inkscape if needed, and have it incorporated into the article XML (which is what the publisher uses to render the article on the web). Simples.

Sunday, April 18, 2010

Elsevier Grand Challenge paper out

At long last the peer-reviewed version of the paper "Enhanced display of scientific articles using extended metadata" (doi:10.1016/j.websem.2010.03.004), in which I describe my entry in the Elsevier Grand Challenge, has appeared in the journal Web Semantics: Science, Services and Agents on the World Wide Web. The pre-print version of this paper (hdl:10101/npre.2009.3173.1) had been online for a year before the published version appeared (24 April 2009 versus 3 April 2010), and the Challenge entry itself went online in December 2008. Unfortunately the published version has an awful typo in the title (one that was in neither the manuscript nor the proofs).

Given this typo, the time lag between doing the work, writing the manuscript, and seeing it published, and the fact that I've already been to meetings where my invitation was based on the entry and the pre-print, I do wonder why on Earth I would bother with traditional publication (which is somewhat ironic, given the topic of the paper).

Wednesday, April 14, 2010

Biodiversity informatics = #fail (and what to do about it)

The context for this post is the PLoS markup meeting held at the California Academy of Sciences over the weekend (many thanks to Brian Fisher for the invitation). PLoS are launching a "biodiversity hub" and were looking for ideas on how to implement this. The fact that nobody -- least of all those attending from PLoS -- could adequately explain what a hub was made things a tad tricky, but that didn't matter, because PLoS did know when the first iteration of the hub was going live (later this summer). So, once we got past the fact that PLoS operates with a timeline that says "cool stuff will happen here" and then sets about figuring out what that cool stuff will actually be (in retrospect you gotta admire this approach), we then tried to figure out what PLoS needed from us.

That's when things got messy. It became very clear that PLoS wanted basic things like, you know, information on names, being able to link to specimens, etc., and our community can't do this, at least not yet. Nor can we provide simple answers to simple questions. For example, Rich Pyle gave an overview of taxonomic names, nomenclature, concepts, and the horrendous alphabet soup of databases (uBio, ZooBank, IPNI, IndexFungorum, GNA, GNUB, GNI, CoL, etc.) that have a stake in this. You could see the look of horror in the eyes of the PLoS developers who were tasked with making the hub happen ("run away, run away now"). And this was just the simple version of things. In a week where taxonomy was in the news because of the possibility that Drosophila melanogaster would have to, *cough*, change its name (doi:10.1038/464825a)1, this was not a great start.

At each step when we outlined some of the stuff that would be cool, it became clear we couldn't deliver what we were actually arguing PLoS should do. For example, we have millions of digitised specimen records, and lots of papers refer to these specimens by name, but because individual specimens don't have URIs we can't refer to them (instead we have horrific query interfaces like TAPIR, see Accessing specimens using TAPIR or, why do we make this so hard?). We're digitising the taxonomic literature, but don't provide a way to link this to modern literature at the level of granularity publishers use (i.e., articles).

Readers of this blog will have heard this all before, but what made this meeting different was we actually had a "customer" rock up and ask for our help to enhance their content and create something useful for the community...and the best we could do was um and er and confess we couldn't really give them what they wanted2.

Think of the children
It's time biodiversity informatics stopped playing "let's make an acronym", stopped trying to keep taxonomists happy (face it, that's never going to happen, and frankly, they'll be extinct soon anyway), and stopped obsessing about who owns the data, and instead focused on delivering some simple, solid services that address the needs of people who, you know, will actually do something useful with them. Otherwise we'll be like digital librarians, who thought people would search the way librarians do, then got their noses out of joint when Google ate their lunch.

It's time to make some simple services, and stop the endless cycle of inward-looking meetings where we talk to each other. We need to learn to hide what people don't need (or want) to see. We need to be able to:

  1. Extract entities from text, e.g. scientific names, specimen codes, localities, GenBank accession numbers (a toy version of this is sketched after the list).

  2. Look up a taxonomic name and return basic information about that name (rather like iSpecies, but as a service).

  3. Make specimen codes resolvable.

  4. Make taxonomic literature accessible using identifiers and tools publishers know about (that means DOIs and OpenURL).
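
To show how low the bar is for service 1, here's a toy extractor using nothing but regular expressions. The patterns are crude and the example strings invented; a real service would validate candidates against name indexes (e.g., uBio) and known collection acronyms:

```python
# A toy version of service 1: pull candidate entities out of plain text with
# regular expressions. Real extraction would check candidates against name
# indexes and collection acronyms, but even this crude pass finds a lot.
import re

PATTERNS = {
    "taxon name":        r"\b[A-Z][a-z]+ [a-z]{3,}\b",
    "GenBank accession": r"\b[A-Z]{1,2}\d{5,6}\b",
    "specimen code":     r"\b[A-Z]{2,6} \d{3,6}\b",   # e.g. "USNM 529413"
}

def extract(text):
    return {kind: re.findall(pattern, text)
            for kind, pattern in PATTERNS.items()}

print(extract("Holotype USNM 529413, GenBank AY851234, "
              "identified as Thalassobathia pelagica."))
```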


We're close to a lot of this already, but we're still far enough away to make some of this non-trivial. And we keep having meetings about this stuff, and fail to actually get it done. Something is wrong somewhere when E O Wilson has his name on yet another call for megabucks for a biodiversity project (the "Barometer of Life", doi:10.1126/science.1188606). At what point will someone ask "um, we've given you guys a lot of money already, why can't you tell me the stuff we need to know?"

Let me just say that I'm a short term pessimist, but a long term optimist. The things I complain about will get fixed, one day. It's just that I see little evidence they'll get fixed by us. Prove me wrong, go on, I dare you...

  1. Personally I'm intensely relaxed about Drosophila melanogaster remaining Drosophila melanogaster, even if it ends up in a clade surrounded by flies with other generic names. Having (a) a stable name and (b) knowing where it fits in the tree of life is all we need to do science.

  2. At the meeting I couldn't stop thinking of the scene in The West Wing where President Bartlet walks up to the Capitol for an impromptu meeting with the Speaker of the House to sort out the budget, and is left waiting outside while the Speaker sorts out his game plan. By the time the Speaker is ready, the President has turned on his heel and left, making the Speaker look a tad foolish.


Thursday, April 08, 2010

BioStor gets PDFs with XMP metadata - bibliographies made easy with Mendeley and Papers

The ability to create PDFs for the articles BioStor extracts from the Biodiversity Heritage Library has been the single most requested feature for BioStor. I've taken a while to get around to this -- for a bunch of reasons -- but I've finally added it today. You can get a PDF of an article by either clicking on the PDF link on the page for an article, or by appending ".pdf" to the article URL (e.g., http://biostor.org/reference/570.pdf). In some ways the BioStor PDFs are pretty basic - they contain page images, not the OCR text, so they tend to be quite large and you can't search for text within the article. But what they do have is XMP metadata.

XMP metadata
One of the great bugbears of organising bibliographies is the lack of embedded metadata in PDFs; in other words, Why can't I manage academic papers like MP3s? (see my earlier post for some background). Music files and digital photos contain embedded metadata that store information such as song title and artist in the case of music, or date, exposure, camera model, and GPS co-ordinates in the case of digital images. This means software (and web sites such as Flickr) can automatically organise your collection of media based on this embedded metadata.

Wouldn't it be great if there were an equivalent for PDFs of papers, whereby the PDF contains all the relevant bibliographic details (article title, authorship, journal, volume, pages, etc.), and reference-managing software could read this and automatically put the PDF into whatever categories you choose (e.g., by author, journal, or date)? Well, at least two software programs can do this, namely the cross-platform Mendeley, and Papers, which supports Apple's Macintosh, iPhone, and iPad platforms. Both programs can read bibliographic metadata in Adobe's Extensible Metadata Platform (XMP), which has been adopted by journals such as Nature, and CrossRef has recently been experimenting with providing services to add XMP to PDFs.

One reason I put off adding PDFs to BioStor was the issue of simply generating dumb PDFs, for which users would then have to retype the corresponding bibliographic metadata if they wanted to store the PDF in a reference manager. However, given that both Papers and Mendeley support XMP, you can simply drag a BioStor PDF onto either program and it will extract the details for you (including a list of up to 10 taxonomic names found in the article). Both Papers and Mendeley support the notion of a "watched folder" where you can dump PDFs and they will "automagically" appear in your reference manager's library. Hence, if you use either program you should be able to simply download PDFs from BioStor and add them to your library without having to retype anything at all.

Technical details
This post is being written as I'm waiting to catch a plane, so I haven't time to go into all the gory details. The basic tools I used to construct the PDFs were FPDF and ExifTool, which supports injecting XMP into PDFs (I couldn't find another free tool that could insert XMP into a PDF that didn't already have any XMP metadata). I store basic Dublin Core and PRISM metadata in the PDF. The ten most common taxonomic names found in the pages of the article are stored as subject tags.
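
For anyone wanting to replicate this, the injection step looks roughly like the sketch below, driving ExifTool from Python for illustration (BioStor itself uses FPDF plus ExifTool). The XMP-dc and XMP-prism tag names are my reading of ExifTool's tag groups and worth checking against its documentation, and all the values are invented:

```python
# Sketch of injecting Dublin Core and PRISM metadata into a PDF via ExifTool.
# Tag names are my reading of ExifTool's XMP-dc and XMP-prism groups; check
# them against the ExifTool documentation. All values here are invented.
import subprocess

def add_xmp(pdf_path, title, authors, journal, volume, spage, epage, keywords):
    args = ["exiftool",
            f"-XMP-dc:Title={title}",
            *(f"-XMP-dc:Creator={a}" for a in authors),    # list-type tag
            *(f"-XMP-dc:Subject={k}" for k in keywords),   # taxon names go here
            f"-XMP-prism:PublicationName={journal}",
            f"-XMP-prism:Volume={volume}",
            f"-XMP-prism:StartingPage={spage}",
            f"-XMP-prism:EndingPage={epage}",
            "-overwrite_original", pdf_path]
    subprocess.run(args, check=True)

add_xmp("570.pdf", "An example article title", ["A. N. Author"],
        "Example Journal", "30", "1", "6", ["Stygiomedusa gigantea"])
```

Running exiftool -XMP:All on the result should show the embedded record.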

Initially it appeared that only Papers could extract the XMP; Mendeley failed completely (somewhat confirming my prejudices about Mendeley). However, I sent an example PDF to Mendeley support, and they helpfully diagnosed the problem. Because XMP metadata can't always be trusted, Mendeley compares the title and author values in the XMP metadata with text on the first couple of pages of the PDF; if they match, the program accepts the XMP metadata. Because my initial efforts at creating PDFs just contained the BHL page images and no text, they wouldn't pass Mendeley's tests. Hence, I added a cover page containing the basic bibliographic metadata for the article, and now Mendeley is happy (the program itself is growing on me, but if you're an Apple fanboy like me, Papers has native look and feel, and syncing your library with your iPhone is a killer feature). There are a few minor differences in how Papers and Mendeley handle tags: Papers will take text in the Dublin Core "Subject" tag and use it as keywords, whereas to get Mendeley to extract tags I had to store them using the "Keywords" tag (implemented using FPDF's SetKeywords function). But after a bit of fussing I think the BioStor PDFs should play nice in both programs.