Friday, November 28, 2008

Challenge entry

I've submitted my entry for the Elsevier Grand Challenge. The paper describing the entry is available from Nature Precedings (doi:10.1038/npre.2008.2579.1). The web site demo is at http://iphylo.org/~rpage/challenge/www/. I'm now officially knackered.

Wednesday, November 26, 2008

Sequencing Carmen Electra


One byproduct of playing with the Challenge Demo is that I come across some rather surprising results. For example, the rather staidly titled "Cryptic speciation and paraphyly in the cosmopolitan bryozoan Electra pilosa—Impact of the Tethys closing on species evolution" (doi:10.1016/j.ympev.2007.07.016) starts to look a whole lot more interesting given the taxon treemap (right).

The girl is Carmen Electra, which is understandable given the Yahoo image search was for "Electra" (a genus of bryozoan). However, what are the wild men (and women) doing at the top? Turns out this is the result of searching for the genus Homo. But why, you ask, does a paper on bryozoans have human sequences? Well, looks like the table in the paper has incorrect GenBank accession numbers. The sequences AJ711044-50 should, I'm guessing, be AJ971044-50.

Ironically, although it was Carmen Electra's photo that initially made me wonder what was going on, it's really the hairy folks above her image that signal something is wrong. I've come across at least one other example of a paper citing incorrect sequences, so it might be time to automate this checking. Or, what is probably going to be more fun, looking at treemaps for obviously wrong images and trying to figure out why.
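A first step towards automating the check would be expanding an abbreviated accession range such as AJ711044-50 into individual accession numbers, each of which could then be looked up to see which organism it belongs to. The sketch below only does the expansion; the function name is made up for illustration, and it assumes the range is written as a letter prefix, a start number, and a shortened end number, as in the paper's table.

```python
import re

def expand_accession_range(text):
    """Expand a range like 'AJ711044-50' into a list of accession numbers.

    Assumes the part after the hyphen replaces the trailing digits of the
    start number (so '-50' means the range ends at ...050).
    """
    match = re.fullmatch(r"([A-Z]+)(\d+)-(\d+)", text)
    if match is None:
        raise ValueError(f"not an accession range: {text!r}")
    prefix, start, end_suffix = match.groups()
    if len(end_suffix) > len(start):
        raise ValueError(f"end of range is longer than start: {text!r}")
    # Rebuild the full end number from the start's leading digits.
    end = start[: len(start) - len(end_suffix)] + end_suffix
    width = len(start)
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]

accessions = expand_accession_range("AJ711044-50")
print(accessions[0], accessions[-1], len(accessions))
```

Flagging a paper would then be a matter of comparing the organisms attached to those accessions with the taxa the paper is actually about, which needs a sequence database lookup that is not shown here.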

Challenge Demo online

I've put my Elsevier Challenge demo online. I'm still loading data into it, so it will grow over the next day or so. There's also the small matter of writing a paper on what's under the hood of the demo. Feel free to leave comments on the demo home page.

For an example of what the project does, take a look at Mitochondrial paraphyly in a polymorphic poison frog species (Dendrobatidae; D. pumilio), then compare it to the same publication in Science Direct (doi:10.1016/j.ympev.2007.06.010).

Monday, November 24, 2008

What is a study about? Treemaps of taxa


One of the things I've struggled with most in putting together a web site for the challenge is how to summarise the taxonomic content of a study. Initially I was playing with showing a subtree of the NCBI taxonomy, highlighting the taxa in the study. But this assumes the user is familiar with the scientific names of most of life. I really wanted something that tells you "at a glance" what the study is about.

I've settled (for now, at least) on using a treemap of images of the taxa in the study. I've played with treemaps before, and have never been totally convinced of their utility. However, in this context I think they work well. For each paper I extract the taxonomic names (via the GenBank sequences linked to the paper), group them into genera, and then construct a treemap where the size of each cell is proportional to the number of species in each genus. Then I harvest images from Flickr and/or Yahoo's image search APIs and display a thumbnail with a link to the image source.
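The grouping step can be sketched roughly as follows. This is not the demo's actual code: the function name is invented, the taxon names are placeholders, and it assumes each name is a plain "Genus species" binomial (anything after the second word is ignored).

```python
from collections import Counter

def genus_cell_sizes(taxon_names):
    """Return {genus: number of distinct species}, used as treemap cell sizes."""
    # Reduce each name to its binomial and drop duplicates, so a species
    # with many sequences still counts once.
    species = {
        " ".join(name.split()[:2])
        for name in taxon_names
        if len(name.split()) >= 2
    }
    return dict(Counter(binomial.split()[0] for binomial in species))

# Placeholder names, not taken from any real study.
names = ["Aus alpha", "Aus beta", "Aus alpha", "Bus gamma", "Bus gamma subsp. x"]
print(genus_cell_sizes(names))
```

The resulting counts are what a treemap layout would take as cell weights; the layout itself and the image harvesting are separate steps.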


I'm hoping that these treemaps will give the user an almost instant sense of what the study is about, even if it's only "it's about plants". The treemap above is for Frost et al.'s The amphibian tree of life (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2), the one to the right is for Johnson and Weese's "Geographic distribution, morphological and molecular characterization, and relationships of Lathrocasis tenerrima (Polemoniaceae)".

Note that the more taxa a study includes the smaller and more numerous the cells (see below). This may obscure some images, but gives the user the sense that the study includes a lot of taxa. The image search isn't perfect, but I think it works well enough for my purposes.

Saturday, November 22, 2008

Elsevier Grand Challenge Video


Elsevier have released this video about the challenge, featuring a few of the contestants. I couldn't get my act together in time to send anything useful, and having seen the 16 gigabytes song (full version here), I'm glad I didn't -- there's just no way I could compete with Michael Greenacre and Trevor Hastie.

Thursday, November 20, 2008

OpenRef and bioGUID

One of the judges for the Elsevier Article 2.0 Contest is Andrew Perry, whose blog has some posts on Noel O'Boyle's OpenRef idea (see DOI or DOH? Proposal for a RESTful unique identifier for papers). Andrew discusses some implementations he has come up with, and compares OpenRef with OpenURL. This prompted me to add OpenRef-style identifiers to bioGUID's OpenURL resolver.

Basically, OpenRef is a human-readable identifier for an article, based on concatenating the journal name, year of publication, volume number, and starting page, for example:

openref://BMC Bioinformatics/2007/8/487

The equivalent OpenURL link would be

http://bioguid.info/openurl/?genre=article&title=BMC%20Bioinformatics&date=2007&volume=8&spage=487
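To make the correspondence explicit, here is a small sketch that turns an OpenRef into the equivalent OpenURL query. The function name is hypothetical, and it assumes the last three path segments are always year, volume, and starting page, with everything before them being the journal name.

```python
from urllib.parse import quote, urlencode

RESOLVER = "http://bioguid.info/openurl/"

def openref_to_openurl(openref):
    """Convert 'openref://Journal/Year/Volume/Page' into an OpenURL link."""
    scheme = "openref://"
    if not openref.startswith(scheme):
        raise ValueError(f"not an OpenRef: {openref!r}")
    # Split from the right so a journal name containing '/' stays intact.
    parts = openref[len(scheme):].rsplit("/", 3)
    if len(parts) != 4:
        raise ValueError(f"expected journal/year/volume/page: {openref!r}")
    journal, year, volume, spage = parts
    query = urlencode(
        [
            ("genre", "article"),
            ("title", journal),
            ("date", year),
            ("volume", volume),
            ("spage", spage),
        ],
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"{RESOLVER}?{query}"

print(openref_to_openurl("openref://BMC Bioinformatics/2007/8/487"))
```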

Andrew notes:
A key cosmetic (and philosophical) difference between OpenURL and OpenRef/ResolveRef URLs is that OpenURL uses HTTP GET fields, eg ?title=bla&issn=12345, while OpenRef/ResolveRef uses the URL path itself eg, somejournalname/2008/4/1996. It’s a bit like one scheme was designed in the age of CGI scripts, while the other was designed for web applications capable of more RESTful behaviour. In my mind OpenURL is more versatile but much uglier, while OpenRef is cleaner and simpler but can only reference journal articles.
Of course, it is straightforward to add openref-style URLs to an OpenURL resolver by using URL rewriting, for example:


RewriteRule ^openref/(.*)/([0-9]{4})/(.*)/(.*) openurl.php?title=$1&date=$2&volume=$3&spage=$4&genre=article [NC,L]

I've done this for my resolver. One limitation of OpenRef is that there are many different ways to write a journal's name, so you can't determine whether two OpenRefs refer to the same journal by simply string matching (as you can with a DOI, for example -- if the DOIs are different the article is different). For example I might write BMC Bioinformatics and you might write BMC Bioinf.. One way around this is to have unique identifiers for journals, which of course is the approach Robert Cameron advocated with Universal Serial Item Names and JACCs. The obvious candidate for journal identifier is the ISSN. I guess the problem here is that it's easier to use the journal name rather than require the user to know the ISSN. OpenRefs are certainly easier to write. Hence, I think they are great as a simple way for people to construct a resolvable URL for an article, but not so great as an identifier.

Elsevier Article 2.0 Contest

Chris Freeland's tweet alerted me to the Elsevier Article 2.0 Contest:
Elsevier Labs is inviting creative individuals who have wanted the opportunity to view and work with journal article content on the web to enter the Elsevier Article 2.0 Contest. Each contestant will be provided online access to approximately 7,500 full-text XML articles from Elsevier journals, including the associated images, and the Elsevier Article 2.0 API to develop a unique yet useful web-based journal article rendering application. What if you were the publisher? Show us your preference!
Elsevier are clearly looking for ideas (they also have their Grand Challenge), and there's been some interesting commentary on the Article 2.0 contest.
The site provides some sample applications (written in XQuery), which you can play with by going to the list of journals that are included in the challenge and clicking down through volume and issue until you get to individual articles.

Saturday, November 15, 2008

EOL on CBS


CBS News Sunday Morning Segment on the EOL. All fun stuff (Paddy skewering the interviewer who fails to recognise an echidna), but still long on promises and short on actual product.

Monday, November 10, 2008

Rewriting DOIs

One problem with my cunning plan to use MediaWiki REDIRECTs to handle DOIs is that some DOIs, such as those that BioOne serves based on SICIs, contain square brackets, [ ], which conflict with wiki syntax (for example, doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2). I want to enable users to enter a raw DOI, so I've been playing with a simple URL rewrite in Apache's httpd.conf, namely:

RewriteRule ^/wiki/doi:(.*)\[(.*)\](.*)$ /w/index.php?title=Doi:$1-$2-$3 [NC,R]

This replaces the [ and ] in the original DOI with hyphens, then forces a new HTTP request (hence the R flag in the [NC,R] at the end of the line). This keeps MediaWiki happy, at the cost of the REDIRECT page having a DOI that looks slightly different from the original. However, it means the user can enter the original DOI in the URL, and not have to manually edit it.
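To see what the rule does to a given URL path, the same pattern can be tried out in Python. This is only an illustration of the substitution (Apache does the real work, and its regex engine is not Python's, though this simple pattern should behave the same way); the function name is made up.

```python
import re

# Same pattern as the RewriteRule; IGNORECASE stands in for the NC flag.
DOI_WITH_BRACKETS = re.compile(r"^/wiki/doi:(.*)\[(.*)\](.*)$", re.IGNORECASE)

def rewrite_doi_path(path):
    """Return the rewritten path, or the path unchanged if the rule doesn't match."""
    match = DOI_WITH_BRACKETS.match(path)
    if match is None:
        return path
    before, inside, after = match.groups()
    # The bracketed part is kept, but the brackets become hyphens.
    return f"/w/index.php?title=Doi:{before}-{inside}-{after}"

print(rewrite_doi_path("/wiki/doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2"))
```

A DOI without square brackets does not match the pattern, so it passes through untouched.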

From bibliographic coupling to data coupling

Bibliographic coupling is a term coined by Kessler (doi:10.1002/asi.5090140103) in 1963 as a measure of similarity between documents. If two documents, A and B, cite a third, C, then A and B are coupled.
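As a rough sketch of the idea, the coupling between two documents can be measured by counting the references they share. The function name and the toy reference lists below are invented for illustration.

```python
from itertools import combinations

def coupling_strengths(references):
    """Map each coupled pair of documents to the number of references they share.

    `references` maps a document id to the list of things it cites.
    Pairs with nothing in common are left out.
    """
    strengths = {}
    for a, b in combinations(sorted(references), 2):
        shared = set(references[a]) & set(references[b])
        if shared:
            strengths[(a, b)] = len(shared)
    return strengths

# Toy data: A and B both cite C, so they are coupled; E is not coupled to either.
refs = {"A": ["C", "D"], "B": ["C"], "E": ["F"]}
print(coupling_strengths(refs))
```

Nothing in this requires the cited things to be papers: the same function works if the lists hold sequence accession numbers or specimen codes.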

I'm interested in extending this to data, such as DNA sequences and specimens. In part this is because within the challenge dataset I'm finding cases where authors cite data, but not the paper publishing the data. For example, a paper may list all the DNA sequences it uses (thus citing the original data), but not the paper providing the data.

To make this concrete, the paper "Towards a phylogenetic framework for the evolution of shakes, rattles, and rolls in Myiarchus tyrant-flycatchers (Aves: Passeriformes: Tyrannidae)" (doi:10.1016/S1055-7903(03)00259-8) lists the sequences used, but does not cite the source of three of these, which is the Science paper "Nonequilibrium Diversity Dynamics of the Lesser Antillean Avifauna" (doi:10.1126/science.1065005). As a result, if I were reading "Nonequilibrium Diversity Dynamics of the Lesser Antillean Avifauna" and wanted to learn who had cited it, I would miss the fact that the paper "Towards a phylogenetic framework for the evolution of shakes, rattles, and rolls..." had used the data (and hence, in effect, "cited" the paper). In some cases, data citation may be more relevant than bibliographic citation because it relates to people using the data, which seems a more significant action than simply reading the paper.
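Finding these "hidden" citations could be sketched like this, assuming we already know which paper first published each sequence. The function name is hypothetical, and the sequence ids (SEQ1 and so on) and the second DOI are placeholders, not real accession numbers or papers.

```python
def uncited_data_sources(sequences_used, cited_dois, source_of):
    """Return DOIs of papers whose data is used but which are not cited.

    sequences_used: sequence ids listed in the paper
    cited_dois:     DOIs in the paper's reference list
    source_of:      {sequence id: DOI of the paper that published it}
    """
    sources = {source_of[seq] for seq in sequences_used if seq in source_of}
    return sorted(sources - set(cited_dois))

# Placeholder data for illustration only.
source_of = {
    "SEQ1": "10.1126/science.1065005",
    "SEQ2": "10.1126/science.1065005",
    "SEQ3": "10.9999/placeholder",
}
print(uncited_data_sources(["SEQ1", "SEQ2", "SEQ3"], ["10.9999/placeholder"], source_of))
```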

Note that I'm not interested in the issue of credit as such. In the above example, the authors of the Science paper are also coauthors of the "shakes, rattles, and rolls" paper, and hence show commendable restraint in not citing themselves. I'm interested in the fate of the data. Who has used it? What have they done with it? Has anybody challenged the data (for example, suggesting a sequence was misidentified)? These are the things that a true "web of data" could tell us.

Wednesday, November 05, 2008

Defrosting the Digital Library


Duncan Hull alerted me to his paper "Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web" (PLoS Computational Biology, doi:10.1371/journal.pcbi.1000204). Here's the abstract:

Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as “thought in cold storage,” and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.

It's an interesting read, and it also <shameless plug>cites my bioGUID project</shameless plug>.

[Image from dave 7]