iPhylo: September 2010

Roderic D. M. Page

Friday, September 24, 2010

Mendeley Connect

When I first launched BioStor (an article finding tool built on the top of the (Biodiversity heritage Library) I wanted people to be able to edit metadata and add references, but also minimise the chances that junk would get added. As a quick and dirty deterrent I used reCAPTCHA, so anybody adding a reference or editing the metadata had to pass a CAPTHCA before their edits were accepted.

While reCAPTCHA does the trick, it can be tedious for somebody editing a lot of articles to have to pass a CAPTHCA every time they edit an article. Ed Baker of the International Commission on Zoological Nomenclature (ICZN) has a project to identify all the articles in the Bulletin of Zoological Nomenclature, and has been gently bugging me to add a login feature to BioStor. I played for a while with OpenID, but it occurred to me that Mendeley might be a more sensible strategy. Mendeley's API supports OAuth, a protocol where you can grant an application access to another application, but without giving away any passwords. It's used by Twitter and Facebook, among others. Indeed, a growing number of sites on the web are using Twitter and/or Facebook services to enable users to log in, rather than write their own code to support login, usernames, passwords, etc.

In the case of BioStor, I've added a link to sign in via Mendeley. if you click on it you get taken to a page like this:

If you're happy for BioStor to connect to Mendeley, you click on Accept and BioStor won't bug you to fill in a CAPTCHA. Once Mendeley's API matures it would be nice to add features such as the ability to add a reference in BioStor straight to your Mendeley library (this is doable now, but the Mendeley API looses some key metadata such as page numbers).

But, thinking more broadly, Mendeley has an opportunity here to provide services similar to Facebook Connect. For example, instead of simply having buttons on web pages to bookmark papers, we could have buttons indicating how many people had added a paper to their library, and whether any of those people were in your contacts. We could extend this further an create something like Facebook's Open Graph Protocol, which supports the "Like" button. Or perhaps, we could have an app that integrates with Facebook and harvests your "Likes" that are papers.

Food for thought. Meantime, I hope users like Ed will find BioStor less tedious to use now that they can log in via Mendeley.

Wednesday, September 22, 2010

GeoCouch

@mikeal a little tedious. you can take OSM and then convert it to SHP and then http://github.com/maxogden/shp2geocouchless than a minute ago via webmax ogden
maxogden

The tweet above inspired me to take a quick look at GeoCouch, a version of CouchDB that supports spatial queries. This is something I need if I'm going to start playing seriously with CouchDB. So, it was off to Installing and working with GeoCouch, grabbing a copy of HomeBrew (yet another package manager for Mac OS X), in the hope of installing GeoCouch. Things went fairly smoothly, although it took what seemed like an age to build everything. But I now have GeoCouch running. Previously I'd been running CouchDB using http://janl.github.com/couchdbx/, which launches vanilla CouchDB. However, if you launch CouchDBX after starting GeoCouch from the command line, CouchDBX is talking to GeoCouch.

I then grabbed shp2geocouch to try some shape files (I grabbed some shape files from the IUCN to play with). If you're on a Mac grab GISLook to get Quick Look previews of these files. Since I'm new to ruby there were a couple of gotchas, such as lacking some prerequisites (httparty and couchrest, both installed by typing gem install <name of package>), and there was the small matter of needing to add ~/.gem/ruby/1.8/bin to my path so I could find shp2geocouch (spot the ruby neophyte). The shape file didn't get processed completely, but at least I managed to get some data into GeoCouch.

So far I've been playing with the examples at http://github.com/vmx/couchdb, and things seem to work. At least, the basic bounding box queries work. I'm tempted to play with this some more (and get my head arounbd GeoJSON), perhaps trying to recreate the functionality of my Elsevier Challenge entry, for which I wrote a custom key-value database that was awfully clunky.

Finding scientific articles in a large digital archive: BioStor and the Biodiversity Heritage Library

Yesterday I uploaded a manuscript to Nature Precedings that describes the inner workings of BioStor. The title is "Finding scientific articles in a large digital archive: BioStor and the Biodiversity Heritage Library", and you can grab it here: hdl:10101/npre.2010.4928.1.

Manuscripts describing databases are usually pretty turgid affairs, and this isn't an exception, despite my attempts to spice it up with the tale of ~~Leviathan~~, oops, Livyatan (see doi:10.1038/nature09381 and Wikipedia). Plus, I can't escape the thought that BioStor would have been a lot more fun to write if I'd used a key-value database like CouchDB. I fear this is often the way of things. By the time it comes to writing something up, you realise that if you could start over you'd do it rather differently.

Monday, September 13, 2010

BHL and the iPad

@elyw I'd leave bookmarking to 3rd party, e.g. Mendeley. #bhlib specific issues incl. displaying DjVu files, and highlighting taxon namesless than a minute ago via Tweetie for MacRoderic Page
rdmpage

Quick mock-up of a possible BHL iPad app (made using OmniGraffle), showing a paper from BioStor(http://biostor.org/reference/50335). Idea is to display a scanned page at a time, with taxonomic names on page being clickable (for example, user might get a list of other BHL content for this name). To enable quick navigation all the pages in the document being viewed are displayed in a scrollable gallery below main page.

Key to making this happen is being able to display DjVu files in a sensible way, maybe building on DjVu XML to HTML. Because BHL content is scanned, it makes sense to treat content as pages. We could extract OCR text and display that as a continuous block of text, but the OCR is sometimes pretty poor, and we'd also have to parse the text and interpret its structure (e.g., this is the title, these are section headings, etc.), and that's going to be hard work.

Friday, September 10, 2010

Touching citations on the iPad

Quick demo of the mockup I alluded to in the previous post. Here's a screen shot of the article "PhyloExplorer: a web server to validate, explore and query phylogenetic trees" (doi:10.1186/1471-2148-9-108) as displayed as a web-app on the iPad. You can view this at http://iphylo.org/~rpage/ipad/touch/ (you don't need an iPad, although it does work rather better on one).

I've taken the XML for the article, and redisplayed it as HTML, with (most) of the citations highlighted in blue. If you touch one (or click on it if you're using a desktop browser) then you'll see a popover with some basic bibliographic details. For some papers which are Open Access I've extracted thumbnails of the figures, such as for "PhyloFinder: an intelligent search engine for phylogenetic tree databases" (doi:10.1186/1471-2148-8-90), shown above (and in more detail below):

The idea is to give the reader a sense of what the paper is about, beyond can be gleaned from just the title and authors. The idea was inspired by the Biotext search engine from Marti Hearst's group, as well as Elsevier's "graphical abstract" noted by Alex Wild (@Myrmecos).

Here's a quick screencast showing it "live":

iPad citation popover from Roderic Page on Vimeo.

The next step is to enable the reader to then go and read this paper within the iPad web-app (doh!), which is fairly trivial to do, but it's Friday and I'm already late...

CouchDB, Mendeley, and what I really want in an iPad article viewer

Playing with @couchdb, starting to think of the Mendeley API as a read/write JSON store, and having a reader app built on that...less than a minute ago via Tweetie for MacRoderic Page
rdmpage

It's slowly dawning on me that many of the ingredients for an alternative different way to browse scientific articles may already be in place. After my first crude efforts at what an iPad reader might look like I've started afresh with a new attempt, based on the Sencha Touch framework. The goal here isn't to make a polished app, but rather to get a sense of what could be done.

The first goal is to be able to browse the literature as if it was a connected series of documents (which is what, of course, it is). This requires taking the full text of an article, extracting the citations, and making them links to further documents (also with their citations extracted, and so on). Leaving aside the obvious problem that this approach is limited to open access articles, an app that does this is going to have to store a lot of bibliographic data as the reader browses the literature (otherwise we going to have to do all the processing on the fly, and that's not going to be fast enough). So, we need some storage.

MySQL
One option is to write a MySQL database to hold articles, books, etc. Doable (I've done more of these than I care to remember), but things get messy pretty quickly, especially as you add functionality (tags, fulltext, figures, etc.).

RDF
Another option is to use RDF and a tripe store. I've played with linked data quite a bit lately (see previous "Friday follies" here and here), and I thought that a triple store would be a great way support an article browser (especially as we add additional kinds of data, such as sequences, specimens, phylogenies, etc.). But linked data is a mess. For the things I care about there are either no canonical identifiers, or too many, and rarely does the primary data provider served linked data compliant URLs (e.g., NCBI), hence we end up with a plethora of wrappers around these sources. Then there's the issue of what vocabularies to use (once again, there are either none, or too many). As a query language SPARQL isn't great, and don't even get me started on the issue of editing data. OK, so I get the whole idea of linked data, it's just that the overhead of getting anything done seems too high. You've got to get a lot of ducks to line up.

CounchDB

So, I started playing with CounchDB, in a fairly idle way. I'd had a look before, but didn't really get my head around the very different way of querying a database that CouchDB requires. Despite this learning curve, CouchDB has some great features. It stores documents in JSON, which makes it trivial to add data as objects (instead of mucking around with breaking them up into tables for SQL, or atomising them into triples for RDF), it supports versioning right out of the box (vital because metadata is often wrong and needs to be tidied up), and you talk to it using HTTP, which means no middleware to get in the way. You just point your browser (or curl, or whatever HTTP tool you have) and send GET, POST, PUT, or DELETE commands. And now it's in the cloud.

In some ways ending up with CouchDB (or something similar) seems inevitable. The one "semantic web" tool that I've made most use of is Semantic MediaWiki, which powers the NCBI to Wikipedia mapping I created in June. Semantic Mediawiki has it's uses, but occasionally it has driven me to distraction. But, when you get down to it, Semantic Mediawiki is really just a versioned document store (where the documents are typically key-value pairs), over which have been laid a pretty limited query language and some RDF export features. Put like this, most of the huge Mediawiki engine underlying Semantic MediaWiki isn't needed, so why not cut to the chase and use a purpose-built versioned document store? Enter CouchDB.

Browsing and Mendeley
So, what I have in mind is a browser that crawls a document, extracting citations, and enabling the reader to explore those. Eventually it will also extract all the other chocolatey goodness in an article (sequences, specimens, taxonomic names, etc.), but for now I'm focussing on articles and citations. A browser would need to store article metadata (say, each time it encounters an article for the first time), as well as update existing metadata (by adding missing DOIs, PubMed ids, citations, etc.), so what easier way than as JSON in a document store such as CouchDB? This is what I'm exploring at the moment, but let's take a step back for a second.

The Mendeley API, as poorly developed as it is, could be treated as essentially a wrapper around a JSON document store (the API stores and returns JSON), and speaks HTTP. So, we could imagine a browser that crawls the Mendeley database, adding papers that aren't in Mendeley as it goes. The act of browsing and reading would actively contribute to the database. Of course, we could spin this around, and argue that a crawler + CouchDB could pretty effectively create a clone of Mendeley's database (albeit without the social networking features that come with have a large user community).

This is another reason why the current crop of iPad article viewers, Mendeley's included, are so disappointing. There's the potential to completely change the way we interact with the scientific literature (instead of passively consuming PDFs), and Mendeley is ideally positioned to support this. Yes, I realise that for the vast majority of people being able to manage their PDFs and format bibliographies in MS Word are the killer features, but, seriously, is that all we aspire too?

Friday, September 03, 2010

Viewing scientific articles on the iPad: browsing articles

In previous articles I've looked at how various apps display scientific articles. The apps I looked at were:

So, where next? As Ian Mulvany noted in a comment on an earlier post, I haven't attempted to summarise the best user interface metaphors for navigation. Rather than try and do that in the abstract, I'd like to create some prototypes to play with various ideas. The Sencha Touch framework looks a good place to start. It's web-based, so things can be prototyped rapidly (I'm not going to learn Objective C anytime soon). There's a moderately steep learning curve, unless you've written a lot of Javascript (I've done some, but not a lot), but it seems to offer a lot of functionality. Another advantage of developing a web app is that it keeps the focus on making the content accessible across devices, and using the web as the means to display and interact with content.

Then there is also the issue (in addition to displaying an individual article) of how to browse and find articles to view. Here are some possibilities.

Publisher's stream
Apps such as the Nature app and the PLos Reader provide you with a stream of articles from a single publisher. This is obviously a bit limiting for the reader, but might have some advantages if the publisher has specifically enhanced their content for devices such as the iPad.

Personal library
Apps such as Mendeley and Papers provide articles from your personal library. These are papers you care about, and one you may make active use of.

Social
Social readers such as Flipboard show the power of bringing together in one place content derived from social streams, such as Twitter and Facebook, as well as curated sources and publisher streams. Mendeley and other social bookmarking services (e.g., CiteULike, Connotea) could be used to provide social similar streams of papers for an article viewer. Here the goal is probably to find out what papers people you know find interesting.

Spatial

In an earlier post I used a map to explore papers in my BioStor archive. This would be an obvious thing to add to an iPad app, especially as the iPad knows where you are. Hence, you could imagine browsing papers about areas that are near you, or perhaps by authors near you. This would be useful if, say, you wanted to know about ecological or health studies of the area you live in. If the geographic search was for people rather than papers, you could easily discovering what kind of research is published by universities or other research bodies that are near your current location.

Of course, Earth is not the only thing we can explore spatially. Google maps can display other bodies in the solar system, (e.g., Mars), as well as the night sky. Imagine being interested in astronomy and being able to browse papers about specific planetary or stellar objects. Likewise, genomes can be browsed using Google maps-inspired browsers (e.g., jBrowse), so we could have an app where you could easily retrieve articles about a particular gene or other region of a genome.

Categories
Another way to browse content is by topic. Classifying knowledge into categories is somewhat fraught, but there are some obvious wasy this could be useful. A biologist might want to navigate content by taxonomic group, particularly if they want to browse through the 1000's of articles published in a journal such as Zootaxa (hence my experiments on browsing EOL). Of course, a tree is not the only way to navigate hierarchical content. Treemaps are another example, and I've played with various versions in the past (see here and here).

I have a love-hate relationship with treemaps, but some of the most interesting work I've seen on treemaps has been motivated by displaying information on small screens, e.g. "Using treemaps to visualize threaded discussion forums on PDAs" (doi:10.1145/1056808.1056915).

Summary
These notes list some of the more obvious ways to browse a collection of articles. It would be fun to explore these (and other approaches) in parallel with thinking about how to display the actual articles. These two issues are related, in the sense that the more metadata we can extract from the articles (such as keywords, taxonomic names and other named entities, geographic localities, etc.) the richer the possibilities for finding our way through those articles.