Friday, March 26, 2010

TreeBASE II has been released


The TreeBASE team have announced that TreeBASE II has been released. I've put part of the announcement on the SSB web site. Given that TreeBASE and I have history, I think it best to keep quiet and see what others think before blogging about it in detail. Plus, there's a lot of new features to explore. Take it for a spin and see what you think.

Thursday, March 25, 2010

Wiki frustration

Yesterday I fired off a stream of tweets, starting with:
OK, I'm back to hatin' on the wikis. It's just way to hard to do useful queries (e.g., anything that requires a path linking some entities) [10991434609]

Various people commented on this, either on twitter or in emails, e.g.:
@rdmpage hatin' wikis is irrational - all methods have pros and cons, so wikis are an essential resource among others [@stho002]

So, to clarify, I'm not abandoning wikis. I'm just frustrated with the limitations of Semantic Mediawiki (SMW). Now, SMW is a great piece of software with some cool features. For example,

  1. Storing data in Mediawiki templates (i.e., key-value pairs) makes it rather like a JSON database with version control (shades of CouchDB et al.).

  2. Having Mediawiki underlying SMW makes lots of nice editing and user management features available for free.

  3. The template language makes it relatively easy to format pages.

  4. It is easy to create forms for data entry, so users aren't confronted with editing Mediawiki templates unless they want to.

  5. Supports basic queries.


It's the last item on the list that is causing me grief. The queries are, well, basic. What SMW excels at are queries connecting one page to another. For example, if I create wiki pages for publications, and list the author of each publication, the page for an author can contain a simple query (of the form {{ #ask: [[maker::Author 1]] | }}) that lists all the publications of that author:

q1.png


That's great, and many of the things I want to do can be expressed in this simple way (i.e., find all pages of a certain kind that link to the current page). It's when I want to go more than one page away that things start to go pear-shaped. For example, for a given taxon I want to display a map of where it is found, based on things like georeferenced GenBank sequences, or sequences from georeferenced museum specimens.

q2.png
This can involve a path between several entities ("pages" in SMW), and this just doesn't seem to be possible. I've managed to get some queries working via baroque Mediawiki templates, but it's getting to the point where hours seem to be wasted trying to figure this stuff out.

So, what tends to happen is I develop something, hit this brick wall, then go and do something else. I suspect that one way forward is to use SMW as the tool to edit data and display basic links, then use another tool to do the deeper queries. This is a bit like what I'm exploring with BioStor, where I've written my own interface and queries over a simple MySQL database, but I'm looking into SMW + Wikisource to handle annotation.

This leaves the question of how to move forward with http://iphylo.org/treebase/. One approach would be to harvest the SMW pages regularly (e.g., by consuming the RSS feed, and either pulling off the SMW RDF or parsing the template source code for the pages), then use this to populate a database (say, a key-value store or a triple store) where more sophisticated queries can be developed. I guess one could either make this a separate interface, or develop it as an SMW extension, so results could be viewed within the wiki pages. Both approaches have merits. Having a completely separate tool that harvests the phylogeny wiki seems attractive, and in many ways is an obvious direction for iSpecies to take. Imagine iSpecies rebuilt as an RDF aggregator, where all manner of data about a taxon (or sequence, or specimen, or publication, or person) could be displayed in one place, but the task of data cleaning took place elsewhere.
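
To make the harvesting idea concrete, here is a rough Python sketch: poll the wiki's recent-changes feed, then pull each changed page's RDF (via Semantic MediaWiki's Special:ExportRDF) into an rdflib graph that could be queried with SPARQL. The base URL and index.php path layout are assumptions about how the wiki is configured; the recent-changes feed and ExportRDF page are standard MediaWiki/SMW features.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

from rdflib import Graph

WIKI = "http://iphylo.org/treebase"   # assumed base URL and path layout

def recently_changed_titles():
    # MediaWiki exposes recent changes as an RSS feed
    url = WIKI + "/index.php?title=Special:RecentChanges&feed=rss"
    feed = urllib.request.urlopen(url).read()
    return [item.findtext("title") for item in ET.fromstring(feed).iter("item")]

def harvest(graph):
    # Semantic MediaWiki serialises a page's annotations as RDF/XML via Special:ExportRDF
    for title in recently_changed_titles():
        page = urllib.parse.quote(title.replace(" ", "_"))
        graph.parse(WIKI + "/index.php?title=Special:ExportRDF/" + page, format="xml")

g = Graph()
harvest(g)   # g can now be queried with SPARQL, or dumped into a triple store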

Food for thought. And, given that it seems some people are wondering why on earth I bother with this stuff, and can't I just finish TreeView X, I could always go back to fussing with C++…

Wednesday, March 24, 2010

DjVu XML to HTML

This post is simply a quick note on some experiments with DjVu that I haven't finished. Much of BHL's content is available as DjVu files, which contain both the scanned images and OCR text, complete with co-ordinates of each piece of text. This means that it would, in principle, be trivial to lay out the bounding boxes of each text element on a web page. Reasons for doing this include:

  1. To support Chris Freeland's Holy Grail of Digital Legacy Taxonomic Literature, where users can select text overlaid on a BHL scan image.

  2. Developing a DjVu viewer along the lines of Google's very clever Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?).

  3. Highlighting search results on a BHL page image (by highlighting the boxes containing terms the user was searching for).


As an example, here is a BHL page image:



and here are the bounding boxes of the text recognised by OCR overlain on the page image:

and here are the bounding boxes of the text recognised by OCR without the page image:

The HTML is generated using an XSL transformation that takes two parameters, an image name and a scale factor, where 1.0 generates HTML at the same size as the original image (which may be rather large). The views above were generated with a scale of 0.1. The XSL is here:


<?xml version='1.0' encoding='utf-8'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html" version="1.0" encoding="utf-8" indent="yes"/>

<xsl:param name="scale"/>
<xsl:param name="image"/>

<xsl:template match="/">
<xsl:apply-templates select="//OBJECT"/>
</xsl:template>

<xsl:template match="//OBJECT">
<div>
<xsl:attribute name="style">
<xsl:variable name="height" select="@height"/>
<xsl:variable name="width" select="@width"/>
<xsl:text>position:relative;</xsl:text>
<xsl:text>border:1px solid rgb(128,128,128);</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="$width * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="$height * $scale"/>
<xsl:text>px;</xsl:text>
</xsl:attribute>

<img>
<xsl:attribute name="src">
<xsl:value-of select="$image"/>
</xsl:attribute>
<xsl:attribute name="style">
<xsl:variable name="height" select="@height"/>
<xsl:variable name="width" select="@width"/>
<xsl:text>margin:0px;padding:0px;</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="$width * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="$height * $scale"/>
<xsl:text>px;</xsl:text>
</xsl:attribute>
</img>

<xsl:apply-templates select="//WORD"/>

</div>
</xsl:template>

<xsl:template match="//WORD">
<div>
<xsl:attribute name="style">
<xsl:text>position:absolute;</xsl:text>
<xsl:text>border:1px solid rgb(128,128,128);</xsl:text>
<xsl:variable name="coords" select="@coords"/>
<xsl:variable name="minx" select="substring-before($coords,',')"/>
<xsl:variable name="afterminx" select="substring-after($coords,',')"/>
<xsl:variable name="maxy" select="substring-before($afterminx,',')"/>
<xsl:variable name="aftermaxy" select="substring-after($afterminx,',')"/>
<xsl:variable name="maxx" select="substring-before($aftermaxy,',')"/>
<xsl:variable name="aftermaxx" select="substring-after($aftermaxy,',')"/>
<xsl:variable name="miny" select="substring-after($aftermaxy,',')"/>

<xsl:text>left:</xsl:text>
<xsl:value-of select="$minx * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="($maxx - $minx) * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>top:</xsl:text>
<xsl:value-of select="$miny * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="($maxy - $miny) * $scale"/>
<xsl:text>px;</xsl:text>

</xsl:attribute>

<!-- actual text -->
<!-- <xsl:value-of select="." /> -->
</div>
</xsl:template>

</xsl:stylesheet>

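For what it's worth, here is how the transformation can be driven from Python with lxml (the file names are just placeholders; xsltproc with --param/--stringparam would do the same job from the command line):

from lxml import etree

# Load the stylesheet above (saved as djvu2html.xsl) and the DjVu XML for one page
transform = etree.XSLT(etree.parse("djvu2html.xsl"))
page = etree.parse("page_0017.xml")

# scale is a numeric XPath expression; image must be passed as a string parameter
html = transform(page, scale="0.1", image=etree.XSLT.strparam("page_0017.jpg"))

with open("page_0017.html", "w") as f:
    f.write(str(html))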

TreeView and Windows Vista


Continuing the theme of ancient programs of mine still being used, I've been getting reports that the Windows version of TreeView won't install on Windows Vista and Windows 7. As with NDE, it's the installer that seems to be causing the problem. I've put a new installer on the TreeView web page (direct link here).

TreeView still seems to be used quite a bit, judging from responses to my question on BioStar, even though there are more modern alternatives. I made a brief attempt to create a replacement, namely TreeView X, but it lacks much of the functionality of the original. I keep meaning to revisit TreeView development, as it's been very good to me.

Tuesday, March 23, 2010

BHL Tech Report

Chris Freeland tweeted about the presentation he gave to a recent BHL meeting, the slides for which are reproduced below:


It makes interesting viewing. From my own (highly biased) perspective slide 10 is especially interesting, in that BioStor, my tool for locating articles in BHL, is the 8th largest source of traffic for BHL - 2.46% of visitors to BHL come via BioStor. By way of comparison, the largest single referrer for the same period is Tropicos (11.14%), with EOL third at 6.45% and Wikipedia fourth at 5.56%. Obviously this is a small sample, but it is vaguely encouraging. It is nice to see Wikipedia being a significant source of links. EOL has dropped significantly from being the largest referrer (22.51%) in 2008-2009 to third (6.45%) for 2010 to date. In one sense this may reflect BHL achieving greater visibility independently of EOL. I also wonder whether it might reflect the unsatisfactory way EOL displays BHL results. If the BHL results were grouped (for example, by article), then I suspect they may provide more enticing links for EOL users to click on (especially, for example, if the articles were flagged by whether they included the original description of a taxon). It would be fun to know what EOL users actually do (e.g., what fraction of EOL visitors to a given page click on BHL links?).

Friday, March 19, 2010

Where next for BHL?

You can't just ask customers what they want and then try to give that to them. By the time you get it built, they'll want something new. - Steve Jobs

It's Friday, so time for either a folly or a rant. BHL have put another user survey into the field (http://www.surveymonkey.com/s/BHLsurvey). I loathe user surveys. They don't ask the questions I would ask, and then, when you see the results, often the most interesting suggestions are ignored (see the Evaluation of the User Requirement Survey Oct-Nov 2009). And we've been here before, with EDIT (see this TAXACOM message about the moribund Virtual Taxonomic Library). Why go to the trouble of asking users if you aren't going to deliver?

I suspect surveys exist not to genuinely help figure out what to do, but as an internal organisational tool to convince programmers what needs to be done, especially in large, multinational consortia where the programmers might be in a different institution, and don't have any particular vested interest in the project (if they did, they wouldn't need user surveys, they'd be too busy making stuff to change the world).

So, what should BHL be doing? There's lots of things to do, but for me the core challenges are findability and linkage. BHL needs to make its content more findable, both in terms of bibliographic metadata and search terms (e.g., taxa, geographic places). It also needs to be much more strongly linked, both internally (e.g., cross referencing between articles where one BHL article cites another BHL article), and externally (to the non-BHL literature, for example, and to nomenclators), and the external links need to be reciprocal (BHL should link to nomenclators, and nomenclators should point back to BHL).

There are immediate benefits from improved linkage. Users could navigate within BHL content by citation links, for example, in the same way we can in the recent literature. If BHL cleaned up its metadata and had a robust article-level OpenURL resolver it could offer services to publishers to add additional links to their content, driving traffic to BHL itself. Better findability leads to better links.
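
As a concrete illustration of the kind of link an article-level resolver makes possible, here is a small Python sketch that builds a standard OpenURL (Z39.88-2004, KEV format) query for a journal article. The resolver base URL is a placeholder for illustration, not a documented BHL endpoint.

from urllib.parse import urlencode

RESOLVER = "http://biostor.org/openurl"   # placeholder resolver address

def article_openurl(jtitle, atitle, volume, spage, year):
    # standard OpenURL 1.0 key/value pairs for a journal article
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.jtitle": jtitle,
        "rft.atitle": atitle,
        "rft.volume": volume,
        "rft.spage": spage,
        "rft.date": year,
    }
    return RESOLVER + "?" + urlencode(params)

# a publisher could attach the resulting URL to each reference in an article's bibliography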

One major impediment to improving things is the quality of the OCR text extracted from BHL scans. There have been various automated attempts to extract metadata from OCR scans (e.g., "A metadata generation system for scanned scientific volumes" doi:10.1145/1378889.1378918), but these have met with mixed success. There's a lot of scope for improving this, but I suspect a series of grad student theses on this topic may not be the way forward (grad students rarely go all the way and develop something that can be deployed). Which leaves crowd sourcing. Given the tools already available for correcting Internet Archive-derived book scans (e.g., Wikisource discussed in an earlier post), it seems to me the logical next move for BHL is to dump all their content into a Wikisource-style environment, polish the tools and interface a bit, and encourage the community to have at it. Forming and nurturing that community will be a challenge, but providing BHL can demonstrate some clear benefits (e.g., generating clean pages with new taxon names, annotated illustrations, OpenURL tools for publishers to use), then I think the task isn't insurmountable. It just needs some creativity (e.g., why not engage EOL users who land on BHL content to go one step further and clean it up, or link with Wikipedia and Wikispecies to attract users interested in actively contributing?).

I doubt any of this will be in any user survey...

Tuesday, March 16, 2010

Progress on a phylogeny wiki

I've made some progress on a wiki of phylogenies. Still much to do, but here are some samples of what I'm trying to do.

First up, here's an example of a publication http://iphylo.org/treebase/Doi:10.1016/j.ympev.2008.02.021:
wiki1.png

In addition to basic bibliographic details we have links to GenBank sequences and a phylogeny. The sequences are georeferenced, which enables us to generate a map. At a glance we see that the study area is Central America.

This study published the following tree:
wiki2.png

The tree is displayed using my tvwidget. A key task in constructing the wiki is mapping labels used in TreeBASE to other taxonomic names, for example, those in the NCBI taxonomy database. This is something I first started working on in the TbMap project (doi:10.1186/1471-2105-8-158). In the context of this wiki I'm explicitly mapping TreeBASE taxa to NCBI taxa. Taxa are modelled as "namestrings" (simple text strings), OTUs (TreeBASE taxa), and taxonomic concepts (sets of observations or taxa). For example, the tree shown above has numerous samples of the frog Pristimantis ridens, each with a unique label (namestring) that includes, in this case, voucher specimen information (e.g., "Pristimantis ridens La Selva AJC0522 CR"). Each of these labels is mapped to the NCBI taxon Pristimantis ridens.
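
The mapping itself is largely mechanical. A minimal Python sketch (the name set below is just a stand-in for a lookup against the NCBI taxonomy) strips the voucher and locality text and keeps the leading binomial:

import re

NCBI_NAMES = {"Pristimantis ridens"}   # stand-in for a real NCBI name lookup

def namestring_to_ncbi(label):
    # "Pristimantis ridens La Selva AJC0522 CR" -> "Pristimantis ridens"
    m = re.match(r"[A-Z][a-z]+ [a-z][a-z-]+", label)
    if m and m.group(0) in NCBI_NAMES:
        return m.group(0)
    return None   # no match: the OTU needs a manual mapping in the wiki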

One thing I'm interested in doing is annotating the tree. Eventually I hope to generate (or make it easy to generate) things such as Google Earth phylogenies (via georeferenced sequences and specimens). For now I'm playing with generating nicer labels for the terminal taxa. As it stands if you download the original tree from TreeBASE you have the original study-specific labels (e.g., "Pristimantis ridens La Selva AJC0522 CR"), whereas it would be nice to also have taxonomic names (for example, if you wanted to combine the tree or data with another study). Below the tree you'll see a NEXUS NOTES block with the "ALTTAXNAMES" command. The program Mesquite can use this command to enable users to toggle between different labels, so that you can have either a tree like this:
wiki3.png

or a tree like this:
wiki4.png

Monday, March 15, 2010

How Wikipedia can help scope a project

I'm revisiting the idea of building a wiki of phylogenies using Semantic Mediawiki. One problem with a project like this is that it can rapidly explode. Phylogenies have taxa, which have characters, nucleotide sequences and other genomic data, and names, and come from geographic locations, and are collected and described by people, who may deposit samples in museums, and also write papers, which are published in journals, and so on. Pretty soon, any decent model of a phylogeny database is connected to pretty much anything of interest in the biological sciences. So we have a problem of scope. At what point do we stop adding things to the database model?

It seems to me that Wikipedia can help. Once we hit a topic that exists in Wikipedia, then we can stop. It's a reasonable bet that either now, or at some point in the future, the Wikipedia page is likely to be as good as, or better than, anything a single project could do. Hence, there's probably not much point storing lots of information about genes, countries, geographic regions, people, journals, or even taxa, as Wikipedia has these. This means we can focus on gluing together the core bits of a phylogenetic study (trees, taxa, data, specimens, publications) and then link these to Wikipedia.
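
The test itself is cheap: ask the MediaWiki API whether a page with that title exists. A Python sketch (the User-Agent string is arbitrary):

import json
import urllib.parse
import urllib.request

def in_wikipedia(title):
    # ask the English Wikipedia whether a page with this title exists
    url = ("http://en.wikipedia.org/w/api.php?action=query&format=json&titles="
           + urllib.parse.quote(title))
    req = urllib.request.Request(url, headers={"User-Agent": "scope-check/0.1"})
    data = json.loads(urllib.request.urlopen(req).read().decode("utf-8"))
    page = next(iter(data["query"]["pages"].values()))
    return "missing" not in page   # missing titles come back with a "missing" flag

# if in_wikipedia("Pristimantis ridens") is True, just link to Wikipedia rather than model the frog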

In a sense this is a variation on the ideas explored in EOL, the BBC, and Wikipedia, but in developing my wiki of phylogenies project (this is the third iteration of this project) it's struck me how the question "is this in Wikipedia?" is the quickest way to answer the question "should I add x to my wiki?" Hence, Wikipedia becomes an antidote to feature bloat, and helps define the scope of a project more clearly.

Wednesday, March 03, 2010

Setting up a local Wikisource

A little while ago I came across Wikisource, and it dawned on me that this is a model for BHL. To quote from the Wikisource web site:
Wikisource is an online library of free content publications, collected and maintained by our community. We now have 140,596 texts in the English language library. See our inclusion policy and help pages for information on getting started, and the community portal for ways you can contribute. Feel free to ask questions on the community discussion page, and to experiment in the sandbox.

Much of their content comes from the Internet Archive (as does BHL's), and Wikisource have developed extensions for Mediawiki to do some cool things, such as extract text and images from DjVu files. If you haven't come across DjVu before, it's a format designed to store scanned documents, and comes with some powerful open source tools for extracting images and OCR text. Wikisource can take a DjVu file, extract images, thumbnails and text, creating side-by-side displays where users can edit and correct OCR text:

wikisource.png
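
The DjVu tools mentioned above can also be driven directly from the command line. A rough Python sketch of extracting one page's OCR text and image with DjVuLibre's djvutxt and ddjvu (file names purely illustrative):

import subprocess

def extract_page(djvu_file, page, out_prefix):
    # OCR text for one page (add --detail=word to get per-word coordinates)
    text = subprocess.check_output(
        ["djvutxt", "--page=%d" % page, djvu_file]).decode("utf-8")
    # render the same page as a TIFF image
    subprocess.check_call(
        ["ddjvu", "-format=tiff", "-page=%d" % page, djvu_file,
         "%s-%04d.tif" % (out_prefix, page)])
    return text

# e.g. extract_page("scan.djvu", 17, "page")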


So, like a fool, I decided to try and install some of these tools locally and see if I could do the same for some BHL content. That's when the "fun" started. Normally Mediawiki is pretty easy to set up. There are a few wrinkles because my servers live behind an institutional HTTP proxy, so I often need to tweak some code (such as the OpenID extension, which also needs a fix for PHP 5.3), but installing the extensions that underlie Wikisource wasn't quite so straightforward.

DjVu
djvu.png
The first step is supporting DjVu files in Mediawiki. This seems straightforward (see How to use DjVu with MediaWiki). First off you need the DjVu tools. I use Mac OS X, so I get these automatically if I install DjView. The tools reside in Applications/DjView.app/Contents/bin (you can see this folder if you Control+click on the DjView icon and choose "Show Package Contents"), so adding this path to the name of each DjVu tool Mediawiki needs takes care of that.

But I also need NetPbm, and now the pain starts. NetPbm won't build on Mac OS X, at least not out of the box on Snow Leopard. It makes assumptions about Unix that Mac OS X doesn't satisfy. After some compiler error messages concerning missing variables that I eventually traced to signal.h I gave up and installed MacPorts, which has a working version of NetPbm. MacPorts installed fine, but it's a pain having multiple copies of the same tools, one in /usr/local, and one in /opt/local.

OK, now we can display DjVu files in Mediawiki. It's small victories like this which leads to over confidence...

Proofread Page
Next comes the Proofread Page extension, which provides the editing functionality. This seemed fairly straightforward, although the documentation referred to a SQL file (ProofreadPage.sql) that doesn't seem to exist. More worryingly, the documentation also says:
If you want to install it on your own wiki, you will need to install a 404 handler for generating thumbnails, such as WebStore.

This seems fine, except the page for WebStore states:
The WebStore extension is needed by the ProofreadPage extension. Unfortunately, documentation seems to be missing completely. Please add anything you know about this extension here.

Then there are the numerous statements "doesn't work" scattered through the page. So, I installed the extension and hoped for the best. It didn't work. As in, really, really didn't work. It took an hour or so of potentially fatal levels of blood pressure-inducing frustration to get to the bottom of this.

WebStore
Now, Webstore is a clever idea. Basically, the Proofread Page extension will need thumbnails of images in potentially varying sizes, and creates a link to the image it wants. Since that image doesn't exist on the web site the web server returns 404 Not Found, which normally results in a page like this. Instead, we tell the web server (Apache) that WebStore will handle 404's. If the request is for an image, Webstore creates the image file, streams it to the web browser, then deletes the file from disk. Essentially WebStore creates a web server for images (Webdot uses much the same trick, but without the 404 handler). Debugging a web server called by another web server is tricky (at least for a clumsy programmer like me), but by hacking the Webstore code (and switching on Mediawiki debug logging) I managed to figure out that Webstore seemed to be fetching and streaming the images fine, but they simply didn't appear in the wiki page (I got the little broken image icon instead). I tried alternative ways of dumping the image file to output, adding HTTP headers, all manner of things. Eventually (by accident, no idea how it happened) I managed to get an image URL to display in the Chrome web browser, but it wasn't an image(!) -- instead I got a PHP warning about two methods in the class DjVuHandler (mustRender and isMultiPage) not being consistent with the class they inherit from. WTF?! Eventually I found the relevant file (DjVu.php in includes/media in the Mediawiki folder), added the parameter $file to both methods, and suddenly everything works. At this point I didn't know whether to laugh or cry.

OCR text
There are some issues with the OCR text from Internet Archive DjVu files: extraneous characters (newlines, etc.) that I need to filter out, and I'll probably have to deal with hyphenation. It looks fairly straightforward to edit the proofing extension code to handle these situations.
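
A first pass at that cleanup might look something like this sketch (assumptions: hyphens at line ends mark word breaks, single newlines are soft wraps, blank lines separate paragraphs):

import re

def clean_ocr(text):
    # rejoin words split across lines: "speci-\nmens" -> "specimens"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # single newlines are soft wraps; keep blank lines as paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # squeeze runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()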

Semantic Mediawiki
Having got the proofing extension working, I then wanted to install the Semantic Mediawiki extensions so that I could support basic inference on the wiki. I approached this with some trepidation as there are issues with Mediawiki namespaces, but everything played nice and so far things seem to be working. Now I can explore whether I can combine the proofing tools from Wikisource with the code I've developed for iTaxon.

BioStor
So, having got something working, the plan is to integrate this with BioStor. One model I like is the video site Metacafe. For each video Metacafe has a custom web page (e.g., http://www.metacafe.com/watch/4137093) with an Edit Video Details link that takes you to a Semantic Mediawiki page where you can edit metadata for the video. I envisage doing something similar for BioStor, where my existing code provides a simple view of an article (perhaps with some nice visualisations), with a link to the corresponding wiki page where you can edit the metadata, and correct the OCR text.

Lessons
In the end I got there, although it was a struggle. Mediawiki is a huge, complicated bit of software, and is also part of a larger ecosystem of extensions, so it has enormous power. But there are lots of times when I think it would be easier if I wrote something to replicate the bit of functionality that I want. For example, side-by-side display of text and images would be straightforward to do. But once you start to think about supporting mark-up, user authentication, recording edit history, etc., the idea of using tools others have developed becomes more attractive. And the code is open source, which means if it doesn't work there's a fighting chance I can figure out why, and maybe fix it. It often feels harder than it should be, but I'll find out in the next few days whether yesterday's exertions were worth it.

Wikipedia manuscript

npre20104242-1.thumb.png

I've written up some thoughts on Wikipedia for a short invited review to appear (pending review) in Organisms, Environment, and Diversity (ISSN 1439-6092). The manuscript, entitled "Wikipedia as an encyclopaedia of life" is available as a preprint from Nature Precedings (hdl:10101/npre.2010.4242.1). The opening paragraph is:
In his 2003 essay E O Wilson outlined his vision for an "encyclopaedia of life" comprising "an electronic page for each species of organism on Earth", each page containing "the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits." Although the "quiet revolution” in biodiversity informatics has generated numerous online resources, including some directly inspired by Wilson’s essay (e.g., http://ispecies.org, http://www.eol.org), we are still some way from the goal of having available online all relevant information about a species, such as its taxonomy, evolutionary history, genomics, morphology, ecology, and behaviour. While the biodiversity community has been developing a plethora of databases, some with overlapping goals and duplicated content, Wikipedia has been slowly growing to the point where it now has over 100,000 pages on biological taxa. My goal in this essay is to explore the idea that, largely independent of the efforts of biodiversity informatics and well-funded international efforts, Wikipedia (http://en.wikipedia.org/wiki/Main_Page) has emerged as potentially the best platform for fulfilling E O Wilson’s vision.

The content will be familiar to readers of this blog, although the essay is perhaps a slightly more sober assessment of Wikipedia than some of my blog posts would suggest. It was also the first manuscript I'd written in MS Word for a while (not a fun experience), and the first ever for which I'd used Zotero to manage the bibliography (which worked surprisingly well).