Wednesday, March 03, 2010

Setting up a local Wikisource

A little while ago I came across Wikisource, and it dawned on me that this is a model for BHL. To quote from the Wikisource web site:
Wikisource is an online library of free content publications, collected and maintained by our community. We now have 140,596 texts in the English language library. See our inclusion policy and help pages for information on getting started, and the community portal for ways you can contribute. Feel free to ask questions on the community discussion page, and to experiment in the sandbox.

Much of their content comes from the Internet Archive (as does BHL's), and Wikisource have developed extensions for Mediawiki to do some cool things, such as extract text and images from DjVu files. If you haven't come across DjVu before, it's a format designed to store scanned documents, and comes with some powerful open source tools for extracting images and OCR text. Wikisource can take a DjVu file, extract images, thumbnails and text, and create side-by-side displays where users can edit and correct OCR text.


So, like a fool, I decided to try and install some of these tools locally and see if I could do the same for some BHL content. That's when the "fun" started. Normally Mediawiki is pretty easy to set up. There are a few wrinkles because my servers live behind an institutional HTTP proxy, so I often need to tweak some code (such as the OpenID extension, which also needs a fix for PHP 5.3), but installing the extensions that underlie Wikisource wasn't quite so straightforward.

The first step is supporting DjVu files in Mediawiki. This seems straightforward (see How to use DjVu with MediaWiki). First off you need the DjVu tools. I use Mac OS X, so I get these automatically if I install DjView. The tools reside in Applications/ (you can see this folder if you Control+click on the DjView icon and choose "Show Package Contents"), so adding this path to the name of each DjVu tool Mediawiki needs takes care of that.

But I also need NetPbm, and now the pain starts. NetPbm won't build on Mac OS X, at least not out of the box on Snow Leopard. It makes assumptions about Unix that Mac OS X doesn't satisfy. After some compiler error messages concerning missing variables that I eventually traced to signal.h I gave up and installed MacPorts, which has a working version of NetPbm. MacPorts installed fine, but it's a pain having multiple copies of the same tools, one in /usr/local, and one in /opt/local.
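To make the role of the two toolkits concrete, here is a minimal Python sketch of the kind of commands involved: ddjvu (from the DjVu tools) renders a page to PPM, pnmtojpeg (from NetPbm) turns that into a JPEG, and djvutxt pulls out the OCR text. The directory names, function names and file names are all hypothetical, and the option spellings are my reading of the DjVuLibre man pages, so check them against your installed versions. This is not the code Mediawiki itself runs.

```python
from pathlib import Path

# Hypothetical install locations: point these at wherever the tools live.
DJVU_BIN = Path("/path/to/djvu/bin")    # placeholder for the DjView tools folder
NETPBM_BIN = Path("/opt/local/bin")     # MacPorts prefix, assumed for NetPbm


def render_page_commands(djvu_file, page, width, height, out_jpeg):
    """Build the two commands needed to turn one DjVu page into a JPEG.

    Returns (render, convert): ddjvu writes an intermediate PPM file, and
    pnmtojpeg reads that file and writes JPEG data to standard output.
    """
    ppm = f"{out_jpeg}.ppm"
    render = [
        str(DJVU_BIN / "ddjvu"),
        "-format=ppm",
        f"-page={page}",
        f"-size={width}x{height}",
        djvu_file,
        ppm,
    ]
    convert = [str(NETPBM_BIN / "pnmtojpeg"), ppm]
    return render, convert


def text_command(djvu_file, page):
    """Build the command that prints the OCR text layer of one page."""
    return [str(DJVU_BIN / "djvutxt"), f"-page={page}", djvu_file]
```

To actually run these you would pass each list to subprocess.run, redirecting the standard output of the pnmtojpeg command into the JPEG file. Building the commands as lists rather than shell strings avoids quoting problems with file names that contain spaces.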

OK, now we can display DjVu files in Mediawiki. It's small victories like this that lead to overconfidence...

Proofread Page
Next comes the Proofread Page extension, which provides the editing functionality. This seemed fairly straightforward, although the documentation referred to a SQL file (ProofreadPage.sql) that doesn't seem to exist. More worryingly, the documentation also says:
If you want to install it on your own wiki, you will need to install a 404 handler for generating thumbnails, such as WebStore.

This seems fine, except the page for WebStore states:
The WebStore extension is needed by the ProofreadPage extension. Unfortunately, documentation seems to be missing completely. Please add anything you know about this extension here.

Then there are the numerous statements "doesn't work" scattered through the page. So, I installed the extension and hoped for the best. It didn't work. As in, really, really didn't work. It took an hour or so of potentially fatal levels of blood pressure-inducing frustration to get to the bottom of this.

Now, WebStore is a clever idea. The Proofread Page extension needs thumbnails of page images in varying sizes, so it creates a link to the image it wants. Since that image doesn't exist on the web site, the web server returns 404 Not Found, which normally results in an error page. Instead, we tell the web server (Apache) that WebStore will handle 404s. If the request is for an image, WebStore creates the image file, streams it to the web browser, then deletes the file from disk. Essentially WebStore creates a web server for images (Webdot uses much the same trick, but without the 404 handler).

Debugging a web server called by another web server is tricky (at least for a clumsy programmer like me), but by hacking the WebStore code (and switching on Mediawiki debug logging) I managed to figure out that WebStore seemed to be fetching and streaming the images fine, but they simply didn't appear in the wiki page (I got the little broken image icon instead). I tried alternative ways of dumping the image file to output, adding HTTP headers, all manner of things. Eventually (by accident, no idea how it happened) I managed to get an image URL to display in the Chrome web browser, but it wasn't an image(!). Instead I got a PHP warning about two methods in the class DjVuHandler (mustRender and isMultiPage) not being consistent with the class they inherit from. WTF?! Eventually I found the relevant file (DjVu.php in includes/media in the Mediawiki folder), added the parameter $file to both methods, and suddenly everything worked. At this point I didn't know whether to laugh or cry.
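The 404-handler idea is easier to see in a few lines of code. The Python sketch below is not WebStore's implementation (which is PHP); it only shows the decision a handler of this kind has to make when a missing URL arrives: is this a thumbnail request, and if so, which file, page and width does it ask for? The URL pattern (a "thumb" directory, then the source file name, then an optional "pageN-" prefix and a "WIDTHpx-" prefix) is my assumption about how Mediawiki names thumbnails, and the names ThumbRequest, parse_thumb_path and handle_missing are made up for illustration.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ThumbRequest(NamedTuple):
    source: str   # name of the original file, e.g. "Book.djvu"
    page: int     # page number (1 when the URL does not name a page)
    width: int    # requested width in pixels


# Assumed thumbnail URL shape: .../thumb/<dirs>/<source>/[pageN-]<W>px-<name>
THUMB_RE = re.compile(
    r"/thumb/(?:[^/]+/)*(?P<source>[^/]+)/"
    r"(?:page(?P<page>\d+)-)?(?P<width>\d+)px-[^/]+$"
)


def parse_thumb_path(path: str) -> Optional[ThumbRequest]:
    """Return the thumbnail being asked for, or None if it isn't one."""
    match = THUMB_RE.search(path)
    if match is None:
        return None
    page = int(match.group("page")) if match.group("page") else 1
    return ThumbRequest(match.group("source"), page, int(match.group("width")))


def handle_missing(path: str) -> Tuple[int, Optional[ThumbRequest]]:
    """Decide what to do with a URL the web server could not find.

    A real handler would render the image here, stream it to the browser
    and delete the temporary file; this sketch only reports the decision.
    """
    request = parse_thumb_path(path)
    if request is None:
        return 404, None   # genuinely missing: show the usual error page
    return 200, request    # a thumbnail we can generate on demand
```

One design consequence worth noting: because the handler answers on behalf of the web server, any warning or error it prints ends up inside what the browser expects to be image data, which is consistent with the broken image icon described above.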

OCR text
There are some issues with OCR text from Internet Archive DjVu files. There are some extraneous characters (new lines, etc.) that I need to filter, and I'll probably have to deal with hyphenation. It looks fairly straightforward to edit the proofing extension code to handle these situations.
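As a rough idea of what that filtering might look like, here is a small Python sketch (the extension itself is PHP, so this is an illustration of the logic rather than a patch). The function name clean_ocr_text is hypothetical, and the rules are assumptions about what counts as "extraneous": stray control characters become spaces, words hyphenated across a line break are rejoined, and single line breaks inside a paragraph are collapsed.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy OCR text extracted from a DjVu file for display and editing."""
    # Normalise Windows and old Mac line endings to plain newlines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Replace remaining control characters (everything except tab and
    # newline) with a space so that neighbouring words are not fused.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)

    # Rejoin words split by a hyphen at the end of a line. Caveat: this
    # also merges genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)

    # Treat blank lines as paragraph breaks, then collapse the whitespace
    # (including single newlines) inside each paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

For example, a fragment such as "The speci-" followed by "mens were col-" and "lected in" on separate lines comes back as a single run of text with "specimens" and "collected" restored. The hyphenation rule is deliberately naive; a dictionary lookup would be needed to tell a line-break hyphen from a real one.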

Semantic Mediawiki
Having got the proofing extension working, I then wanted to install the Semantic Mediawiki extensions so that I could support basic inference on the wiki. I approached this with some trepidation as there are issues with Mediawiki namespaces, but everything played nice and so far things seem to be working. Now I can explore whether I can combine the proofing tools from Wikisource with the code I've developed for iTaxon.

So, having got something working, the plan is to integrate this with BioStor. One model I like is the video site Metacafe. For each video Metacafe has a custom web page with an Edit Video Details link that takes you to a Semantic Mediawiki page where you can edit metadata for the video. I envisage doing something similar for BioStor, where my existing code provides a simple view of an article (perhaps with some nice visualisations), with a link to the corresponding wiki page where you can edit the metadata and correct the OCR text.

In the end I got there, although it was a struggle. Mediawiki is a huge, complicated bit of software, and is also part of a larger ecosystem of extensions, so it has enormous power. But there are lots of times when I think it would be easier if I wrote something to replicate the bit of functionality that I want. For example, side-by-side display of text and images would be straightforward to do. But once you start to think about supporting mark-up, user authentication, recording edit history, etc., the idea of using tools others have developed becomes more attractive. And the code is open source, which means if it doesn't work there's a fighting chance I can figure out why, and maybe fix it. It often feels harder than it should be, but I'll find out in the next few days whether yesterday's exertions were worth it.