Showing posts with label javascript. Show all posts

Sunday, August 20, 2017

Notes on displaying big trees using Google Maps/Leaflet

Notes to self on web map-style tree viewers. The basic idea is to use Google Maps or Leaflet to display a tree. Hence we need to compute tiles. One approach is to use a database that supports spatial queries to store the x,y coordinates of the tree. When we draw a tile we compute the coordinates of that tile, based on position and zoom level, do a spatial query to extract all lines that intersect with the rectangle for that tile, and draw those.

A nice example of this is Lifemap (see also De Vienne, D. M. (2016). Lifemap: Exploring the Entire Tree of Life. PLOS Biology, 14(12), e2001624. doi:10.1371/journal.pbio.2001624).

It occurs to me that for trees that aren't too big we could do this without an external database. For example, what if we used a Javascript implementation of an R-tree, such as imbcmdth/RTree or its fork leaflet-extras/RTree? We could compute the coordinates of the nodes in the tree in "geographic" space, store the bounding box for each line/arc in an R-tree, then query that R-tree for lines that intersect with the bounding box of the relevant tile. We could use a clipping algorithm to only draw the bits of the lines that cross the tile itself.
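As a rough sketch of that idea, assuming the tree has been laid out in the unit square and that tiles follow the usual web map scheme (2^z tiles across at zoom z), the tile rectangle, the bounding-box query, and the clipping step might look like the following. All function names are hypothetical, a linear scan stands in for the R-tree search, and Liang-Barsky is just one clipping algorithm that would do the job.

```javascript
// Sketch only: function names are hypothetical, and tree coordinates are
// assumed to be normalised to the unit square [0,1] x [0,1].

// Rectangle covered by tile (x, y) at zoom z: the map is 2^z tiles across.
function tileBounds(x, y, z) {
  const n = Math.pow(2, z);
  return { minX: x / n, minY: y / n, maxX: (x + 1) / n, maxY: (y + 1) / n };
}

// Bounding box of a line segment {x1, y1, x2, y2}.
function segmentBox(s) {
  return {
    minX: Math.min(s.x1, s.x2), minY: Math.min(s.y1, s.y2),
    maxX: Math.max(s.x1, s.x2), maxY: Math.max(s.y1, s.y2)
  };
}

function boxesIntersect(a, b) {
  return a.minX <= b.maxX && a.maxX >= b.minX &&
         a.minY <= b.maxY && a.maxY >= b.minY;
}

// Linear scan standing in for the R-tree search; an R-tree would return
// the same candidates without visiting every segment.
function segmentsInTile(segments, tile) {
  return segments.filter(s => boxesIntersect(segmentBox(s), tile));
}

// Liang-Barsky clipping: the part of segment s inside rectangle r, or null.
function clipSegment(s, r) {
  const dx = s.x2 - s.x1, dy = s.y2 - s.y1;
  const p = [-dx, dx, -dy, dy];
  const q = [s.x1 - r.minX, r.maxX - s.x1, s.y1 - r.minY, r.maxY - s.y1];
  let t0 = 0, t1 = 1;
  for (let i = 0; i < 4; i++) {
    if (p[i] === 0) {
      if (q[i] < 0) return null;          // parallel to this edge and outside it
    } else {
      const t = q[i] / p[i];
      if (p[i] < 0) {                     // entering the rectangle
        if (t > t1) return null;
        if (t > t0) t0 = t;
      } else {                            // leaving the rectangle
        if (t < t0) return null;
        if (t < t1) t1 = t;
      }
    }
  }
  return {
    x1: s.x1 + t0 * dx, y1: s.y1 + t0 * dy,
    x2: s.x1 + t1 * dx, y2: s.y1 + t1 * dy
  };
}
```

For a given tile, `segmentsInTile(allSegments, tileBounds(x, y, z)).map(s => clipSegment(s, tileBounds(x, y, z)))` would give the pieces to draw; a bounding box can overlap the tile even when the line itself misses it, so null results need to be dropped.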

Web maps, at least in my experience, make trips to a tile server to fetch each tile. We would instead want to call a routine within our web page, because all the data would already be loaded into that page. So we'd need to modify the tile-creating code.
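A minimal sketch of such an in-page tile routine, assuming the segments are in unit-square coordinates and that the map library hands us the tile's x, y, and zoom together with a canvas to draw on. The name `drawTile` and its arguments are hypothetical. In Leaflet 1.x the place to call something like this would, as far as I know, be the `createTile` method of a custom `L.GridLayer`; that wiring is not shown here.

```javascript
// Sketch only: drawTile and its arguments are hypothetical names.
// ctx is a canvas 2D context (anything with beginPath/moveTo/lineTo/stroke),
// coords is {x, y, z} for the tile, segments are lines in the unit square.
function drawTile(ctx, coords, segments, tileSize) {
  const n = Math.pow(2, coords.z);
  // Convert unit-square coordinates to pixels relative to this tile's corner.
  const toPixelX = x => (x * n - coords.x) * tileSize;
  const toPixelY = y => (y * n - coords.y) * tileSize;
  ctx.beginPath();
  for (const s of segments) {
    ctx.moveTo(toPixelX(s.x1), toPixelY(s.y1));
    ctx.lineTo(toPixelX(s.x2), toPixelY(s.y2));
  }
  ctx.stroke();
}
```

Because a canvas discards anything drawn outside its own area, explicit clipping is optional with this approach; the spatial query is what keeps the amount of drawing per tile small.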

The ultimate goal would be to have a single page web app that accepts a Newick-style tree and converts it into a browsable, zoomable visualisation.

Thursday, November 24, 2016

The Semantic Web made fun: d3sparql


Continuing my on-again off-again relationship with the Semantic Web, I stumbled across a cool approach to visualising the results of SPARQL queries. Toshiaki Katayama (@tktym) has put together d3sparql, a set of Javascript scripts that takes SPARQL queries and formats the results graphically using D3.

For example, given the SPARQL endpoint http://togostanza.org/sparql, the following query retrieves the NCBI classification for the tardigrade family Hypsibiidae:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX up: <http://purl.uniprot.org/core/>
    SELECT ?root_name ?parent_name ?child_name
    FROM <http://togogenome.org/graph/uniprot>
    WHERE {
      VALUES ?root_name { "Hypsibiidae" }
      ?root up:scientificName ?root_name .
      ?child rdfs:subClassOf+ ?root .
      ?child rdfs:subClassOf ?parent .
      ?child up:scientificName ?child_name .
      ?parent up:scientificName ?parent_name .
    }

By outputting the results as a list of parent-child pairs, it is straightforward to convert the output of this query into a form that D3 accepts, so we can get a tree like this:

(Interactive tree figure: Hypsibiidae at the root, branching into the genera Hebesuncus, Diphascon, Acutuncus, Hypsibius, Borealibius, Thulinius, Isohypsibius, Halobiotus, Ramazzottius, Pseudobiotus, Astatumen, Eremobiotus, Doryphoribius, Itaquascon, Mixibius, and Platicrista, each with its species and unnamed isolates as leaves.)
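The conversion from parent-child pairs to a nested tree can be sketched as below. This is not d3sparql's own code; `buildTree` is a hypothetical helper, and it assumes each result row has already been reduced to plain strings named after the query variables. The nested `{name, children}` shape is the one D3's hierarchy layouts read by default.

```javascript
// Sketch only: buildTree is a hypothetical helper, not part of d3sparql.
// Each row mirrors the query variables: root_name, parent_name, child_name.
function buildTree(rows) {
  const nodes = new Map();
  // Return the node for a name, creating it the first time it is seen.
  const node = name => {
    if (!nodes.has(name)) nodes.set(name, { name: name, children: [] });
    return nodes.get(name);
  };
  for (const row of rows) {
    node(row.parent_name).children.push(node(row.child_name));
  }
  // Every row carries the same root name, so the first row is enough.
  return rows.length ? node(rows[0].root_name) : null;
}
```

Because each child has exactly one parent in the classification, every name is attached once, and the rows can arrive in any order.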

The ability to quickly generate trees, charts, and maps from SPARQL queries makes things a lot easier. We can play around a little and explore things. The strength (and challenge) of SPARQL is that it is very open-ended: you can more or less develop queries to do anything. Being able to visualise the results will help guide that exploration.

The code for d3sparql is on GitHub. One "gotcha" is that the cached examples and external Javascript libraries aren't included. I've forked the repository here and added the missing files, so that if you grab that version it works straight out of the box.

Monday, March 31, 2014

Rethinking annotating biodiversity data

TL;DR By using bookmarklets and a central annotation store, we can build a system to annotate any biodiversity database, and display those annotations on those databases.

A couple of weeks ago I was at a GBIF meeting in Copenhagen, and there was a discussion about adding a new feature to the GBIF portal. The conversation went something like this:

Advisor: "We really need this feature, now!"

Developer: "OK, but which of these other things you've told us we need to do should we stop doing, so we can add this new feature?"

Resources are limited, and adding new features to a project can be difficult. This got me thinking about the issue of annotating data, in GBIF and other biodiversity projects. There have been a number of recent papers on annotating biodiversity data, such as:

Morris, R. A., Dou, L., Hanken, J., Kelly, M., Lowery, D. B., Ludäscher, B., Macklin, J. A., et al. (2013, November 4). Semantic Annotation of Mutable Data. (I. N. Sarkar, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0076093
Tschöpe, O., Macklin, J. A., Morris, R. A., Suhrbier, L., & Berendsohn, W. G. (2013, December 20). Annotating biodiversity data via the Internet. Taxon. International Association for Plant Taxonomy (IAPT). doi:10.12705/626.4

It seems to me that these potentially suffer from the assumption that data aggregators such as GBIF, and data providers such as natural history collections, have sufficient resources in place to (a) implement such systems, and (b) process the annotations made by the community and update their records. What if neither assumption holds true?

Everyone is busy


Any system which requires a project to add another feature is going to have to compete with other priorities. I ran into this with my BioNames project, which was partly funded by EOL. BioNames links taxonomic names for animals (obtained from ION) to the primary literature, for example Pinnotheres atrinicola was published in the following paper:

Page, R. D. M. (1983). Description of a new species of Pinnotheres, and redescription of P. novaezelandiae (Brachyura: Pinnotheridae). New Zealand Journal of Zoology, 10(2), 151–162. doi:10.1080/03014223.1983.10423904.

Ideally, all the links between names and publications that I'd assembled in BioNames would have been added to EOL, so that (wherever possible) users of EOL could see the original description of a taxon in EOL. But this didn't happen. In order to get BioNames into EOL I had to export the data in Darwin Core format, which is poorly suited to this kind of data. It also became clear that BioNames and EOL had rather different data models when it came to taxa, names, and publications. This meant it was going to be a challenge to provide the data in a way that was usable by EOL. Plus, EOL was pretty busy doing other things such as developing TraitBank™ (yes, that's a "™" after TraitBank). So, I never did get BioNames content into EOL.

But there's another way to do this.

The Web means never having to ask for permission


It occurred to me (around about the time that I was at the pro-iBiosphere hackathon at Leiden) that there's another way to tackle this, a way which uses bookmarklets. Bookmarklets are little snippets of Javascript that can be stored as bookmarks in your web browser, and they can add extra functionality to an existing web page. You may well have come across these already, such as Save to Mendeley, or Altmetric it.

How does this help us with annotation? Well, with a little programming, you can add features that you think are "missing" from a web page, and you don't need to ask anyone's permission to do it. So, I could negotiate with EOL about how to get data from BioNames into EOL, or I can simply do this:

(Screenshot: the bookmarklet popup on an EOL page.)

What I've done here is create a bookmarklet that recognises that you are looking at an EOL page. It then calls the BioNames API and displays the original publication of the taxon displayed on the page (in this case, Pinnotheres atrinicola). So, I've added the information from BioNames to the EOL page, without needing EOL to do anything.

But it gets better. We can do this with pretty much any web page. The example above displays the original publication of a taxon name, but imagine we are looking at the publisher's page for that article (you can see it here: http://dx.doi.org/10.1080/03014223.1983.10423904). Wouldn't it be nice if the publisher knew that this paper described a new species of crab? We could negotiate with the publisher about how to give them that information, and how they could display it, or we can just add it:

(Screenshot: the bookmarklet popup on the publisher's page for the article.)

This time the bookmarklet recognises that the web page has a DOI, then asks BioNames whether any names have been published in the paper with that DOI. If it finds any, they are displayed in the popup.
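To make the mechanics concrete, here is a minimal sketch of the two pieces involved: spotting a DOI in a page's URL or text, and the bookmarklet wrapper that loads a script into the current page. It is not the actual bookmarklet. The function names and the `example.org` lookup URL are placeholders I've made up, and the real BioNames API may look quite different.

```javascript
// Sketch only: function names and the lookup URL are hypothetical,
// not the BioNames API.

// Find the first DOI-like string in some text (a URL, or the page's HTML).
function extractDoi(text) {
  const m = /\b10\.\d{4,9}\/[^\s"<>]+/.exec(text);
  // Drop trailing punctuation that commonly follows a DOI in running text.
  return m ? m[0].replace(/[.,;)]+$/, '') : null;
}

// Build the URL the bookmarklet would call to ask for names published
// in the paper with this DOI (placeholder endpoint).
function lookupUrl(doi) {
  return 'https://example.org/api/names?doi=' + encodeURIComponent(doi);
}

// A bookmarklet is a javascript: URL; this one injects a script element so
// that the real work can live in an ordinary, updatable Javascript file.
function makeBookmarklet(scriptUrl) {
  return "javascript:(function(){var s=document.createElement('script');" +
         "s.src='" + scriptUrl + "';document.body.appendChild(s);})();";
}
```

In the injected script, something like `extractDoi(window.location.href)` would be tried first, falling back to the page's HTML if the URL carries no DOI.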

Bookmarklets enable you to enhance a web page with any information you like. This makes them ideal for displaying annotations on a page. If you want to try yourself, you can grab the bookmarklet from here.

Making annotations visible


Bookmarklets can be used to solve one part of the annotation problem, namely showing existing annotations. I have lots of examples of errors in datasets: I blog about some of these, I store some in Evernote for future reference, some end up in unfinished manuscripts, and so on. The problem is that these annotations are of little use to anyone else, because if you go to GBIF you don't see my annotations (or, indeed, anyone else's). But we can use a bookmarklet to display these, without having to pester GBIF themselves to add this feature! Imagine a bookmarklet that you could click on to see whether anyone has queried the identification of a specimen, or the location of a specimen.

Where do the annotations come from?


Of course, all this presupposes that we have annotations to start with. I think there are at least two classes of annotations. The first, most obvious annotations are ones that change or add attributes to an object. For example, adding latitude and longitude coordinates to a specimen. These are annotations we would want to display just on the corresponding page in the source database (e.g., displaying a map in the annotation popup on GBIF for a record we've georeferenced).

The second class comprises cross-links between data sets. For example, linking a species in EOL to the DOI of the publication that first described that species. Or linking a specimen in GBIF to the sequences in GenBank that were obtained from that specimen. These annotations are different in that we might want to display them on multiple web pages (e.g., pages served by both a biodiversity database and an academic publisher). From this perspective, a database like BioNames is essentially a big store of annotations.

But we need more than this: we need to be able to annotate any class of data that is relevant to biodiversity. We need to be able to edit erroneous GBIF records, flag GenBank sequences that have been misidentified, document taxonomic names that are entirely spurious, and so on. And we need to make these annotations available via APIs so that anyone can access them. To me, it seems obvious that we need a single, centralised annotation store.

A global annotation store


One way to implement an annotation store would be to create a wiki-style database that the community could edit. This database gets populated with data that can then be edited, refined, and discussed. For example, imagine a GBIF user spots an occurrence that is clearly wrong (a frog in the middle of the ocean). They could have a bookmarklet that they click on, and it displays any existing annotations of that record. If there aren't any, let's imagine there is a link to the annotation store. Clicking on that creates a record for that occurrence, and the user then edits that. Perhaps they discover that the latitude and longitude have been swapped, so they swap them back, and save the record. The next person to go to that page in GBIF clicks on their bookmarklet and discovers that there is a potential issue with that record (the popup displayed by the bookmarklet will have a "warning symbol", and an updated map).
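Purely as an illustration of that workflow, and not a design for the real thing, an in-memory stand-in for the store might look like this. Every name here is hypothetical; a real store would sit behind a web API and record who edited what and when.

```javascript
// Sketch only: an in-memory stand-in for the annotation store.
// All class, method, and field names are hypothetical.
class AnnotationStore {
  constructor() {
    this.byRecord = new Map(); // record URL -> list of annotations
  }
  add(recordUrl, annotation) {
    const list = this.byRecord.get(recordUrl) || [];
    list.push(annotation);
    this.byRecord.set(recordUrl, list);
  }
  // What the bookmarklet would ask for when a page is displayed.
  find(recordUrl) {
    return this.byRecord.get(recordUrl) || [];
  }
}

// The "swapped coordinates" correction from the example above, expressed as
// an annotation that keeps both the original and the corrected values.
function swapCoordinates(record, author) {
  return {
    author: author,
    type: 'correction',
    note: 'Latitude and longitude appear to be swapped',
    before: { latitude: record.latitude, longitude: record.longitude },
    after: { latitude: record.longitude, longitude: record.latitude }
  };
}
```

Keeping the original values alongside the correction means the source record is never overwritten, which matters if the provider later fixes the record themselves.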

Some annotations will be simple; some may require some analysis. For example, a claim that a GenBank sequence has been misidentified would be stronger if it was backed up by a BLAST analysis that demonstrated that the sequence clustered with taxa that you would not expect based on its putative identification.

We can also annotate in bulk, and upload these annotations directly to the annotation store. For example, we could map GBIF taxa to taxonomic name identifiers from nomenclators such as ION, ZooBank, IPNI, Index Fungorum, etc., then map those identifiers to the primary literature, and upload all of that data to the annotation store, making it available to anyone visiting GBIF (or, indeed, the nomenclators). We could BLAST DNA barcode sequences and suggest potential identifications. We could add lists of publications that cite museum specimen codes, and display those on the GBIF page that corresponds to each code. There is almost no limit to the richness of annotations we could add to existing webpages.

Filtered push


One aspect of annotation that I've glossed over is how the annotations get back to the primary data providers. There has been some work on this (see papers cited at the start), but in a sense I don't think this is the most pressing problem (in part because I suspect most providers are in no position to undertake the kind of data editing and cleaning required). My concern is at the other end of the process. Users of biodiversity data are frequently presented with data that is demonstrably erroneous, and it inconveniences them, as well as hurting the reputation of aggregators such as GBIF, or databases such as GenBank. Anyone doing an analysis of these sorts of data will spend some time cleaning and correcting the data, so we desperately need mechanisms to capture these annotations and make them available to other users. The extent to which these annotations filter back to the primary data providers is, in my view, a less pressing issue.

That said, a central annotation store would have lots of advantages for primary providers. It's one place to go to get annotations. The fate of a user's edits could help develop metrics of reliability of annotations, and so on.

Summary


The reason I find this approach attractive is that it frees us from having to wait for projects like GBIF and GenBank to support annotations. We don't need to wait, we can simply do it ourselves right now. We can add overlays that augment existing data (e.g., adding original publications to EOL web pages), or flag errors. Take the example bookmarklet from here for a spin and see what it can do. It's very crude, but I think it gives an indication of the potential of this approach.

So, "all" we need is a centralised, editable, database of annotations that we can hook the bookmarklet into. Simples.

Friday, December 07, 2012

Elsevier articles have interactive phylogenies

Say what you will about Elsevier, they are certainly exploring ways to re-imagine the scientific article. In a comment on an earlier post Fabian Schreiber pointed out that Elsevier have released an app to display phylogenies in articles they publish. The app is based on jsPhyloSVG and is described here. You can see live examples in these articles:

Matos-Maraví, P. F., Peña, C., Willmott, K. R., Freitas, A. V. L., & Wahlberg, N. (2013). Systematics and evolutionary history of butterflies in the “Taygetis clade” (Nymphalidae: Satyrinae: Euptychiina): Towards a better understanding of Neotropical biogeography. Molecular Phylogenetics and Evolution, 66(1), 54–68. doi:10.1016/j.ympev.2012.09.005
Poćwierz-Kotus, A., Burzyński, A., & Wenne, R. (2010). Identification of a Tc1-like transposon integration site in the genome of the flounder (Platichthys flesus): A novel use of an inverse PCR method. Marine Genomics, 3(1), 45–50. doi:10.1016/j.margen.2010.03.001

Thursday, December 06, 2012

NEXUS parser and tree viewer in Javascript

Following on from the SVG experiments I've started to put some of the Javascript code for displaying phylogenies on Github. Not a repository yet, but as gists, little snippets of code. Mike Bostock has created http://bl.ocks.org/ which makes it possible to host gists as working examples, so you can play with the code "live".

The first gist takes a Newick tree, parses it and displays a tree. You can try it at https://bl.ocks.org/d/4224658/.
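For readers who haven't seen one, a Newick parser can be quite small. The following is a minimal sketch rather than the code in the gist: it handles nested parentheses, node labels, and branch lengths, but not quoted labels or comments, and it strips all whitespace first on the assumption that unquoted labels contain none.

```javascript
// Sketch only: not the code from the gist. Handles labels and branch
// lengths, but not quoted labels or [comments].
function parseNewick(text) {
  const s = text.replace(/\s+/g, ''); // unquoted labels contain no whitespace
  let pos = 0;

  function parseNode() {
    const node = { name: '', length: null, children: [] };
    if (s[pos] === '(') {
      pos++;                              // skip '('
      node.children.push(parseNode());
      while (s[pos] === ',') {
        pos++;                            // skip ','
        node.children.push(parseNode());
      }
      if (s[pos] !== ')') throw new Error('Expected ")" at position ' + pos);
      pos++;                              // skip ')'
    }
    // Optional label, read up to the next structural character.
    const start = pos;
    while (pos < s.length && !'(),:;'.includes(s[pos])) pos++;
    node.name = s.slice(start, pos);
    // Optional branch length after ':'.
    if (s[pos] === ':') {
      pos++;
      const lengthStart = pos;
      while (pos < s.length && !'(),;'.includes(s[pos])) pos++;
      node.length = parseFloat(s.slice(lengthStart, pos));
    }
    return node;
  }

  const root = parseNode();
  if (s[pos] !== ';') throw new Error('Expected ";" at position ' + pos);
  return root;
}
```

The result is a nested `{name, length, children}` structure, which is a convenient starting point for computing node coordinates and emitting SVG.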

The second gist takes a basic NEXUS file containing a TREES block and displays a tree (try it at http://bl.ocks.org/d/4229068/ ). You can grab example NEXUS tree files from TreeBASE, such as tree Tr57874.
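As a sketch of what "basic" might mean here, the following pulls the tree descriptions out of a TREES block with a couple of regular expressions. It is not the gist's code and not a full NEXUS parser: it ignores any TRANSLATE table and assumes the only bracketed comment is an optional rooting flag such as `[&R]` before the tree.

```javascript
// Sketch only: a regex-based reader for simple files, not a full NEXUS
// parser and not the code from the gist. TRANSLATE tables are ignored.
function readNexusTrees(nexus) {
  // Grab everything between "BEGIN TREES;" and the first "END;".
  const block = /begin\s+trees\s*;([\s\S]*?)end\s*;/i.exec(nexus);
  if (!block) return [];
  const trees = [];
  // TREE <label> = [optional rooting comment] <newick>;
  const treeRe = /\btree\s+([^\s=]+)\s*=\s*(?:\[[^\]]*\]\s*)?([^;]+;)/gi;
  let m;
  while ((m = treeRe.exec(block[1])) !== null) {
    trees.push({ label: m[1], newick: m[2].trim() });
  }
  return trees;
}
```

Each `newick` string can then be handed to a Newick parser; if the file has a TRANSLATE table, the leaf labels would still need mapping back to taxon names.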

Why am I doing this?
Apart from "because it's fun" there are two reasons. The first is that I want a simple way to display phylogenetic trees in web pages, and doing this entirely in the web browser (Javascript parses the tree and renders it in SVG) saves me having to code this on my server. Being able to do this in the browser opens up the opportunity to embed tree descriptions in HTML, for example, and have the browser render the tree. This means the same web page can have machine-readable data (the tree description) but also generate a nice tree for the reader. As an aside, it also shows that TreeBASE could display perfectly good, interactive trees without resorting to a Java applet.

The other reason is that the web seems to be moving to Javascript as the default language, and JSON as the standard data format. Instead of large chunks of "middleware" (written in a scripting language such as Perl, PHP, or, gack, Java) which is responsible for talking to databases on the server and sending static HTML to the web browser, we now have browsers that can support sophisticated, interactive interfaces built using HTML and Javascript. On the server side we have databases that speak HTTP (essentially removing the need for middleware), store JSON, and use Javascript as their programming language (e.g., CouchDB). In short, it's Javascript, Javascript, everywhere.

Tuesday, June 12, 2012

Using a zoomable treemap to visualise a taxonomic classification

One visualisation method I keep coming back to is the treemap. Each time I experiment with them I learn a little bit more, but I usually end up abandoning them (with the exception of using quantum treemaps to display bibliographic data). But they keep calling me back.

My latest experiment builds on some earlier thoughts on quantum treemaps, but tackles two issues that have kept bugging me. The first is that quantum treemaps are limited to hierarchies that are only two levels deep (e.g., family → genus → species). This is because, unlike regular treemaps where you are slicing and dicing a rectangle of predetermined size, when you construct a quantum treemap you don't know how big it will be until you've made it (this is because you want to ensure that every item in the hierarchy can be displayed at the same size, and fitting them in may require you to tweak the size of the treemap). Given that taxonomic classifications have > 2 levels this is a problem. One approach is to construct quantum treemaps for the lower parts of the classification, then pack those into a larger rectangle. This is an instance of the packing problem. After Googling for a bit I came across this code for packing rectangles, which was easy to follow and gave reasonable results.
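To give a flavour of the packing step, here is a very simple "shelf" packer. It is not the code linked above and it is far from optimal; it just sorts the rectangles (here, the finished quantum treemaps) by height and fills rows of a fixed width. The name `packRectangles` and the fixed-width assumption are mine.

```javascript
// Sketch only: a simple "shelf" packer, not the code the post links to.
// rects are {w, h}; they are placed in rows inside a container of fixed width.
function packRectangles(rects, containerWidth) {
  // Tallest first, remembering each rectangle's original position.
  const sorted = rects
    .map((r, i) => ({ w: r.w, h: r.h, index: i }))
    .sort((a, b) => b.h - a.h);
  const placed = new Array(rects.length);
  let x = 0, y = 0, shelfHeight = 0;
  for (const r of sorted) {
    if (x > 0 && x + r.w > containerWidth) {
      // Doesn't fit on this shelf: start a new one below it.
      x = 0;
      y += shelfHeight;
      shelfHeight = 0;
    }
    placed[r.index] = { x: x, y: y, w: r.w, h: r.h };
    x += r.w;
    shelfHeight = Math.max(shelfHeight, r.h);
  }
  return { width: containerWidth, height: y + shelfHeight, rects: placed };
}
```

The returned positions are in the same order as the input, so each sub-treemap can simply be translated to its `x`, `y` offset inside the parent rectangle.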

The second problem is that I want the treemap to be interactive. I want to be able to zoom in and out and navigate around the treemap. After more Googling, I came across the Zoomooz.js library which makes web page elements zoom (for a pretty mind-blowing example of what can be done see impress.js), but I decided I want to work with SVG. After playing with examples from Keith Wood's jQuery SVG plugin I started to get the hang of creating zoomable visualisations in SVG.

Here's a video of what I've come up with so far (you can see this live at http://iphylo.org/~rpage/zoomrect/primates.html). This is an interactive display of the Catalogue of Life 2010 classification of primates, with images from EOL. It's crude, there are some obvious issues with redrawing images, labels, etc., but it gives a sense of what can be done. With care this could probably be scaled up to handle the entire Catalogue of Life classification. With a bit more care, it could probably be optimised for the iPad, which would be a fun way to navigate through the diversity of life.

Friday, November 18, 2011

Nature iPhone app clone in GitHub

One thing I'm increasingly conscious of is that I've a lot of demos and toy projects hanging around and the code for most of these isn't readily available. So, I plan to clean these up and put them in GitHub so others can explore the code, and reuse it if they see fit.

First up is the code to create a HTML+Javascript clone of Nature's iPhone app, as described in an earlier post.



There's a live version of the clone here, and the code is now available from GitHub at https://github.com/rdmpage/natureiphone.


Thursday, December 09, 2010

Viewing scientific articles on the iPad: cloning the Nature.com iPhone app using jQuery Mobile

Over the last few months I've been exploring different ways to view scientific articles on the iPad, summarised here. I've also made a few prototypes, either from scratch (such as my response to the PLoS iPad app) or using Sencha Touch (see Touching citations on the iPad).

Today, it's time for something a little different. The Sencha Touch framework I used earlier is huge and wasn't easy to get my head around. I was resigning myself to trying to get to grips with it when jQuery Mobile came along. Still in alpha, jQuery Mobile is very simple and elegant, and writing an app is basically a case of writing HTML (with a little Javascript here and there if needed). It has a few rough edges, but it's possible to create something usable very quickly. And, it's actually fun.

So, to learn a bit more about how to use it, I decided to see if I could write a "clone" of Nature.com's iPhone app (which I reviewed earlier). Nature's app is in many ways the most interesting iOS app for articles because it doesn't treat the article as a monolithic PDF, but rather it uses the ePub format. As a result, you can view figures, tables, and references separately.

The clone

You can see the clone here.



I've tried to mimic the basic functionality of the Nature.com app in terms of transitions between pages, display of figures, references, etc. In making this clone I've focussed on just the article display.

A web app is going to lack the speed and functionality of a native app, but is probably a lot faster to develop. It also works on a wider range of platforms. jQuery Mobile is committed to supporting a wide range of platforms, so this clone should work on platforms other than the iPad.

The Nature.com app has a lot of additional functionality apart from just displaying articles, such as listing the latest articles from Nature.com journals, managing a user's bookmarks, and enabling the user to buy subscriptions. Some of this functionality would be pretty easy to add to this clone, for example by consuming RSS feeds to get article lists. With a little effort one could have a simple, Web-based app to browse Nature content across a range of mobile devices.

Technical stuff

Nature's app uses the ePub format, but Nature's web site doesn't provide an option to download articles in ePub format. However, if you use an HTTP debugging proxy (such as Charles Proxy) when using Nature's app you can see the URLs needed to fetch the ePub file.

I grabbed a couple of ePub files for articles in Nature Communications and unzipped them (.epub files are zip files). The iPad app is a single HTML file that uses some Ajax calls to populate the different views. One Ajax call takes the index.html that has the article text and replaces the internal and external links with calls to Javascript functions. An article's references, figure captions, and tables are stored in separate XML files, so I have some simple PHP scripts that read the XML and extract the relevant bits. Internal links (such as to figures and references) are handled by jQuery Mobile. External links are displayed within an iFrame.

There are some intellectual property issues to address. Nature isn't an Open Access journal, but some articles in Nature Communications are (under the Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License), so I've used two of these as examples. When it displays an article, Nature's app uses Droid fonts for the article heading. These fonts are supplied as an SVG file contained within the ePub file. Droid fonts are available under an Apache License as TrueType fonts as part of the Android SDK. I couldn't find SVG versions of the fonts in the Android SDK, so I use the TrueType fonts (see Jeffrey Zeldman's Web type news: iPhone and iPad now support TrueType font embedding. This is huge.). Oh, and I "borrowed" some of the CSS from the style.css file that comes with each ePub file.

Friday, October 08, 2010

Towards an interactive DjVu file viewer for the BHL

The bulk of the Biodiversity Heritage Library's content is available as DjVu files, which package together scanned page images and OCR text. Websites such as BHL or my own BioStor display page images, but there's no way to interact with the page content itself. Because it's just a bitmap image there's no obvious way to do simple things such as select and copy some text, click on some text and correct the OCR, or highlight some text as a taxonomic name or bibliographic citation. This is frustrating, and greatly limits what we can do with BHL's content.

In March I wrote a short post DjVu XML to HTML showing how to pull out and display the text boxes for a DjVu file. I've put this example, together with links to the XSLT file I use to do the transformation online at Display text boxes in a DjVu page. Here's an example, where each box (a DIV element) corresponds to a fragment of text extracted by OCR software.

(Figure: text boxes overlaid on a DjVu page.)

The next step is to make this interactive. Inspired by Google's Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?), I've revisited this problem. One thing the Google PDF viewer does nicely is enable you to select a block of text from a PDF page, in the same way that you can in a native PDF viewer such as Adobe Acrobat or Mac OS X Preview. It's quite a trick, because Google is displaying a bitmap image of the PDF page. So, can we do something similar for DjVu?

The thing I'd like to do is something like what is shown below: drag a "rubber band" on the page and select all the text that falls within that rectangle.

(Figure: a rubber-band selection of text on a page.)

This boils down to knowing for each text box whether it is inside or outside the selection rectangle:

(Figure: text boxes inside and outside the selection rectangle.)


Implementation

We could try and solve this by brute force, that is, query each text box on the page to see whether it overlaps with the selection or not, but we can make use of a data structure called an R-tree to speed things up. I stumbled across Jon-Carlos Rivera's R-Tree Library for Javascript, and so was inspired to try and implement DjVu text selection in a web browser using this technique.

The basic approach is as follows:

  1. Extract text boxes from DjVu XML file and lay these over the scanned page image.

  2. Add each text box to an R-tree index, together with the "id" attribute of the corresponding DIV on the web page, and the OCR text string from that text box.

  3. Track mouse events on the page. When the user clicks with the mouse we create a selection rectangle ("rubber band"), and as the mouse moves we query the R-tree to discover which text boxes have any portion of their extent within the selection rectangle.

  4. Text boxes in the selection have their background colour set to a semi-transparent shade of blue, so that the user can see the extent of the selected text. Boxes outside the selection are hidden.

  5. When the user releases the mouse we get the list of text boxes from the R-tree, and concatenate the text corresponding to each box, and finally display the resulting selection to the user.
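The core of steps 3 and 5 can be sketched as follows. To keep the example self-contained, a linear scan stands in for the R-tree search (the R-tree would return the same boxes, just faster on a busy page), and all the names are hypothetical. Boxes are assumed to be `{id, x, y, w, h, text}` in page pixels.

```javascript
// Sketch only: a linear scan stands in for the R-tree search, and all
// names are hypothetical. Boxes are {id, x, y, w, h, text} in page pixels.

// Normalise a drag from any corner into a rectangle with positive size.
function rubberBand(start, end) {
  return {
    x: Math.min(start.x, end.x),
    y: Math.min(start.y, end.y),
    w: Math.abs(end.x - start.x),
    h: Math.abs(end.y - start.y)
  };
}

// True if any portion of rectangle a falls within rectangle b.
function overlaps(a, b) {
  return a.x < b.x + b.w && b.x < a.x + a.w &&
         a.y < b.y + b.h && b.y < a.y + a.h;
}

// Concatenate the text of the selected boxes in reading order.
function selectText(boxes, selection) {
  return boxes
    .filter(box => overlaps(box, selection))
    .sort((a, b) => (a.y - b.y) || (a.x - b.x)) // top to bottom, then left to right
    .map(box => box.text)
    .join(' ');
}
```

Sorting on the exact `y` value is a simplification: OCR boxes on the same line of text rarely share an identical top edge, so a real implementation would group boxes into lines first.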



Copying text

So far so good, but what can we do with the selected text? One obvious thing would be to copy and paste it (for example, we could select a species distribution and paste it into a text editor). Since all we've done is highlight some DIVs on a web page, how can we get the browser to realise that it has some text it can copy to the clipboard? After browsing Stack Overflow I came across this question, which gives us some clues. It's a bit of a hack, but behind the page image I've hidden a TEXTAREA element, and when the user has selected some text I populate the TEXTAREA with the corresponding text, then set the browser's selection range to that text. As a consequence, the browser's Copy command (⌘C on a Mac) will copy the text to the clipboard.

Demo

You can view the demo here. It only works in Safari and Chrome, as I've not had a chance to address cross-browser compatibility. It also works on the iPad, which seems a natural device to support interactive editing and annotation of BHL text, but you need to click on the button labelled "On iPad click here to select text" before selecting text. This is an ugly hack, so I need to give a bit more thought to how to support the iPad touch screen, while still enabling users to pan and zoom the page image.

Next steps

This is all very crude, but I think it shows what can be done. There are some obvious next steps:

  • Enable selected text to be edited so that we can correct the underlying OCR text.

  • Add tools that operate on the selected text, such as checking whether it is a taxonomic name or, if it is a bibliographic citation, attempting to parse it and locate it online (for example, using David Shorthouse's reference parser).

  • Select parts of the page image itself, so that we could extract a figure or map.

  • Add "post it note" style annotations.

  • Add services that store the edits and annotations, and display annotations made by others.


Lots to do. I foresee a lot of Javascript hacking over the coming weeks.