Monday, February 28, 2011

Live demo of zooming a large tree

After the teaser on Friday (see Deep zooming a large 2D tree) I've put a live demo of my experiments with viewing a large tree online at:

http://iphylo.org/~rpage/deeptree/

The first example (Experiment 1) is the NCBI classification for frogs.

This version displays internal node labels, leaf labels (as many as can be displayed at a given zoom level), and works in Safari, Firefox, and Internet Explorer 8. Obviously this is all pretty rough, but take it for a spin; I'd welcome any feedback.

Friday, February 25, 2011

Deep zooming a large 2D tree

Here's a quick demo of a 2D large tree viewer that I'm working on. The aim is to provide a simple way to view and navigate very large trees (such as the NCBI classification) in a web browser using just HTML and Javascript. At the moment this is simply a viewer, but the goal is to add the ability to show "tracks" like a genome browser. For example, you could imagine columns appearing to the right of the tree showing you whether there are phylogenies available for these taxa in TreeBASE, images from Wikipedia, sparklines for sequencing activity over time, etc.

I'll blog some more on the implementation details when I get the chance, but it's pretty straightforward. Image tiles are generated from SVG images of the tree using ImageMagick, labelling is applied on the fly using GIS-style queries to a MySQL database that holds the "world coordinates" of the nodes in the tree (see the discussion of world coordinates on Google's Map API pages), and the zooming and tile fetching is based on Michal Migurski's Giant-Ass Image Viewer. Once I've tidied up a few things I'll put up a live demo so people can play with it.
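For the curious, here's a minimal sketch in Python of the two server-side pieces just described: mapping Google-style world coordinates to tiles, and the GIS-style label query. The node_labels table and its columns are hypothetical stand-ins, since I haven't described the actual schema.

```python
TILE_SIZE = 256  # the whole tree fits in a single 256 x 256 tile at zoom level 0

def world_to_tile(wx, wy, zoom):
    """Map a world coordinate to the (column, row) of the tile
    containing it at the given zoom level."""
    scale = 2 ** zoom
    return int(wx * scale // TILE_SIZE), int(wy * scale // TILE_SIZE)

def labels_in_viewport(cursor, x_min, x_max, y_min, y_max):
    """GIS-style bounding-box query: fetch the labels of nodes whose
    world coordinates fall inside the current viewport."""
    cursor.execute(
        "SELECT name, x, y FROM node_labels"
        " WHERE x BETWEEN %s AND %s AND y BETWEEN %s AND %s",
        (x_min, x_max, y_min, y_max),
    )
    return cursor.fetchall()
```

The nice thing about this arrangement is that the browser only ever asks for the handful of tiles and labels covering the visible region, so the size of the full tree doesn't matter much.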

Thursday, February 24, 2011

Why 3D phylogeny viewers don't work

Matt Yoder (@mjyoder) and I had a Twitter conversation yesterday about phylogeny viewers, prompted by my tweeting about my latest displacement activity, a 2D tree browser using the tiling approach made popular by Google Maps.

As part of that conversation, Matt tweeted:
RT @rdmpage: @mjyoder - I think 3D is the worse thing we could do, there's no natural mapping to 3D. <- meh, where's the imagination?

Well, Matt's imagination has gone into overdrive, and he's blogged about his ideas.

[Image: 3d_tree_browsing.jpg]


This issue deserves more exploration, but here are some quick thoughts. 3D has been used in a number of phylogeny browsers, such as Mike Sanderson's Paloverde, Walrus, and the Wellcome Trust's Tree of Life. I don't find any of them terribly successful, pretty as they may be. I think there are several problems with trees in general, and 3D versions in particular.

Trees aren't real
Trees aren't real in the same way that the physical world is (or even imagined physical worlds). Trees are conceptual structures. The history of web interfaces is littered with attempts to visualise conceptual space, for example to summarise search results. These have been failures; a simple top ten list, as used by Google, wins. I don't think this is because Google's designers lack imagination, it's because it works. Furthermore, that plain list of results is itself a very successful visualisation.


I think elaborate attempts to depict conceptual spaces on screens are mostly going to fail.

Trees are empty
Compared to, say, a geographic map, trees are largely empty space. In a map every pixel counts, in that it potentially represents something. Think of the satellite view in Google Maps. Each pixel on the screen has information. Trees are largely empty, hence much of the display space is wasted. Moving trees to 3D just gives us more space to waste.

Trees don't have a natural ordering
Even if we accept that trees are useful visualisations, they have problems. Given the tree ((1,2),(3,4)); we have a lot of (perhaps too much) freedom in how we can depict that tree. For example, both diagrams below depict this tree. In the x-axis there is a partial order of internal nodes (the ancestor of {1,2} must be to the right of the ancestor of {1,2,3,4}), but the tree ((1,2),(3,4)); says nothing about the relative ordering of {1,2} versus {3,4}. We are free to choose. A natural linear ordering would be divergence time, but estimates of those times can be contested, or unavailable.

[Figure: order.png, two depictions of the tree ((1,2),(3,4));]


Phylogenies are unordered trees in the sense that I can rotate any node about its ancestor and still have the same tree (compare the two trees above). Phylogenies are like mobiles.


The practical consequence of this is that different tree viewers can render the same tree in very different ways, making navigation across viewers unpredictable. Compare this to maps. Even if I use different projections, the maps remain recognisably similar, and most maps retain similar relationships between areas. If I look at a map of Glasgow and move left I will end up in the Atlantic Ocean, whether I use Google Maps or Microsoft Maps. Furthermore, trees grow in a way that maps don't (at least, not much). If I add nodes to a tree it may radically change shape, destroying navigation cues that I may have relied on before. Typically maps change by the addition of layers, not by moving bits around (paleogeographic maps excepted).
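To make the "mobile" point concrete, here's a tiny Python sketch (my own illustration, not taken from any viewer) that computes a rotation-independent canonical form, so two differently drawn versions of the same phylogeny compare equal:

```python
def canonical(node):
    """Return a canonical form of a tree given as nested tuples, by
    recursively sorting the children of every internal node."""
    if isinstance(node, tuple):
        return tuple(sorted((canonical(child) for child in node), key=repr))
    return node  # a leaf label

# Two renderings of ((1,2),(3,4)); that differ only by node rotations:
left = ((1, 2), (3, 4))
right = ((4, 3), (2, 1))
print(canonical(left) == canonical(right))  # True: the same unordered tree
```

A viewer could always draw the canonical form to get a predictable layout, but that ordering is an arbitrary convention imposed by the software, not something the tree itself supplies.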

Trees aren't 3D
There's nothing intrinsically 3D about trees, which means any mapping to 3D space is going to be arbitrary. Indeed, most 3D viewers simply avoid any mapping and show a 2D tree in 3D space, which seems rather pointless.

Perhaps it's because I don't play computer games much (I went through an Angry Birds phase, and occasionally pick up an Xbox controller, only to be mercilessly slaughtered by my son), but I'm not inspired by the analogy with computer games. I'm not denying that there are useful things to learn from games (I'm sure the controls in Google Earth owe something to games). But games also rely on a visceral connection with the gameplay, and an understanding of the visual vocabulary (how to unlock treasure, etc.). Matt's 3D model requires users to learn a whole visual vocabulary, much of which (e.g., "Fruit on your tree? Someone has left comment(s) or feedback.") seems forced.

My sense is that the most successful interfaces make the minimal demands on users, don't fight their intuition, and don't force them to accept a particular visualisation of their own cognitive space.

I'll write more about this once I get my 2D tree viewer into good enough shape to be shown. It will be a lot less imaginative than Matt's vision; all I'm shooting for is that it is usable.




Friday, February 18, 2011

Why metadata matters

Quick note to express the frustration I experience sometimes when dealing with taxonomic literature. As part of a frankly Quixotic desire to link every article cited in the Australian Faunal Directory (AFD) to the equivalent online resource (for example, in the Biodiversity Heritage Library using BioStor, or to a publisher web site using a DOI) I sometimes come across references that I should be able to find yet can't. Often it turns out that the metadata for the article is incorrect. For example, take this reference:
Report upon the Stomatopod crustaceans obtained by P.W. Basset-Smith Esq., surgeon R.N. during the cruise, in the Australia and China Sea, of H.M.S. "Penguin", commander W.V. Moore. Ann. Mag. Nat. Hist. Vol. 6 pp. 473-479 pl. 20B
which is in the Australian Faunal Directory (urn:lsid:biodiversity.org.au:afd.publication:087892ae-2134-4bb4-83ae-8b8cbd15b299). Using my OpenURL resolver in BioStor I failed to locate this article. Sometimes this is because the code I used to parse references from AFD mangles the reference, but not in this case. So, I Google the title and find a page in the Zoological catalogue of Australia: Aplacophora, Polyplacophora, Scaphopoda.


Here's the relevant part of this page:
[Screenshot from the Zoological catalogue of Australia]
Same as AFD, Ann. Mag. Nat. Hist. volume 6, pages 473-479, 1893.

In despair I looked at the BHL page for The Annals and Magazine of Natural History and discovered that there is no volume 6 published in 1893. There is, however, series 6. Oops! Browsing the BHL content I discovered the start of the article I'm looking for on BHL page 27734740, in volume 11 of series 6 of The Annals and Magazine of Natural History. Gotcha! So, I can now link AFD to BHL like this.
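To see why the original metadata failed, here's a sketch in Python of the corrected OpenURL query. The keys follow the usual OpenURL key-value conventions, but the exact parameters BioStor's resolver expects may differ, so treat this as illustrative:

```python
from urllib.parse import urlencode

# The AFD record said volume 6, 1893, but 6 is the series; the article
# is actually in volume 11 of series 6.
params = {
    "genre": "article",
    "title": "Annals and Magazine of Natural History",
    "volume": "11",   # not "6"
    "spage": "473",
    "epage": "479",
    "date": "1893",
}
print("http://biostor.org/openurl?" + urlencode(params))
```

With volume=6 the resolver has nothing to match against, which is exactly the failure described above.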

I should stress that in general AFD is a great resource for someone like me trying to link names to literature and, to be fair, with its reuse of volume numbers across series The Annals and Magazine of Natural History can be a challenge to cite. Usually the bibliographic details in AFD are accurate enough to locate articles in BHL or CrossRef, but every so often references get mangled, misinterpreted, or someone couldn't resist adding a few "helpful" notes to a field in the database, resulting in my parser failing. What is slightly alarming is how often when I Google for the reference I find the same, erroneous metadata repeated across several articles. This, coupled with the inevitable citation mutations, can make life a little tricky. The bulk of the links I'm making are constructed automatically, but there are a few cases where one is led on a wild goose chase to find the actual reference.

Although this is an example of why it matters to have accurate metadata, it can also be seen as an argument for using identifiers rather than metadata. If these references had stable, persistent identifiers (such as DOIs) that taxonomic databases cited, then we wouldn't need detailed metadata, and we could avoid the pain of rummaging around in digital archives trying to make sense of what the author meant to cite. Until taxonomic databases routinely use identifiers for literature, names and literature will be as ships that pass in the night.

Sunday, February 06, 2011

Why is the Atlas of Living Australia invisible to Google?

Jeff Atwood, one of the co-founders of Stack Overflow, recently wrote a blog post Trouble In the House of Google, where he noted that several sites that scrape Stack Overflow content (which Stack Overflow's CC-BY-SA license permits) appear higher in Google's search rankings than the original Stack Overflow pages. When Stack Overflow chose the CC-BY-SA license they made the assumption that:
...that we, as the canonical source for the original questions and answers, would always rank first...That's why Joel Spolsky and I were confident in sharing content back to the community with almost no reservations – because Google mercilessly penalizes sites that attempt to game the system by unfairly profiting on copied content.
Jeff Atwood's post goes on to argue that something is wrong with the way Google is ranking sites that derive content from other sites.

I was reminded of this post when I started to notice that searches for fairly obscure Australian animals would often return my own web site Australian Faunal Directory on CouchDB as the first hit. In one sense this is personally gratifying, but it can also be frustrating because when I Google these obscure taxa it's usually because I'm trying to find data that isn't already in one of my projects.

[Image: unotata.pic1.JPG]

But what I've also noticed is that the site that I obtained the data from, Australian Faunal Directory (AFD), rarely appears in the Google search results. In fact, there are taxa for which Google doesn't find the corresponding page in AFD. For example, if you search for Uxantis notata (shown here in an image from the Key to the planthoppers of Australia and New Zealand) the first hit(s) are from my version of AFD:
[Screenshot: Google search results for Uxantis notata]


Neither the original AFD, nor the Atlas of Living Australia (ALA), which also builds on AFD, appear in the top 10 hits.

Initially I thought this was probably an artefact. This is a pretty obscure taxon; maybe things like rounding error in computing PageRank are going to affect search rankings more than anything else. However, if I explicitly tell Google to search for Uxantis notata in the domain environment.gov.au (i.e., Uxantis notata site:environment.gov.au) I get no hits whatsoever:

[Screenshot: Google finds no results for Uxantis notata site:environment.gov.au]

Likewise, the same search restricted to ala.org.au finds nothing, nothing at all. Both AFD and the Atlas of Living Australia have pages for this taxon, here, and here, so clearly something is deeply wrong.

Why are the original providers of the data not appearing in Google search results at all? For someone like me who argues that sharing data is a good thing, and sites that aggregate and repurpose data will ultimately benefit the original data providers (for example by sending traffic and Google Juice) this is somewhat worrying. It seems to reinforce the fear that many data providers have: "if I share my data someone will make a better web site than mine and people will go to that web site, rather than the one I've created with my hard-won data." It may well be that data aggregators will score higher than data providers in Google searches, but I hadn't expected data providers to be virtually invisible.

Google isn't the problem
If a web site that I hacked together in a few days does better in Google searches than the rather richer pages published by sites such as ALA (with a budget of over AU$30 million), something is wrong. Unlike the Stack Overflow example discussed above, I don't think the problem here is with Google.
If we search in Google for an "iconic" Australian taxon by name, say the Koala Phascolarctos cinereus, Wikipedia is the first hit (which should be no surprise). ALA doesn't appear in the top ten. If we tell Google to just search the domain ala.org.au we get lots of pages from ALA, but not the actual species page for Phascolarctos cinereus. This suggests that there is something about the way ALA's website works that prevents Google indexing it properly. I'm also a little worried that a major biodiversity project which has as its aim
...to improve access to essential information on Australia’s biodiversity
is effectively invisible to Google.



Friday, February 04, 2011

Web Hooks and OpenURL: the screencast

Yesterday I posted notes on Web Hooks and OpenURL. That post was written when I was already late (you know, when you say to yourself "yeah, I've got time, it'll just take 5 minutes to finish this..."). The Web Hooks + OpenURL project is still very much a work in progress, but I thought a screencast would help explain why I think this is going to make my life a lot easier. It shows an example where I look at a bibliographic record in one database (AFD, the Australian Faunal Directory on CouchDB), click on a link that takes me to BioStor (where I can find the reference in BHL), then simply click on a button on the BioStor page to "automagically" update the AFD database. The "magic" is the Web Hook. The link I click on in the AFD database contains the identifier for that entry in the AFD, as well as a URL BioStor can call when it's found the reference (that URL is the "web hook").

Using Web Hooks and OpenURL from Roderic Page on Vimeo.



Thursday, February 03, 2011

Web Hooks and OpenURL: making databases editable

For me one of the most frustrating things about online databases is that they often can't be edited. For example, I've recently created a version of the Australian Faunal Directory on CouchDB, which contains a list of all animals in Australia, and a fairly comprehensive bibliography of taxonomic publications on those animals. What I'd like to do is locate those publications online. Using various scripts I've found DOIs for some 2,500 articles, and located nearly 4,900 articles in BHL, and added these to the database, but browsing the database (using, say, the quantum treemap interface) makes it clear there are lots of publications that I've missed.

It would be great if I could go to the Australian Faunal Directory on CouchDB and edit these on that site, but that would require making the data editable, and that means adding a user interface. And that's potentially a lot of work. Then, if I go to another database (say, my CouchDB version of the Catalogue of Life) and want to make that editable then I have to add an interface to that database as well. I could switch to using a wiki, which I've done for some projects (such as the NCBI to Wikipedia mapping), but wikis have their own issues (in particular, they don't easily support the kinds of queries I want to do).

There is, as they say, a third way: web hooks. I first came across web hooks when I discovered Post-Commit Web Hooks in Google Code. The idea is you can create a web service that gets called every time you commit code to the Google Code repository. For example, each time you commit code you can call a web hook that uses the Twitter API to tweet details of what you just committed (I tried this for a while, until some of my Twitter followers got seriously pissed off by the volume of tweets this was generating).

What has this to do with making databases editable? Well, imagine the following scenario. A web page displays a publication, but no DOI. However, the web page embeds an OpenURL in the form of a COinS (in other words, a URL with key-value pairs describing the publication). If you use a tool such as the OpenURL Referrer in Firefox you can use an OpenURL resolver to find that publication. Examples of OpenURL resolvers include bioGUID and BioStor. Let's say you find the publication, and it has a DOI. How do you tell the database about this? Well, you can try and find an email address of someone running the database so you can send them the information, but this is a hassle. What if the OpenURL resolver that you used to find the DOI could automatically tell the source database that it's found the DOI? That's the idea behind web hooks.
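A COinS, for those who haven't met one, is just an empty HTML span whose title attribute carries the OpenURL context object. Here's a minimal Python sketch of building one; the metadata values are invented for illustration:

```python
from urllib.parse import urlencode
from html import escape

def coins_span(metadata):
    """Build a COinS: an empty <span class="Z3988"> whose title
    attribute holds an OpenURL 1.0 context object in key-value form."""
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    pairs += [("rft." + key, value) for key, value in metadata.items()]
    return '<span class="Z3988" title="%s"></span>' % escape(urlencode(pairs))

print(coins_span({
    "jtitle": "Annals and Magazine of Natural History",
    "volume": "11",
    "spage": "473",
}))
```

Tools such as the OpenURL Referrer spot the Z3988 class, unpack the title attribute, and hand the key-value pairs to whatever resolver you've configured.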

I've started to experiment with this, and have most of the pieces working. Publication pages in Australian Faunal Directory on CouchDB have COinS that include two additional pieces of information: (1) the database identifier for the publication (in this case a UUID; in the hideously complex jargon of OpenURL this is the "Referring Entity Identifier"), and (2) the URL of the web hook. The idea is that an OpenURL resolver can take the OpenURL and try and locate the article. If it succeeds it will call the web hook URL supplied by the database, telling it "hey, I've found this DOI for the publication with this database identifier". The database can then update its data, so the next time a user visits the page for that publication in the database, the user will see the DOI. This has the huge advantage of persistence over tools that just modify the web page on the fly, such as David Shorthouse's reference parser: the database itself is updated, not just the web page.
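In terms of the COinS sketch above, the two extras are just two more key-value pairs in the context object. rfe_id is OpenURL's standard key for identifying the referring entity; the "webhook" key below is an invented placeholder (OpenURL has no standard key for a callback), and the UUID and URL are illustrative:

```python
from urllib.parse import urlencode

extras = urlencode([
    # (1) the database's own identifier for this publication record
    ("rfe_id", "urn:uuid:087892ae-2134-4bb4-83ae-8b8cbd15b299"),
    # (2) where the resolver should POST what it finds (hypothetical URL)
    ("webhook", "http://afd.example.org/hook"),
])
```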

In order to make this work, all the database needs to do is have a web hook, namely a URL that accepts POST requests. The heavy lifting of searching for the publication, or enabling users to correct and edit the data, can be devolved to a single place, namely the OpenURL resolver. As a first step I'm building an OpenURL resolver that displays a form in which the user can edit bibliographic details and launch searches in CrossRef (and soon BioStor). When the user is done they can close the form, which is when it calls the web hook with the edited data. The database can then choose to accept or reject the update.
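Such a web hook can be very small. Here's a hedged sketch in Python using Flask and CouchDB's HTTP API; the endpoint path, field names, database URL, and the accept-everything policy are all placeholders for whatever the real database would use:

```python
import requests
from flask import Flask, request

app = Flask(__name__)
COUCHDB = "http://localhost:5984/afd"  # hypothetical CouchDB database URL

@app.route("/hook", methods=["POST"])
def hook():
    """Receive an "I found a DOI for record X" message from the resolver."""
    record_id = request.form["id"]  # the rfe_id the COinS advertised
    doi = request.form["doi"]       # what the resolver found
    # Fetch the CouchDB document, add the DOI, and write it back
    # (the fetched document carries the _rev CouchDB needs for the update).
    doc = requests.get(f"{COUCHDB}/{record_id}").json()
    doc["doi"] = doi
    requests.put(f"{COUCHDB}/{record_id}", json=doc)
    return "", 204  # a real hook might validate before accepting the edit

if __name__ == "__main__":
    app.run(port=8000)
```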

Given that it's easy to create the web hook, and trivial to get a database to output an OpenURL with its internal identifier and the URL of the web hook, this seems like a lightweight way of making databases editable.