iPhylo: June 2008

Roderic D. M. Page

Monday, June 23, 2008

PhyQL

Hasan Jamil has released PhyQL, a visual system for querying phylogenetic information. To quote from the web site:

Popular phylogenetic databases such as TreeBASE, PhyloFinder, TreeFAM offer complex text-based web forms for structure queries. Still there seems a great need for intelligent visual query formation based on a phylogenetic query language for content exploration. PhyQL offers a visual query design interface where the user can create simple to complex queries based visual query operators. The query language is translated to a list of datalog queries, then executed in XSB, an extension of Prolog. Separating the application layer from the data layer by a logic layer reduces query tools development time. Moreover, PhyQL offers interactive tree visualization which is very convenient for viewing very large trees.

There is also a YouTube screencast:

I haven't had a chance to play with it yet. PhyQL was originally described by Jamil et al. "Querying phylogenies visually", BIBE 2001 (doi:10.1109/BIBE.2001.974405).

Friday, June 20, 2008

I've put the first version of "tvwidget" into Google Code. This is an HTML-only widget to display large evolutionary trees (you can see how my thinking on this unfolded by following my earlier posts, starting with Visualising very big trees, Part V). tvwidget itself is a C++ program that takes a tree and generates the image tiles and Javascript for the viewer. It's poorly documented; I'll deal with this once I get some time.

You can see a live demo of tvwidget displaying Bininda-Emonds et al.'s mammal supertree published in Nature (doi:10.1038/nature05634). The tree is the first one in Supplementary Figure 1.

Friday, June 13, 2008

From PDFs to Google Earth

I've added a service to bioGUID that takes a PDF and attempts to extract latitude and longitude data from the PDF, returning those co-ordinates in either a Google Earth KML file, or in JSON format. This is one of a bunch of services that I'm adding to bioGUID to support some of the data mining that I'm doing.

To see what it can do, try this URL to get a list of localities in the paper Description of eight new species of shrub frogs (Ranidae: Rhacophorinae: Philautus) from Sri Lanka.

Then try this one to get the KML file, and open it in Google Earth. The service uses a bunch of regular expressions to try and extract latitude and longitude pairs from the text (needless to say, there are nearly as many different ways to write a latitude and longitude as there are authors).
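To give a flavour of the regular-expression approach, here is a minimal Python sketch. It is not the code behind the bioGUID service: the pattern, the function names (`extract_pairs`, `to_kml`) and the sample text are all assumptions made for illustration, and the pattern handles only one of the many ways authors write coordinates (degrees, optional minutes and seconds, then a hemisphere letter).

```python
import re

# Assumed pattern: degrees, optional minutes, optional seconds, hemisphere letter.
# Real papers use many more formats than this one.
COORD = re.compile(
    r"""(?P<deg>\d{1,3})\s*°\s*
        (?:(?P<min>\d{1,2}(?:\.\d+)?)\s*['′]\s*)?
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]\s*)?
        (?P<hem>[NSEW])""",
    re.VERBOSE,
)


def to_decimal(match):
    """Convert one matched coordinate to signed decimal degrees."""
    value = float(match.group("deg"))
    value += float(match.group("min") or 0) / 60
    value += float(match.group("sec") or 0) / 3600
    return -value if match.group("hem") in "SW" else value


def extract_pairs(text):
    """Pair each latitude (N/S) with the longitude (E/W) that directly follows it."""
    matches = list(COORD.finditer(text))
    pairs = []
    for first, second in zip(matches, matches[1:]):
        if first.group("hem") in "NS" and second.group("hem") in "EW":
            pairs.append((to_decimal(first), to_decimal(second)))
    return pairs


def to_kml(pairs):
    """Wrap (lat, lon) pairs in a bare-bones KML document (KML wants lon,lat)."""
    placemarks = "".join(
        f"<Placemark><Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"
        for lat, lon in pairs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        f"{placemarks}</Document></kml>"
    )


# Made-up locality string, not taken from the paper mentioned above.
sample = "Holotype from a made-up locality, 07°23'N, 80°48'E, alt. 1200 m"
pairs = extract_pairs(sample)
kml = to_kml(pairs)
```

A real extractor would need a stack of such patterns (decimal degrees, hemisphere letter first, "lat."/"long." prefixes, and so on), which is why the output is best treated as candidate coordinates rather than trusted data.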

The ultimate aim is to assemble a bunch of Open Access PDFs (say, from Zootaxa), run them through this service, then display the result on Google Earth. Think of it as a geography of taxonomy.

Oh, and the irony of me criticising GBIF for displaying poor quality data, then adding to this by providing a service to extract yet more co-ordinates of possibly doubtful validity has not entirely escaped me...

Wednesday, June 11, 2008

More GBIF errors, courtesy of FishBase

Resurrecting iSpecies after moving it to a new folder on one of my servers, and browsing popular searches, I keep coming across clearly erroneous distributions. FishBase seems a major culprit. For example, the common pandora Pagellus erythrinus is a marine fish, yet GBIF displays numerous occurrences in mainland Africa (dots with black centre on map below).

What gives? Well, after struggling with the somewhat non-intuitive GBIF web site I found that the erroneous records are from FishBase. As with the frog example I blogged about earlier, the actual records have locality information indicating that most of them come from the Mediterranean, but the latitudes and longitudes are reversed. Swapping these, the records show a more believable distribution (white dots on SVG map below). If you don't see the map, use a decent web browser such as Safari 3 or Firefox 2. If you must use Internet Explorer, grab the RENESIS player.
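The swap test is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the bounding box for the Mediterranean is a rough guess, the function names are hypothetical, and the example record is invented rather than taken from FishBase or GBIF.

```python
def in_box(lat, lon, box):
    """True if the point lies inside box = (south, west, north, east)."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def fix_swapped(lat, lon, box):
    """Return (lat, lon, swapped).

    If the point is outside the expected box but lands inside it once
    latitude and longitude are exchanged, return the exchanged values
    and flag the record as swapped. Otherwise leave it alone.
    """
    if in_box(lat, lon, box):
        return lat, lon, False
    if in_box(lon, lat, box):
        return lon, lat, True
    return lat, lon, False


# Rough, assumed bounding box for the Mediterranean (south, west, north, east).
MEDITERRANEAN = (30.0, -6.0, 46.0, 36.5)

# Invented record: as given it falls in mainland Africa, swapped it falls near Sicily.
fixed = fix_swapped(15.0, 38.0, MEDITERRANEAN)
```

The expected region has to come from somewhere else, such as the locality text or a known range for the species, so a check like this can only flag likely swaps for review, not prove them.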

I know I've harped on about this before, but surely the time is ripe for some clever data cleaning? Especially if users start to lose their trust in GBIF.

Tuesday, June 10, 2008

Catalogue of Life as a treemap

I have an "on again/off again" relationship with treemaps. Lately, I've been taking another look, partly inspired by Björn Engdahl's MSc thesis Ordered and Unordered Treemap Algorithms and Their Applications on Handheld Devices. He describes a simple treemap algorithm which he calls Split Layout. It has the nice properties of having a good aspect ratio (most cells in the treemap are approximately square) and it keeps the cells in roughly the original order. This latter property is important, as one thing I find distracting with tree diagrams is when the order of the objects in the tree keeps changing.

I also have an "on again/off again" relationship with the Catalogue of Life, which is potentially very useful, but seems determined to undermine this with some poor design decisions. But, I finally bit the bullet and extracted a complete classification from the 2008 edition of the Catalogue of Life. I downloaded an ISO image, burnt a CD, installed it on a Windows box (gack), grabbed the MySQL database files, and put those on my MacBook Air. Using some tools I developed for working with the NCBI taxonomy, I wanted to extract the tree from the taxa table, only to discover that this table isn't a tree. Not all the taxa in the table are flagged is_accepted_name, and if you remove those that aren't, then the remaining taxa don't form a tree. It's clear that some taxa were orphaned when the table was created. For example, Enteromorpha flexuosa is not an accepted name, and is flagged as such in the taxa table, yet it has four child taxa that are accepted (Enteromorpha flexuosa subsp. linziformis, Enteromorpha flexuosa subsp. biflagellata, Enteromorpha flexuosa subsp. pilifera, and Enteromorpha flexuosa forma submarina). These taxa are orphaned in the tree. Eventually I gave up trying to extract the tree using SQL, and had to traverse the entire structure starting at the root node. This extracts a tree, at the cost of the orphans. It appears that Catalogue of Life haven't checked whether their classification is, in fact, a tree (OK, technically it is a forest as it is a set of disjoint trees comprising the eight kingdoms CoL recognises, but I make it a tree by rooting it on a node called "life").
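The root-down traversal, and the orphans it leaves behind, can be sketched as follows. This is a minimal Python illustration, not the tools actually used: the row layout `(taxon_id, parent_id, is_accepted_name)` and the ids are assumptions, with only the `is_accepted_name` flag taken from the description above.

```python
from collections import defaultdict


def extract_tree(rows, root_id):
    """Walk the accepted taxa from root_id and report what cannot be reached.

    rows: iterable of (taxon_id, parent_id, is_accepted_name) tuples
          (an assumed layout, not the real Catalogue of Life schema).
    Returns (reached, orphans): the ids visited from the root, and the
    accepted ids that no path from the root leads to.
    """
    children = defaultdict(list)
    accepted = set()
    for taxon_id, parent_id, is_accepted in rows:
        if is_accepted:
            accepted.add(taxon_id)
            children[parent_id].append(taxon_id)

    reached = []
    stack = [root_id]
    while stack:
        node = stack.pop()
        reached.append(node)
        stack.extend(children.get(node, []))

    orphans = accepted - set(reached)
    return reached, orphans


# Toy data: taxon 3 is not accepted, so its accepted children 4 and 5 are orphaned.
rows = [
    (1, None, True),
    (2, 1, True),
    (3, 1, False),
    (4, 3, True),
    (5, 3, True),
]
reached, orphans = extract_tree(rows, root_id=1)
```

Listing the orphans explicitly at least makes the cost of the traversal visible, and gives a starting point for re-attaching them to the nearest accepted ancestor if that is wanted.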

After much anguish, I have a tree. I then coded up Engdahl's algorithm, based on the pseudocode he provides on p. 31 of his thesis (I think there's a bug in his code as he doesn't deal with the case when the cell being partitioned is taller than it is wide, but this was easy to fix). One thing I was keen to do is just use HTML, no SVG or Flash. Here's an example of the treemap, showing the eight kingdoms. Each taxon is drawn proportional to log₁₀(n + 1), where n is the number of terminal taxa (i.e., species or below) in that taxon (the number of terminals is shown in each cell). The log scale was chosen to avoid mega-diverse groups crowding out the smaller taxa.
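For readers who want to experiment, here is a minimal Python sketch of a split-style layout with the log₁₀(n + 1) weighting. It is a reading of the general idea (cut the ordered list where about half the weight has accumulated, divide the rectangle along its longer side, recurse), not a transcription of Engdahl's pseudocode or of the PHP behind the treemap; the function names and the 400 × 300 canvas are assumptions.

```python
import math


def weight(n_terminals):
    """Cell weight used for drawing: log10(n + 1)."""
    return math.log10(n_terminals + 1)


def split_layout(items, x, y, w, h):
    """Lay out items in order inside the rectangle (x, y, w, h).

    items: list of (name, weight) pairs in display order.
    Returns a list of (name, x, y, w, h) cells in the same order.
    """
    if not items:
        return []
    if len(items) == 1:
        return [(items[0][0], x, y, w, h)]

    total = sum(wt for _, wt in items)
    # Cut the list where the running weight first reaches half the total,
    # always leaving at least one item on each side.
    running, cut = 0.0, 1
    for i, (_, wt) in enumerate(items[:-1], start=1):
        running += wt
        cut = i
        if running >= total / 2:
            break
    first, second = items[:cut], items[cut:]
    frac = sum(wt for _, wt in first) / total if total else cut / len(items)

    if w >= h:
        # Wide cell: divide left/right.
        fw = w * frac
        return (split_layout(first, x, y, fw, h)
                + split_layout(second, x + fw, y, w - fw, h))
    # Tall cell: divide top/bottom.
    fh = h * frac
    return (split_layout(first, x, y, w, fh)
            + split_layout(second, x, y + fh, w, h - fh))


# Terminal counts for the eight kingdoms, as listed in this post.
kingdoms = [
    ("Animalia", 892966), ("Archaea", 281), ("Bacteria", 9588),
    ("Chromista", 6855), ("Fungi", 33017), ("Plantae", 206843),
    ("Protozoa", 6435), ("Viruses", 1906),
]
items = [(name, weight(n)) for name, n in kingdoms]
cells = split_layout(items, 0, 0, 400, 300)
```

Because each split divides the longer side, cells tend to stay close to square, and because the list is only ever cut in two contiguous pieces, the cells come out in the original order.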

Animalia 892,966

Archaea 281

Bacteria 9,588

Chromista 6,855

Fungi 33,017

Plantae 206,843

Protozoa 6,435

Viruses 1,906

The live version is here. It's a bit crude (to go back up the tree just use your browser back button), but it's simple, and it's HTML. The underlying code is PHP, but it would be quite easy to convert this to Javascript to make a simple drop in widget. In addition to Björn Engdahl's algorithm, and the Catalogue of Life data, I should acknowledge Samson's code for generating colour gradients.
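The HTML-only part needs nothing more than absolutely positioned `div` elements. The following Python sketch shows the idea; it is not the PHP used for the live version, and the function name, the styling and the two sample cells are made up for illustration.

```python
from html import escape


def cells_to_html(cells, width, height):
    """Render (name, x, y, w, h) cells as absolutely positioned divs."""
    divs = []
    for name, x, y, w, h in cells:
        style = (
            f"position:absolute;left:{x:.0f}px;top:{y:.0f}px;"
            f"width:{w:.0f}px;height:{h:.0f}px;"
            "border:1px solid #fff;overflow:hidden"
        )
        divs.append(f'<div style="{style}">{escape(name)}</div>')
    return (
        f'<div style="position:relative;width:{width}px;height:{height}px">'
        + "".join(divs)
        + "</div>"
    )


# Two made-up cells filling a 200 x 100 canvas.
page = cells_to_html([("A & B", 0, 0, 120, 100), ("C", 120, 0, 80, 100)], 200, 100)
```

Wrapping each cell's label in a link to the same page with a different root taxon would give the drill-down behaviour described above, with the browser's back button serving as "up".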

There are all sorts of things that could be done to improve this. One approach would be to include exemplar pictures of the taxa in each cell, to help navigate in unfamiliar taxa. Denise Green and Rebecca Shapley's Teaching with a visual tree of life report has some examples of this idea (see their p. 86), and Marcos Weskamp (author of the very cool newsmap) has done a mockup for EOL using Flash.

As to the treemap idea itself, there are some fun things which could be done with it. I'm not convinced that it is great for navigation. However, it is probably very useful for showing changes over time. For example, imagine making the State of Observed Species report dynamic. Take the uBio RSS feed for new names, classify the new names, then colour the treemap cells by the number of new names (in a sense, this is a taxonomic version of newsmap).

Wednesday, June 04, 2008

Stained Glass

Something a little different: my dad's work features in the Winter 2008 issue of Art News New Zealand, in an article by Rob Garrett, who has a copy of the article on his web site. Dad designed eight stained glass windows for Westlake Boys High School in Auckland, New Zealand.
Rob Garrett describes them thus:

The challenge when designing stained glass windows as a commemoration of 50 years of a high school is tough, no matter how open the brief. There is no unifying faith; no character-driven narrative; and no beginning, middle and end in the dramatic sense. There is only bewildering complexity; layer upon layer of possibility. [Dugald] Page has tackled this complexity head on by establishing unifying strands that weave through all eight panels: prismatic colour, seasonal change and the metaphor of metamorphosis. Stitched into these strands is a palimpsest of iconic symbols of science and technology, sport, the Pacific, discovery, historic events, literature and the arts.

They are gorgeous to look at (the picture below shows them before they were mounted in the school hall).