Saturday, December 31, 2005

Bone Rooms, Bird Bodies, and Biodiversity Informatics

Bone Rooms, Bird Bodies, and Biodiversity Informatics is an old article now, but it's a nice summary of what biodiversity informatics is about.

Some people believe that museums contain only musty air, stuffy docents, and pure boredom. However, tucked away behind a mysterious door marked "Museum Staff Only" is a dynamic and ever-growing resource few of us are lucky enough to see in person: the museum collection itself. Whether you imagine graybeards stirring up dust as they pin shiny beetles into tiny boxes or a sparkling modern facility, every museum's beating heart is its hidden collection of specimens and associated library of descriptive notebooks. These collections are anything but boring, and many are now online.

Edit script for classifications

One of the first concrete things to emerge from this research is a paper with Gabriel Valiente entitled An edit script for taxonomic classifications.

Abstract The NCBI taxonomy provides one of the most powerful ways to navigate sequence data bases but currently users are forced to formulate queries according to a single taxonomic classification. Given that there is not universal agreement on the classification of organisms, providing a single classification places constraints on the questions biologists can ask. However, maintaining multiple classifications is burdensome in the face of a constantly growing NCBI classification. In this paper, we present a solution to the problem of generating modifications of the NCBI taxonomy, based on the computation of an edit script that summarises the differences between two classification trees. Our algorithms find the shortest possible edit script based on the identification of all shared subtrees, and only take time quasi linear in the size of the trees because classification trees have unique node labels.

The basic idea is to look for matching subtrees in two classifications (labelled rooted trees), then compute a script that transforms one tree into the other. I think the idea is neat, and we have a basic implementation available (written in C++ using the Graph Template Library). I haven't yet made practical use of it though...
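To make the idea concrete, here is a minimal Python sketch. It is not the algorithm from the paper: it makes no attempt to find the shortest script or to run in quasi-linear time, and it simply relies on node labels being unique. Each classification is assumed to be a dict mapping a node label to its parent label (the root maps to None); the function names and the toy labels are made up for illustration.

```python
def children(parent_of):
    """Invert a child -> parent mapping into parent -> set of children."""
    kids = {node: set() for node in parent_of}
    for node, parent in parent_of.items():
        if parent is not None:
            kids[parent].add(node)
    return kids


def shared_subtrees(a, b):
    """Labels whose whole subtree is identical in both classifications."""
    kids_a, kids_b = children(a), children(b)
    memo = {}

    def same(node):
        # Recursive for brevity; a very deep tree would need an iterative walk.
        if node not in memo:
            memo[node] = (
                node in kids_b
                and kids_a[node] == kids_b[node]
                and all(same(child) for child in kids_a[node])
            )
        return memo[node]

    return {node for node in kids_a if same(node)}


def depth(parent_of, node):
    """Number of edges between a node and the root."""
    d = 0
    while parent_of[node] is not None:
        node = parent_of[node]
        d += 1
    return d


def edit_script(a, b):
    """A naive (not necessarily shortest) script turning tree a into tree b."""
    script = []
    # Insert new nodes top-down so that each parent exists before its children.
    for node in sorted(set(b) - set(a), key=lambda n: (depth(b, n), n)):
        script.append(("insert", node, b[node]))
    # Re-parent nodes that occur in both trees but hang somewhere else.
    for node in sorted(set(a) & set(b)):
        if a[node] != b[node]:
            script.append(("move", node, b[node]))
    # Delete obsolete nodes bottom-up.
    for node in sorted(set(a) - set(b), key=lambda n: (-depth(a, n), n)):
        script.append(("delete", node))
    return script


def apply_script(a, script):
    """Replay a script on a copy of tree a."""
    tree = dict(a)
    for op, node, *rest in script:
        if op == "delete":
            del tree[node]
        else:  # "insert" and "move" both set the node's parent
            tree[node] = rest[0]
    return tree


# Toy classifications: group B is replaced by a new group C.
old = {"Life": None, "A": "Life", "B": "Life", "a1": "A", "a2": "A", "b1": "B"}
new = {"Life": None, "A": "Life", "C": "Life", "a1": "A", "a2": "A", "b1": "C"}

print(shared_subtrees(old, new))
print(edit_script(old, new))
```

In this toy case the subtree rooted at A is shared, so only the part of the tree around B and C needs any operations. Replaying the script with apply_script(old, edit_script(old, new)) gives back the new classification.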

Wednesday, December 21, 2005

Drupal and Atom

Minor technical matter, but I've discovered that the news aggregator for Drupal doesn't read Atom feeds, such as those provided by Blogger, and hence the Atom feed for iPhylo (this blog) did not show up in the Systematic Biology web site (which uses Drupal). A quick Google revealed this solution, and so FeedBurner to the rescue. The iPhylo feed on the Systematic Biology site is provided by FeedBurner, not directly from iPhylo.

Monday, December 19, 2005

Structural-based queries

The IEEE SMC Society's eNewsletter has a short article on work in Jason Wang's group on structure-based queries.

The goal of our research project is to produce algorithms, data structures, and software that approach the speed of keyword-based search engines for structure-based queries on biological databases. Thanks to previous and ongoing research, searching by attribute-value, by text, and by path expression has become a sophisticated technology. Searching by topological or physical structure, especially for biological databases and especially for approximate matches, is still an art.

Saturday, December 17, 2005

Google, Yahoo, and the end of taxonomy?

On Wednesday, December 7th, I gave a talk at the Systematics Association's AGM in London, with the slightly tongue-in-cheek title Google, Yahoo, and the end of taxonomy?. It summarises some of the ideas that led me to create

For fun I've made a Quicktime movie of the presentation. Sadly there is no sound. Be warned that if you are offended by even mild nudity, this talk is not for you.

The presentation style was inspired by Dick Hardt's wonderful presentation on Identity 2.0.

Distributed querying for species is a little toy I created to investigate how easy it would be to create a web page for each species of organism "on the fly" by using web services. That is, querying remote services, such as NCBI, Yahoo Image Search, and Google Scholar, and assembling the results into a single page. The toy seems useful, and has attracted a little attention (e.g., Science's Netwatch column for 16 December 2005). There's a blog for iSpecies that will record developments as I play around with these ideas some more.
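As a rough illustration of the "assemble a page on the fly" idea, here is a small Python sketch. It is not the iSpecies code. The only real service detail is the NCBI E-utilities esearch address, and even that URL is only built, never fetched; the image and literature sources are hypothetical stand-ins, as are all the function names.

```python
from html import escape
from urllib.parse import urlencode

# Real NCBI E-utilities search endpoint; this sketch only builds the URL.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def ncbi_taxonomy_query(name):
    """Build (but do not send) a taxonomy search URL for a species name."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": "taxonomy", "term": name})


def assemble(name, sources):
    """Call each source with the name; a failing source yields an empty list."""
    sections = {}
    for label, fetch in sources.items():
        try:
            sections[label] = list(fetch(name))
        except Exception:
            # One unreachable service should not break the whole page.
            sections[label] = []
    return sections


def render(name, sections):
    """Turn the collected results into a very plain HTML page."""
    parts = [f"<h1>{escape(name)}</h1>"]
    for label, items in sections.items():
        parts.append(f"<h2>{escape(label)}</h2>")
        rows = "".join(f"<li>{escape(str(item))}</li>" for item in items)
        parts.append(f"<ul>{rows}</ul>")
    return "\n".join(parts)


def broken_service(name):
    """Hypothetical source that simulates a remote service being down."""
    raise TimeoutError("simulated outage")


# Hypothetical sources: each is just a callable taking a species name.
sources = {
    "Taxonomy": lambda name: [ncbi_taxonomy_query(name)],
    "Images": lambda name: [],  # placeholder for an image search
    "Literature": broken_service,
}

page = render("Homo sapiens", assemble("Homo sapiens", sources))
print(page)
```

Treating every service as a plain callable keeps the page assembly independent of any one provider, so a source can be swapped or dropped without touching the rest.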

The European Tree of Life

When Europe was seriously considering a Tree of Life effort to match the NSF ATOL effort (for some background and a hint of politics involved, read the letter to Science by Vincent Savolainen and Mark W. Chase of Kew Gardens doi:10.1126/science.302.5652.1894a), I assembled a web page of various links that I felt were of interest. The EU's TOL effort collapsed, but the page lives on.

Visualising the Tree of Life

This is an area which has received a lot of attention. For some examples see Rebecca Shapley's Tree of Life Gallery, and the information aesthetics post. I've used hyperbolic trees in my (now rather old) Glasgow Taxonomy Name Server.

To be honest, I've never felt that we've hit on a truly compelling way to visualise large trees. I'm particularly concerned that trees are sparse (there's a lot of empty space when drawing a tree), and it is often not obvious how to navigate around them. Contrast this with Google Maps, which has spawned masses of applications (see Google Maps Mania). I think maps are something everybody "gets", partly because you only move in two dimensions, and they are predictable (if I start in Glasgow and move west, I know that sooner or later I will hit the Atlantic Ocean). Furthermore, maps have a universal coordinate system, so that an item can be located by its latitude and longitude, and maps are visually dense (zoom in and more information appears - this is best seen on the wonderful Google Earth). In evolutionary and taxonomic trees, by contrast, there is no predictability (if I move left, what taxa will I encounter?), and no coordinate system (we can't easily add things to a tree without, of course, changing the tree).

Seems to be an area ripe for more work.

Towards a Taxonomically Intelligent Phylogenetic Database

For a short, but reasonably technical summary of what I think the issues are, please read this "Technical Report", which I presented at the Workshop on Database Issues in Biological Databases (DBiBD) in Edinburgh in January 2005. This document is itself based on a BBSRC grant proposal which was funded. Here is the abstract.

This note outlines some of the key intellectual obstacles that stand in the way of creating a usable phylogenetic database. These challenges include the need to accommodate multiple taxonomic names and classifications, and the need for tools to query trees in biologically meaningful ways. Until these problems are addressed, and a taxonomically intelligent phylogenetic database created, much of our phylogenetic knowledge will languish in the pages of journals.

One of my biggest concerns is the growing gap between how many phylogenies are published each year, and those that make it into the best known phylogenetic database we have, TreeBASE. This graph shows the cumulative growth of publications on phylogenetics, based on the number of papers found in the Web of Science by searching on the key words “molecular” and “phylogenetic” since 1981, alongside the growth of TreeBASE, which launched in 1996 (a study in TreeBASE is equivalent to a single paper). The idea for this diagram came from Mark Pagel's article "Inferring the historical patterns of biological evolution" (doi:10.1038/44766).


iPhylo is where I hope to write (or perhaps more correctly, "brain dump") ideas on phyloinformatics, with special reference to my current hobby-horse -- the lack of a decent database of evolutionary trees. As well as thoughts, ideas, suggestions, and the occasional rant, I hope to be able to report on some progress along the way. Stay tuned...