Wednesday, November 28, 2007

Transitive reduction

Quick note to self, having stumbled on the Wikipedia page on transitive reduction. Given a graph like this:

the transitive reduction is:

Note that the original graph has an edge a -> d, but this is absent after the reduction because we can get from a to d via b (or c).

What's the point? Well, it occurs to me that a quick way of harvesting information about existing taxonomies (e.g., if we want to assemble an all-embracing classification of life to help navigate a database) is to make use of the titles of taxonomic papers. For example, the title "Platyprotus phyllosoma, gen. nov., sp. nov., from Enderby Land, Antarctica, an unusual munnopsidid without natatory pereopods (Crustacea: Isopoda: Asellota)" gives us:

Crustacea -> Isopoda -> Asellota -> Platyprotus -> phyllosoma

From the paper and/or other sources we can get paths such as Asellota -> Munnopsididae -> Platyprotus and Isopoda -> Munnopsididae. Imagine that we have a set of these paths, and want to assemble a classification (for example, we want to grow the Species 2000 classification, which lacks this isopod). Here's the graph:

This clearly gives us information on the classification of the isopod, but it's not a hierarchy. The transitive reduction, however, is:

It would be fun to explore using this technique to mine taxonomic papers and automate the extraction of classifications, as well as names.
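
The isopod example above can be sketched in a few lines of Python. This is only a rough illustration, not a tested tool: the function names (edges_from_paths, transitive_reduction) are my own inventions, the graph is assumed to be acyclic, and an edge is simply dropped whenever its target can still be reached some other way.

```python
from collections import defaultdict


def edges_from_paths(paths):
    """Collect the set of directed (parent, child) edges from a list of paths."""
    edges = set()
    for path in paths:
        edges.update(zip(path, path[1:]))
    return edges


def transitive_reduction(edges):
    """Return the transitive reduction of a DAG given as (u, v) edges.

    An edge u -> v is redundant if v is still reachable from u when that
    edge is ignored. Assumes the graph has no cycles (hypothetical helper).
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable(start, target, skip_edge):
        # Depth-first search from start that never follows skip_edge.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in succ.get(node, ()):
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(u, v) for (u, v) in edges if not reachable(u, v, (u, v))}


# The three paths mentioned in the post.
paths = [
    ["Crustacea", "Isopoda", "Asellota", "Platyprotus", "phyllosoma"],
    ["Asellota", "Munnopsididae", "Platyprotus"],
    ["Isopoda", "Munnopsididae"],
]

reduced = transitive_reduction(edges_from_paths(paths))
for parent, child in sorted(reduced):
    print(parent, "->", child)
```

For these three paths the edges Asellota -> Platyprotus and Isopoda -> Munnopsididae are dropped, leaving the single chain Crustacea -> Isopoda -> Asellota -> Munnopsididae -> Platyprotus -> phyllosoma. Checking every edge with its own search is slow for big graphs, but it keeps the idea visible.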

Tuesday, November 20, 2007


Paulo Nuin recently interviewed me for his Blind.Scientist blog. The interview is part of his SciView series.

Friday, November 16, 2007

Thesis online

One side effect of the trend towards digitising everything is that stuff you forgot about (or, perhaps, would like to forget about) comes back to haunt you. My alma mater, the University of Auckland, is digitising theses, and my PhD thesis "Panbiogeography: a cladistic approach" is now online (hdl:2292/1999). Here's the abstract:

This thesis develops a quantitative cladistic approach to panbiogeography. Algorithms for constructing and comparing area cladograms are developed and implemented in a computer program. Examples of the use of this software are described. The principal results of this thesis are: (1) The description of algorithms for implementing Nelson and Platnick's (1981) methods for constructing area cladograms. These algorithms have been incorporated into a computer program. (2) Zandee and Roos' (1987) methods based on "component-compatibility" are shown to be flawed. (3) Recent criticisms of Nelson and Platnick's methods by E. O. Wiley are rebutted. (4) A quantitative reanalysis of Hafner and Nadler's (1988) allozyme data for gophers and their parasitic lice illustrates the utility of information on timing of speciation events in interpreting apparent incongruence between host and parasite cladograms. In addition the thesis contains a survey of some current themes in biogeography, a reply to criticisms of my earlier work on track analysis, and an application of bootstrap and consensus methods to place confidence limits on estimates of cladograms.

1990. Ah, happy days...

Thursday, November 15, 2007

Phyloinformatics workshop online

Slides from the recent Phyloinformatics workshop in Edinburgh are now online at the e-Science Institute. In case the e-Science Institute site disappears I've posted the slides on slideshare.

Heiko Schmidt has also posted some photos of the proceedings, demonstrating how distraught the participants were that I couldn't make it.

Thursday, November 08, 2007

GBIF data evaluation

Interesting paper in PLoS ONE (doi:10.1371/journal.pone.0001124) on the quality of data housed in GBIF. The study looked at 630,871 georeferenced legume records in GBIF, and concluded that 84% of these records are valid. As examples of those that aren't, below is a map of legumes placed in the sea (there are no marine legumes).

Although the abstract warns of the dire consequences of data deficiencies, the conclusions make for interesting reading:

The GBIF point data are largely correct: 84% passed our conservative criteria. A serious problem is the uneven coverage of both species and areas in these data. It is possible to retrieve large numbers of accurate data points, but without appropriate adjustment these will give a misleading view of biodiversity patterns. Coverage associates negatively with species richness. There is a need to focus on databasing mega-diverse countries and biodiversity hotspots if we are to gain a balanced picture of global biodiversity. A major challenge for GBIF in the immediate future is a political one: to negotiate access to the several substantial biodiversity databases that are not yet publicly and freely available to the global science community. GBIF has taken substantial steps to achieve its goals for primary data provision, but support is needed to encourage more data providers to digitise and supply their records.