iPhylo: December 2008

Roderic D. M. Page

Friday, December 19, 2008

Code prettyfier


#include <string>

class Voila {
public:
  // Voila
  static inline const std::string VOILA = "Voila";  // in-class initialiser needs C++17 `inline`
  // will not interfere with embedded tags.
};

Thursday, December 18, 2008

Failure

Success is the ability to go from failure to failure without losing your enthusiasm -- Winston Churchill

I learnt today that my Elsevier Challenge entry didn't make the final cut. This wasn't unexpected. In the interests of "open science" (blame Paulo Nuin) here is the feedback I received from the judges:

Strengths
Beautiful presentation, lovely website. Page clearly made his case for open access to metadata/full articles in order to allow communities to build the tools they want. The judges would have enjoyed seeing more elements from the original abstract (tree of life). Great contribution so far to the discussion; Page made his point very well.

Weaknesses
Given that no specific tool was proposed, this submission is somewhat out of scope for the competition. Nonetheless, in support of his point, Page could have elaborated on the kinds of open formats and standards for text and data and figures that would support integrated community-wide tool-building. Alternatively, if the framework and the displayed functionalities were to be the submission, there could have been more discussion of how others can integrate their plug-ins and make them cross-referential to the plug-ins of others. The proposal for Linked Data should utilize Semantic Web standards.

Elements to Consider for Development
How many, and which types of, information substrates? How much work for a new developer to create a new one, and to make this work? How to incentivize authors to produce the required metadata? Or to make the data formats uniform?

I think this is a pretty fair evaluation of my entry. I was making a case for what could be done, rather than providing a specific bit of kit that could make this happen right now. I think I was also a little guilty of not following the "underpromise but overdeliver" mantra. My original proposal included harvesting phylogenies from images, and that proved too difficult to do in the time available. I don't think having trees would have ultimately changed the result (i.e., not making the cut), but it would have been cool to have them.

Anyway, time to stomp around the house a bit, and be generally grumpy towards innocent children and pets. Congratulations to the ~~bastards~~ fellow contestants who made it to the next round.

Tuesday, December 16, 2008

Reading books

One advantage of flying to the US is the chance to do some reading. At Newark (EWR) I picked up Guy Kawasaki's "Reality Check", which is a fun read. You can get a flavour of the book from this presentation Guy gave in 2006.

While at MIT for the Elsevier Challenge I was browsing in the MIT book shop and stumbled across "Google and the Myth of Universal Knowledge" by Frenchman Jean-Noël Jeanneney. It's, um, very French. I have some sympathy with his argument, but ultimately it comes across as European whining about American success. And the proposed solution involves that classic European solution -- committees! In many ways it's really a librarian complaining about Google (again), which librarians just need to get over.

OK, I'm not really doing the arguments justice, but I'm getting a little tired of European efforts that are essentially motivated by "well the Americans are doing this, so we need to do something as well."

Lastly, I also bought Linda Hill's "Georeferencing: The Geographic Associations of Information", which is a little out of date (what, no Google Maps or Google Earth?), but is nevertheless an interesting read, and has lots of references to georeferencing in biodiversity informatics. Given that my efforts for the challenge in this area were so crude, it's something I need to think about a bit more deeply.

Now, if I can just find my gate...

Elsevier Challenge


Quick post about the Elsevier Challenge, which took place yesterday in the wonderful Stata Center at MIT. It was a great experience. Cool venue, interesting talks, probing questions (having a panel of judges ensured that everybody got feedback/queries). Some talks (like mine) were more aspirational (demos of what could be done), others, such as Sean O'Donoghue's talk on Reflect, and Stephen Wan's on CSBIS (see "In-Browser Summarisation: Generating Elaborative Summaries Biased Towards the Reading Context") were systems that Elsevier could plug in to their existing Science Direct product (and hence are my picks to go forward to the last round).

I was typically blunt in my talk, especially about how useless Science Direct's "2collab" and "Related articles" features were. Rafael Sidi is not unsympathetic to this, and I think despite their status as the Microsoft of publishing (for the XBox crowd, that's a Bad Thing™), the Elsevier people at the meeting were genuinely interested in changing things, and exploring how best to disseminate knowledge. There's hope for them yet! Oh, and special thanks to Anita de Waard and Noelle Gracy for organising the meeting, and for the smooth running of the Challenge.

Death throes of Cladistics

I'm in the US on UK time, so it is probably a bad idea to write this, but the paper by Malte Ebach et al. ("O Cladistics, Where Art Thou?", doi:10.1111/j.1096-0031.2008.00225.x) in the latest Cladistics just annoys me too much. Rather than the call to arms that the authors intend, I think they've provided one more example of the death throes of cladistics (in the narrow "parsimony is all, statistical methods are evil, molecular systematics is phenetics, barcoding is killing taxonomy" sense).

Associations, such as the Willi Hennig society, and journals, such as Cladistics, were erected in order to tackle the growing problem of pheneticists, purveyors of overall similarity, clustering and divergence rates. Rather than challenge molecular systematists and their numerical taxonomic methods, we take part. Where is our integrity?

Gosh, maybe people realised that molecular data are useful, that molecular data benefit from statistical analysis, and that divergence rates (and times) were of great biological interest? Fancy that!

What happened to the Cladistic Revolution? Today, students appear to have no knowledge of that Revolution. They graduate as students did so before the Revolution, with a sound knowledge of phenetics, ancestor worship and a healthy dose of molecular genetics. What happened to taxonomy and cladistics?

I suspect the real drivers in the "Revolution" were: the development of methods that could be implemented in computer software (I include parsimony in this); computer hardware that was becoming cheaper and more powerful; and the growth of molecular data (i.e., data that was easily digitised). I don't mean to imply that everything was technologically driven, but I suspect it was a combination of a desire to infer evolutionary trees coupled with plausible means of doing so that drove the "revolution", rather than any great conceptual framework.

Such matters as the Phylocode, DNA taxonomy and barcoding, for example, have risen to prominence despite criticism of their many flaws and illogical conclusions. The attempts of these applied technologies to derail almost 250 years of scholarship are barely even questioned by our own peers with only a few taking a stand (e.g., Will and Rubinoff, 2004; Wheeler, 2005).

Barcoding is happening, get over it. There are technical issues with its ability to identify "species", but to object to it on ideological grounds (as papers published in Cladistics tend to do) is ultimately futile. If the authors dealt with bacteria they wouldn't bat an eyelid. Besides, I suspect that the ability to identify organisms, or discover clusters of similar sequences will be among the least interesting applications of barcoding. There will be a wealth of standardised, geotagged data from across life around the planet. People not blinkered by ideology will do interesting things with these data.

Barcoding is understood as a "solution" (to what, one might ask?), systematics journals are infested with phenetics and population genetics (cladistics has vanished), both, seemingly, directing the course and future of taxonomy. Where are the scholars?

Personally I use the term "phenetics" as a litmus test. If anybody says that a method is "phenetic" then I pretty much switch off. Almost always, if somebody uses this term they simply don't understand what they are talking about. If you describe a method as "phenetic" then that tells me that you either don't understand the method, or you're too lazy to try and understand it.

In some ways all this saddens me. I was an undergraduate student around the time of the heyday of the New York school, thought Systematics and Biogeography: Cladistics and Vicariance was a great (if flawed) book (and I still do), and did my first post doc with Gary Nelson at the AMNH. It was a great time to be a student. Phylogenetic trees were appearing in all sorts of places, and systematists were tackling big topics such as biogeography, diversification, coevolution, and development. There was a sense of ambition, and excitement. Yet now it seems that Cladistics has become a venue for reactionary rants by people unable to break out of the comforting (but ultimately crippling) coherence of the hard-core cladist's world view.

Saturday, December 13, 2008

EOL hyperbole

The latest post on the EOL blog (Biodiversity in a rapidly changing world) really, really annoys me. It claims that

The case of the red lionfish exemplfies how EOL can provide information for science-based decision making. Red lionfish are native to coral reef ecosystems in the Indo-Pacific. Yet, probably due to human release of the fish from aquariums, a large population has found itself in the waters near the Bahamas.

Nope, I suggest it demonstrates just how limited EOL is. If I view the page for the red lionfish I get an out of date map from GBIF that shows a very limited distribution, and doesn't show the introductions in Florida and the Bahamas (I have to wade through text to find reference to the Florida introduction, and the page doesn't mention the Bahamas!). The blog entry states that

In this senerio[sic], EOL and its data partners provide up to date information about the lionfish, or pterois[sic] volitans, in a species page.

Well, the GBIF map is old (a more recent map is available from GBIF itself), the bibliography omits key references such as "Biological invasion of the Indo-Pacific lionfish Pterois volitans along the Atlantic coast of North America" (useful reading for a "science-based decision", one would think). Most of this information I got from Wikipedia, GBIF, and Google Scholar via an iSpecies search.

In other words, EOL in its present state is serving limited, out of date information. The gap between hype and delivery shows no sign of narrowing. How can this help "science-based decision making"? Surely there will come a point when people will tire of breathless statements about how EOL will be useful, and they will start to ask "where's the beef?"

Thursday, December 11, 2008

Yes We Can - "scientists are the ultimate remixers"

The Science Commons has released a short video by Jesse Dylan, who made the Yes We Can video.

Tuesday, December 09, 2008

Stack Overflow

Quick note to say how much I like the programmers' Q & A site Stack Overflow. I've only asked two questions, but the responses have been rapid and useful.

I found out about Stack Overflow by listening to the Stack Overflow podcast episodes on IT Conversations (which carries a lot of other podcasts as well). For a wannabe geek, these podcasts are a great source of ideas.

Table lens view of data matrix

Among the many weaknesses of my challenge demo is the way it simply dumps out a list of sequences (see comments on the demo). I decided to take a look at table lens after reading BiblioViz: a system for visualizing bibliography information -- see also Rao and Card's 1994 paper (doi:10.1145/191666.191776, there is a free PDF on Ramana Rao's web site), and DateLens (another product of the University of Maryland's Human-Computer Interaction Lab, who also gave us treemaps). I've hacked together some crude Javascript and CSS, taking some suggestions on Stack Overflow as a starting point (seems to work in Safari and Firefox, doesn't in IE6).

The idea is to display a table in a fixed space. As you mouse over a cell, the contents of the cell, and the relevant row and column labels, become visible. This enables you to get an overview of the full table, but still see individual items.
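To make the fixed-space idea concrete, here is a minimal sketch of the layout arithmetic in Python (not the Javascript and CSS used in the demo). It assumes the focus+context variant of table lens, in which the row and column under the mouse are enlarged and every other row and column shrinks to share what is left. The function names (`lens_sizes`, `lens_layout`) and the default pixel sizes are hypothetical, and the demo itself may simply overlay the cell contents rather than resize anything.

```python
def lens_sizes(count, total, focus=None, focus_size=60.0):
    """Return one size (in pixels) for each of `count` rows or columns.

    With no focus, the `total` space is shared equally. With a focus index,
    that row or column gets `focus_size` (never less than an equal share,
    never more than `total`) and the others share the remainder.
    """
    if count <= 0:
        return []
    if focus is None or count == 1:
        return [total / count] * count
    if not 0 <= focus < count:
        raise ValueError("focus index out of range")
    focus_size = min(max(focus_size, total / count), total)
    rest = (total - focus_size) / (count - 1)
    return [focus_size if i == focus else rest for i in range(count)]


def lens_layout(n_rows, n_cols, width, height, focus_cell=None):
    """Compute row heights and column widths for a table in a fixed box.

    `focus_cell` is a (row, column) pair for the cell under the mouse,
    or None when the mouse is outside the table.
    """
    focus_row, focus_col = focus_cell if focus_cell else (None, None)
    # Hypothetical sizes: tall enough for one line of text,
    # wide enough for a short label.
    row_heights = lens_sizes(n_rows, height, focus_row, focus_size=20.0)
    col_widths = lens_sizes(n_cols, width, focus_col, focus_size=120.0)
    return row_heights, col_widths


# 200 taxa by 10 genes squeezed into a 400 x 300 pixel box,
# with the mouse over row 3, column 1.
heights, widths = lens_layout(200, 10, 400.0, 300.0, focus_cell=(3, 1))
```

However the focus moves, the sizes always add up to the size of the box, which is what lets the whole table stay on screen.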

It's easier to show than explain. For example, take a look at The amphibian tree of life, or watch this short screencast:

There are some things to fix. Firstly, I group all sequences by NCBI taxon and gene "features". If there's more than one sequence for the same gene and taxon, I just show one of them (an obvious solution is to add a popup menu if there's more than one sequence). Secondly, the gene "names" are extracted from GenBank feature tables, and will include synonyms and duplicates (for example, a sequence may have a gene feature "RAG-1" and a CDS feature "recombination activating protein 1"). I've stored all of these as not every sequence is consistently labelled, so excluding one class of feature may lose all labels from a sequence. At some point it would be useful to cluster gene names (a task for another day).
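The grouping described above can be sketched in a few lines of Python. This is an illustration only, not the code behind the demo: the record layout (`accession`, `taxon`, `features`), the function names, and the accession and taxon values are all made-up placeholders. Keeping every accession for a cell, rather than just the first, is what would make a popup menu possible later.

```python
from collections import defaultdict


def build_matrix(sequences):
    """Group sequence accessions by (taxon, gene label).

    Every feature label on a sequence is kept, synonyms included, so a
    sequence with both a gene and a CDS label appears in two columns.
    """
    matrix = defaultdict(list)
    for seq in sequences:
        for label in set(seq["features"]):  # ignore repeated labels on one sequence
            matrix[(seq["taxon"], label)].append(seq["accession"])
    return dict(matrix)


def cell_summary(matrix, taxon, label):
    """Return (accession to display, number of further accessions hidden)."""
    accessions = matrix.get((taxon, label), [])
    if not accessions:
        return None, 0
    return accessions[0], len(accessions) - 1


# Placeholder records: the accessions and taxon ids are invented.
sequences = [
    {"accession": "SEQ001", "taxon": 101,
     "features": ["RAG-1", "recombination activating protein 1"]},
    {"accession": "SEQ002", "taxon": 101, "features": ["RAG-1"]},
    {"accession": "SEQ003", "taxon": 102,
     "features": ["recombination activating protein 1"]},
]
matrix = build_matrix(sequences)
shown, hidden = cell_summary(matrix, 101, "RAG-1")
```

Here taxon 102 has no entry under "RAG-1" even though it has the same gene under its CDS name, which is the synonym problem that clustering gene names would address.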