iPhylo: September 2007

Roderic D. M. Page

Thursday, September 27, 2007

Mesquite does Google Earth files

The latest version of the David and Wayne Maddison's Cartographer module for their program Mesquite can export KML files for Google Earth. They graciously acknowledge my crude efforts in this direction, and Bill Piel's work -- he really started this whole thing rolling.

So, those of you inspired to try your hand at Google Earth trees, and who were frustrated by the lack of tools should grab a copy of Mesquite and take it for a spin.

Wednesday, September 19, 2007

Parallels

Quick note to say how much fun it is to use Parallels Desktop. It's a great advantage to have Windows XP and Fedora Core 7 running on my Mac. As much as I dislike Internet Explorer, it caught some bugs in my code. It's always useful to try different environments when debugging code, either stand alone or for the Web.

Tuesday, September 18, 2007

Nature Precedings

Nature precedings is pre-publication server launched by Nature a few months ago. To quote from the website:

Nature Precedings is a place for researchers to share pre-publication research, unpublished manuscripts, presentations, posters, white papers, technical papers, supplementary findings, and other scientific documents. Submissions are screened by our professional curation team for relevance and quality, but are not subjected to peer review. We welcome high-quality contributions from biology, medicine (except clinical trials), chemistry and the earth sciences.

Unable to resist, I've uploaded three manuscripts previously languishing as "Technical Reports" on my old server. The three I uploaded now have bright shiny DOIs, which may take a little while to register with CrossRef. The manuscripts are:

Treemap Versus BPA (Again): A Response to Dowling doi:10.1038/npre.2007.1030.1 (a response to a critique of my ancient TreeMap program).

On The Dangers Of Aligning RNA Sequences Using "Conserved" Motifs doi:10.1038/npre.2007.1029.1 (a short note on Hickson et al.'s (2000) use of conserved motifs to evaulate RNA alignment).

Towards a Taxonomically Intelligent Phylogenetic Database doi:10.1038/npre.2007.1028.1 (a paper written for DBiBD 2005, basically a rewrite of a grant proposal).

All three are under the evolution and ecology subject heading. Visitors to Nature precedings can comment on papers, and vote for which ones they like. The fact that I've uploaded some manuscripts probably says nothing good about me, but I'll be checking every so often to see if anybody has anything to say...

Tuesday, September 11, 2007

Matching names in phylogeny data files

In an earlier post I described the TBMap database (doi:10.1186/1471-2105-8-158), which contains a mapping of TreeBASE taxon names onto names in other databases. While this is one step towards making it easier to query TreeBASE, what I'd really like is to link the data in TreeBASE to sources such as GenBank and specimen databases. Part of the challenge of doing this (and also doing it more generally, such as taking a NEXUS file from the web and using that) is that the names people use in a NEXUS data file are often not the actual taxonomic names. So, if I take a table from a paper that lists GenBank accession numbers, voucher specimens, etc., I'm left with the problem of matching two sets of names -- those in the data file to those in the table.

For example, this is a TreeBASE taxon Apomys_sp._B_145699. Using a script I grabbed the sequences in this study and constructed a table listing the name and specimen voucher for each sequence. The name corresponding to the TreeBASE taxon is Apomys "sibuyan b" FMNH 145699. Clearly a similar string, but not the same.

The approach I've taken is to compare strings using the longest common subsequence algorithm. Below are the two strings, with their longest common subsequence highlighted in green.

Apomys_sp._B_145699

Apomys "sibuyan b" FMNH 145699

The length of this subsequence is used to compute a measure of distance between the two strings. If len1 is the length of the first string, len2 is the length of the second string, and lcs is the length of the longest common subsequence, then

d = (len1 + len2) - 2 × lcs

We can normalise this by dividing by len1 + len2, so that d ranges from 0 (identical) to 1.0 (no similarity).

So, now we have a measure of how similar the strings are, I still need to find the set of matchings between file and table names. This can be modelled an example of a maximum weight bipartite matching problem. Given a bipartite graph, we want to find the matching with the highest weight. A "matching" is where each node is connected to just one other node. For example, given this graph:

a maximum weight matching is:

In this example, four pairs of nodes are matched, one is "orphaned". Applying this to the data matching problem, I take a list of names form the NEXUS file, and a list of names from the paper (or supplementary data file, etc.) and compute a maximum weight matching. Because I'm looking for the maximum weighted matching I actually want similarity of names, hence I subtract the normalised d from 1.0.

So, the algorithm consists of taking the two lists of names (taxa from dataset and names from a table), computing the distance between all pairs of names, then obtain the maximum weight bipartite matching. Because for large sets of names the n × m the bipartite graph becomes huge, and because in practice most matches are poor, for each node in the graph I only draw the edges corresponding to the five most similar pairs of strings.