Wednesday, November 28, 2007

Transitive reduction

Quick note to self, having stumbled on the Wikipedia page on transitive reduction. Given a graph like this:

the transitive reduction is:

Note that the original graph has an edge a -> d, but this is absent after the reduction because we can get from a to d via b (or c).


What's the point? Well, it occurs to me that a quick way of harvesting information about existing taxonomies (e.g., if we want to assemble an all-embracing classification of life to help navigate a database) is to make use of the titles of taxonomic papers. For example, the title Platyprotus phyllosoma, gen. nov., sp. nov., from Enderby Land, Antarctica, an unusual munnopsidid without natatory pereopods (Crustacea: Isopoda: Asellota) gives us:

Crustacea -> Isopoda -> Asellota -> Platyprotus -> phyllosoma
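
As a sketch of how the harvesting might work, here's a little Python that pulls the parenthetical higher classification, plus the genus and species, out of a title and turns the path into a list of edges (the regular expressions are my guesses at how such titles are formatted, and will certainly miss many variants):

import re

def edges_from_title(title):
    """Extract a classification path from a taxonomic paper title of the
    form 'Genus species, ... (Higher: Lower: Lowest)' and return it as
    a list of parent -> child edges."""
    binomial = re.match(r"([A-Z][a-z]+) ([a-z]+)", title)
    higher = re.search(r"\(([A-Z][a-z]+(?::\s*[A-Z][a-z]+)+)\)", title)
    if not (binomial and higher):
        return []
    path = [t.strip() for t in higher.group(1).split(":")]
    path += [binomial.group(1), binomial.group(2)]
    return list(zip(path, path[1:]))  # consecutive pairs are edges

title = ("Platyprotus phyllosoma, gen. nov., sp. nov., from Enderby Land, "
         "Antarctica, an unusual munnopsidid without natatory pereopods "
         "(Crustacea: Isopoda: Asellota)")
print(edges_from_title(title))
# [('Crustacea', 'Isopoda'), ('Isopoda', 'Asellota'),
#  ('Asellota', 'Platyprotus'), ('Platyprotus', 'phyllosoma')]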

From the paper and/or other sources we can get paths such as Asellota -> Munnopsididae -> Platyprotus and Isopoda -> Munnopsididae. Imagine that we have a set of these paths, and want to assemble a classification (for example, we want to grow the Species 2000 classification, which lacks this isopod). Here's the graph:


This clearly gives us information on the classification of the isopod, but it's not a hierarchy. The transitive reduction, however, is:



It would be fun to explore using this technique to mine taxonomic papers and automate the extraction of classifications, as well as names.
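
Here's a minimal Python sketch of the reduction itself, using the simple (if inefficient) rule that an edge u -> v is redundant when v can still be reached from u by a longer path; it assumes the input graph is acyclic, which harvested classification paths should be:

from collections import defaultdict

def transitive_reduction(edges):
    """Drop every edge (u, v) where v remains reachable from u
    via some longer path. Assumes the graph is a DAG."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)

    def reachable(start, target, skip_edge):
        # Depth-first search that ignores the edge under test.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return [(u, v) for u, v in edges if not reachable(u, v, (u, v))]

# The paths harvested above for the munnopsidid isopod:
edges = [("Crustacea", "Isopoda"), ("Isopoda", "Asellota"),
         ("Asellota", "Platyprotus"), ("Platyprotus", "phyllosoma"),
         ("Asellota", "Munnopsididae"), ("Munnopsididae", "Platyprotus"),
         ("Isopoda", "Munnopsididae")]
print(transitive_reduction(edges))
# Asellota -> Platyprotus and Isopoda -> Munnopsididae drop out, leaving
# Crustacea -> Isopoda -> Asellota -> Munnopsididae -> Platyprotus -> phyllosoma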

Tuesday, November 20, 2007

Interview

Paulo Nuin recently interviewed me for his Blind.Scientist blog. The interview is part of his SciView series.

Friday, November 16, 2007

Thesis online


One side effect of the trend towards digitising everything is that stuff one forgot about (or, perhaps, would like to forget about) comes back to haunt you. My alma mater, the University of Auckland, is digitising theses, and my PhD thesis "Panbiogeography: a cladistic approach" is now online (hdl:2292/1999). Here's the abstract:

This thesis develops a quantitative cladistic approach to panbiogeography. Algorithms for constructing and comparing area cladograms are developed and implemented in a computer program. Examples of the use of this software are described. The principle results of this thesis are: (1) The description of algorithms for implementing Nelson and Platnick's (1981) methods for constructing area cladograms. These algorithms have been incorporated into a computer program. (2) Zandee and Roos' (1987) methods based on "component-compatibility" are shown to be flawed. (3) Recent criticisms of Nelson and Platnick's methods by E. O. Wiley are rebutted. (4) A quantitative reanalysis of Hafner and Nadler's (1988) allozyme data for gophers and their parasitic lice illustrates the utility of information on timing of speciation events in interpreting apparent incongruence between host and parasite cladograms. In addition the thesis contains a survey of some current themes in biogeography, a reply to criticisms of my earlier work on track analysis, and an application of bootstrap and consensus methods to place confidence limits on estimates of cladograms.

1990. Ah, happy days...

Thursday, November 15, 2007

Phyloinformatics workshop online


Slides from the recent Phyloinformatics workshop in Edinburgh are now online at the e-Science Institute. In case the e-Science Institute site disappears, I've posted the slides on SlideShare.


Heiko Schmidt has also posted some photos of the proceedings, demonstrating how distraught the participants were that I couldn't make it.

Thursday, November 08, 2007

GBIF data evaluation


Interesting paper in PLoS ONE (doi:10.1371/journal.pone.0001124) on the quality of data housed in GBIF. The study looked at 630,871 georeferenced legume records in GBIF, and concluded that 84% of these records are valid. As an example of records that aren't valid, below is a map of legumes placed in the sea (there are no marine legumes).

Although the abstract warns of the dire consequences of data deficiencies, the conclusions make for interesting reading:

The GBIF point data are largely correct: 84% passed our conservative criteria. A serious problem is the uneven coverage of both species and areas in these data. It is possible to retrieve large numbers of accurate data points, but without appropriate adjustment these will give a misleading view of biodiversity patterns. Coverage associates negatively with species richness. There is a need to focus on databasing mega-diverse countries and biodiversity hotspots if we are to gain a balanced picture of global biodiversity. A major challenge for GBIF in the immediate future is a political one: to negotiate access to the several substantial biodiversity databases that are not yet publicly and freely available to the global science community. GBIF has taken substantial steps to achieve its goals for primary data provision, but support is needed to encourage more data providers to digitise and supply their records.

Wednesday, October 31, 2007

Amber spider

Really just a shameless attempt to get one over on David Shorthouse, but there has been some buzz about Very High Resolution X-Ray Computed Tomography (VHR-CT) of a fossil of Cenotextricella simon.


The paper describing the work is in Zootaxa (link here). Zootaxa is doing great things for taxonomic publishing, but they really need to set up some sort of stable identifier. Linking to Zootaxa articles is not straightforward. If they had DOIs (or even OpenURL access) it would be much easier for people to convert lists of papers that include Zootaxa publications into lists of resolvable links.

Sunday, October 28, 2007

Universal Serial Item Names

Following on from the discussion of BHL and DOIs, I stumbled across some remarkable work by Robert Cameron at SFU. Cameron has developed Universal Serial Item Names (USIN). The approach is spelled out in detail in Towards Universal Serial Item Names (also on Scribd). This lengthy document deals with how to develop user-friendly identifiers for journal articles, books, and other documents. The solution looks less baroque than SICIs, which I've discussed earlier.

There is also a web site (USIN.org), complete with examples and source code. Identifiers for books are straightforward, for instance bibp:ISBN/0-86542-889-1 identifies a certain book:

For journals things are slightly more complicated. However, Cameron simplified things a little in his subsequent paper Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention (also on Scribd).
JACC (Journal Article Citation Convention) is proposed as an alternative to SICI (Serial Item and Contribution Identifier) as a convention for specifying journal articles in DOI (Digital Object Identifier) suffixes. JACC is intended to provide a very simple tool for scholars to easily create Web links to DOIs and to also support interoperability between legacy article citation systems and DOI-based services. The simplicity of JACC in comparison to SICI should be a boon both to the scholar and to the implementor of DOI responders.

USIN and JACC use the minimal number of elements needed to identify an article, such as journal code (e.g., ISSN or an accepted acronym), volume number, and starting page. Using ISSNs ensures globally unique identifiers for journals, but the scheme can also use acronyms, hence journals that lack ISSNs can be catered for. The scheme is simple, and in many cases will provide the bare minimum of information necessary to locate an item via an OpenURL resolver. Indeed, one simple way to implement USIN identifiers would be to have a service that takes URIs of the form <journal-code>:<volume>@<page> and resolves them behind the scenes using OpenURL. Hence we get simple identifiers that are resolvable, without the baroque approach of SICIs.
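
As a rough Python sketch, a resolver along these lines only has to parse the identifier and build an OpenURL query (the resolver address below is a placeholder, and the key/value names follow the basic OpenURL conventions for articles):

import re
import urllib.parse

def usin_to_openurl(usin, resolver="http://resolver.example.org/openurl"):
    """Turn a <journal-code>:<volume>@<page> identifier into an
    OpenURL query that a link resolver could answer."""
    m = re.match(r"^(?P<journal>.+):(?P<volume>\d+)@(?P<page>\d+)$", usin)
    if m is None:
        raise ValueError("not a <journal-code>:<volume>@<page> identifier")
    journal = m.group("journal")
    # ISSNs look like 0018-8158; treat anything else as a journal acronym.
    key = "issn" if re.match(r"^\d{4}-\d{3}[\dX]$", journal) else "title"
    query = {"genre": "article", key: journal,
             "volume": m.group("volume"), "spage": m.group("page")}
    return resolver + "?" + urllib.parse.urlencode(query)

# Hydrobiologia (ISSN 0018-8158), volume 418, starting page 81:
print(usin_to_openurl("0018-8158:418@81"))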

When I get the chance I may add support for something like this to bioGUID.

Saturday, October 27, 2007

Taxonomy is dead, long live taxonomy


No, not taxonomy the discipline (although I've given a talk asking this question), but taxonomy.zoology.gla.ac.uk, my long-running web server hosting such venerable software projects as TreeView, NDE, and GeneTree, along with my home page.

A series of power cuts in my building while I was away finally did for my ancient Sun SPARCstation 5, running the CERN web server (yes, it's that old). I can remember the thrill (mixed with mild terror) of taking delivery of the SPARCstation and having to assemble it manually (the CD-ROM and floppy drives came separately), and the painful introduction to the Unix command line. Then the joy of getting a web server to run (way back in late 1995), followed by Samba, AppleTalk, and CVS.

For the time being, a backup copy of the documents and software hosted on the SPARCstation is being served from a Mac. The only tricky thing was setting up the CVS server that I use for version control of my projects. Yes, I know CVS is also ancient, and that Linus Torvalds will think me a moron, but for now it's what I use. CVS comes with Apple's developer tools, but I wanted to set up remote access. I found Daniel Côté's article Setting up a CVS server on Mac OS X and the Mac OS X Hints article Enable CVS pserver on 10.2 helpful. Basically, I initialised a new CVS repository, then copied across the backed-up repository from a DVD. I then replaced some files in CVSROOT that list things like the modules in the repository and the notifications sent when code is committed. Getting the pserver up and running required some work. I created a file called cvspserver inside /etc/xinetd.d/, with the following contents.

service cvspserver
{
# turn the service on
disable = no
socket_type = stream
wait = no
user = root
server = /usr/bin/cvs
# -f ignores any .cvsrc; --allow-root names the repository pserver may serve
server_args = -f --allow-root=/usr/local/CVS pserver
groups = yes
flags = REUSE
}

Then I started the service:
sudo /sbin/service cvspserver start

So far, so good, but I couldn't log in to CVS. I discovered that this is because Mac OS X uses ShadowHash authentication_authority, so on a Mac CVS won't use the system user names and passwords (probably a good thing). Therefore, we uncomment the line
# Set this to "no" if pserver shouldn't check system users/passwords
SystemAuth=no

in the file CVSROOT/config, then create a file CVSROOT/passwd. This file contains the CVS username, the hashed password, and the actual Mac OS X username (nicely explained in Daniel Côté's article). To generate a hashed password, do this:

darwin: openssl passwd
Password: 123
Verifying - Password: 123
yrp85EUNQl01E
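
An entry in CVSROOT/passwd then looks something like this (the user names here are made up; the first field is the CVS login, the second the hash generated above, and the third the real Mac OS X account that CVS runs as):

cvsuser:yrp85EUNQl01E:rpage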

At last it all seems to work, and I can get back to coding. This is about as geeky as this blog gets, but if you want a real geek overload, spend some time listening to this talk by Linus Torvalds.

Thursday, October 25, 2007

BHL and DOIs

In a series of emails Chris Freeland, David Shorthouse, and I have been discussing DOIs in the context of the Biodiversity Heritage Library (BHL). I thought it worthwhile to capture some thoughts here.
In an email Chris wrote:
Sure, DOIs have been around for a while, but how many nomenclators or species databases record them? Few, from what I've seen - instead they record citations in traditional text form. I'm trying to find the middle ground between guys like the two of you, who want machine-readable lit (RDF), and most everyone else I talk with, including regular users of Botanicus & BHL, who want human-readable lit (PDF). I'm not overstating - it really does break down into these 2 camps (for now), with much more weight over on the PDF side (again, for now).

I think the perception that there are two "camps" is unfortunate. I guess for a working taxonomist, it would be great if for a given taxonomic name there was a way to see the original publication of the name, even if it is simply a bitmap image (such as a JPEG). Hence, a database that links names to images of text would be a useful resource. If this is what BHL is aiming for, then I agree, DOIs may seem to be of little use, apart from being one way to address the issue of persistent identifiers.

But it seems to me that there are lots of tasks for which DOIs (or more precisely, the infrastructure underlying them) can help. For example, given a bibliographic citation such as

Fiers, F. and T. M. Iliffe (2000) Nitocrellopsis texana n. sp. from central TX (U.S.A.) and N. ahaggarensis n. sp. from the central Algerian Sahara (Copepoda, Harpacticoida). Hydrobiologia, 418:81-97.

how do I find a digital version of this article? Given this citation

Fiers, F. & T. M. Iliffe (2000). Hydrobiologia, 418:81.

how do I decide that this is the same article? If I want to see whether somebody has cited this paper (and perhaps changed the name of the copepod) how do I do that? If I want to follow up the references in this paper, how do I do that?

These are the kinds of things that DOIs address. This article has the DOI doi:10.1023/A:1003892200897, which gives me a globally unique identifier for the article. The DOI Foundation provides a resolver whereby I can go to a site that will provide me with access (albeit possibly for a fee) to the article. CrossRef provides an OpenURL service whereby I can:
  • Retrieve metadata about the article, given its DOI

  • Search for a DOI, given metadata (a sketch follows below)
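
For instance, here's a rough Python sketch of the metadata lookup against CrossRef's OpenURL resolver (the e-mail address is a placeholder; CrossRef expects one registered with their free query service):

import urllib.parse
import urllib.request

def crossref_lookup(doi, email="you@example.org"):
    """Ask CrossRef's OpenURL resolver for XML metadata about a DOI."""
    query = urllib.parse.urlencode({
        "pid": email,            # registered e-mail address
        "id": "doi:" + doi,
        "noredirect": "true",    # return metadata rather than redirecting
    })
    url = "http://www.crossref.org/openurl/?" + query
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# Metadata for the copepod paper cited above:
print(crossref_lookup("10.1023/A:1003892200897"))

The reverse lookup works against the same endpoint by sending bibliographic fields (journal title, volume, starting page) instead of the DOI.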

To an end user much of this is irrelevant, but to people building the links between taxonomic names and taxonomic literature, these are pressing issues. I've previously given examples where taxonomic databases such as Catalogue of Life and ITIS store only text citations, not identifiers (such as DOIs or Handles). As a result, the user has to search for each paper "by hand". Surely in an ideal world there would be a link to the publication? If so, how do we get there? How do IPNI, Index Fungorum, ITIS, Sp2000, ZooBank, and so on link their names and references to digitised content? This is where a CrossRef-style infrastructure comes in.

Publishers "get this". Given the nature of the web where users expect to be able follow links, CrossRef deals with the issue of converting the literature cited section of a paper into a set of clickable links. Don't we want the same thing for our databases of taxonomic names? And, don't we want this for our taxonomic literature?

It is worth noting that the perception that DOIs only cover modern literature is erroneous. For example, here's the description of Megalania prisca Owen (doi:10.1098/rstl.1859.0002), which was published in 1859. The Royal Society of London has DOIs for articles published in the 18th century.

If the Royal Society can do this, why can't BHL?

Monday, October 22, 2007

Phyloinformatics workshop - primal scream

Argh!!! The phyloinformatics workshop at Edinburgh's eScience Centre is underway (program of talks available here as an iCalendar file), and I'm stranded in Germany for personal reasons I won't bore readers with. The best and brightest gather less than an hour from my home town to talk about one of my favourite subjects, and I can't be there. Talk about frustration!

How can they possibly proceed without yours truly to interject "it sucks" at regular intervals? What, things are going just fine? Next, you'll be suggesting that Systematic Biology can function without me as editor … wait, what's that you say? Jack's running the show without a hitch … gack, I'm redundant.

Monday, October 15, 2007

Getting into Nature ... sort of


The kind people at Nature have taken pity on my rapidly fading research career, and have highlighted my note "Towards a Taxonomically Intelligent Phylogenetic Database" in Nature Precedings (doi:10.1038/npre.2007.1028.1) on the Nature web site. Frankly this is probably the only way I'll be getting into Nature...

Sunday, October 14, 2007

Pygmybrowse GBIF classification

Here is a live demo of Pygmybrowse using the Catalogue of Life classification of animals provided by GBIF. It's embedded in this post in an <iframe> tag, so you can play with it. Just click on a node.

Taxa in bold have ten or more children; the number of children is displayed in parentheses "()". Each subtree is fetched on the fly from GBIF.

Friday, October 12, 2007

Pygmybrowse revisited


As yet another example of avoiding what I should really be doing, a quick note about a reworked version of PygmyBrowse (see earlier posts here and here). Last September I put together a working demo written in PHP. I've now rewritten it entirely in Javascript, apart from a PHP script that returns information about a node in a classification. For example, this link returns details about the Animalia in ITIS.

You can view the new version live. Viewing the source in your browser will show you the code. It's mostly a matter of Javascript and CSS, with some AJAX thrown in (based on the article Dynamic HTML and XML: the XMLHttpRequest object on Apple's ADC web site).

One advantage of writing it entirely in Javascript is that it can easily be integrated into web sites that don't use PHP. As an example, David Shorthouse has inserted a version into the species pages in The Nearctic Spider Database (for example, visit the page for Agelena labyrinthica and click on the little "Browse tree" link).

Thursday, October 11, 2007

Frog deja vu

While using iSpecies to explore some names (e.g., Sooglossus sechellensis in the Frost et al. amphibian tree mentioned in the last post), I stumbled across two papers that both described a new genus of frog for the same two taxa in the Seychelles. The papers (doi:10.1111/j.1095-8312.2007.00800.x and www.connotea.org/uri/6567cfd7531a77588ee62d78e7b4359b) were published within a couple of months of each other.

Was about to blog this, when I discovered that Christopher Taylor had beaten me to it with his post Sooglossidae: Deja vu all over again. Amongst the commentary on this post is a note by Darren Naish (now here) pointing to an interesting article by Jerald D. Harris entitled ‘Published Works’ in the electronic age: recommended amendments to Articles 8 and 9 of the Code, in which he states:
I propose to the Commission that, under Article 78.3 (‘Amendments to the Code’), Articles 8 and 9 of the current Code require both pro- and retroactive (to the effective date of the Fourth Edition, 1 January 2000) modification to accommodate the following issue: documents published electronically with DOI numbers and that are followed by hard-copy printing and distribution be exempt from Article 9.8 and be recognized as valid, citable sources of zoological taxonomic information and that their electronic publication dates be considered definitive.

It's an interesting read.

Visualising very big trees Part VI

I've tidied up the big phylogeny viewer mentioned earlier, and added a simple web form for anybody interested to upload a NEXUS or Newick tree and have a play.

Examples:


To create your own tree viewer, simply go to http://linnaeus.zoology.gla.ac.uk/~rpage/bigtrees/tv2/ and upload a tree. After some debugging code and images scroll past, a link to the widget appears at the bottom of the page. I'll tidy this all up when I get the chance, but for now it's good enough to play with.

Thursday, October 04, 2007

processing PhylOData (pPOD)


The first pPOD workshop happened last month at NESCent, and some of the presentations are online on the pPOD wiki. Although I'm a "consultant" I couldn't be there, which is a pity because it looks to have been an interesting meeting. When pPOD was first announced I blogged some of my own thoughts on phylogenetics databases. The first meeting had lots of interesting stuff on workflows and data integration, as well as outlining the problems faced by large-scale systematics. Some relevant links (partly here as a reminder to myself to explore these further):

Thursday, September 27, 2007

Mesquite does Google Earth files


The latest version of David and Wayne Maddison's Cartographer module for their program Mesquite can export KML files for Google Earth. They graciously acknowledge my crude efforts in this direction, and Bill Piel's work -- he really started this whole thing rolling.

So, those of you inspired to try your hand at Google Earth trees, and who were frustrated by the lack of tools, should grab a copy of Mesquite and take it for a spin.

Wednesday, September 19, 2007

Parallels


Quick note to say how much fun it is to use Parallels Desktop. It's a great advantage to have Windows XP and Fedora Core 7 running on my Mac. As much as I dislike Internet Explorer, it caught some bugs in my code. It's always useful to try different environments when debugging code, whether stand-alone or for the Web.

Tuesday, September 18, 2007

Nature Precedings


Nature Precedings is a pre-publication server launched by Nature a few months ago. To quote from the website:
Nature Precedings is a place for researchers to share pre-publication research, unpublished manuscripts, presentations, posters, white papers, technical papers, supplementary findings, and other scientific documents. Submissions are screened by our professional curation team for relevance and quality, but are not subjected to peer review. We welcome high-quality contributions from biology, medicine (except clinical trials), chemistry and the earth sciences.

Unable to resist, I've uploaded three manuscripts previously languishing as "Technical Reports" on my old server. The three I uploaded now have bright shiny DOIs, which may take a little while to register with CrossRef. The manuscripts are:

Treemap Versus BPA (Again): A Response to Dowling doi:10.1038/npre.2007.1030.1 (a response to a critique of my ancient TreeMap program).

On The Dangers Of Aligning RNA Sequences Using "Conserved" Motifs doi:10.1038/npre.2007.1029.1 (a short note on Hickson et al.'s (2000) use of conserved motifs to evaluate RNA alignment).

Towards a Taxonomically Intelligent Phylogenetic Database doi:10.1038/npre.2007.1028.1 (a paper written for DBiBD 2005, basically a rewrite of a grant proposal).

All three are under the evolution and ecology subject heading. Visitors to Nature Precedings can comment on papers, and vote for the ones they like. The fact that I've uploaded some manuscripts probably says nothing good about me, but I'll be checking every so often to see if anybody has anything to say...

Tuesday, September 11, 2007

Matching names in phylogeny data files

In an earlier post I described the TBMap database (doi:10.1186/1471-2105-8-158), which maps TreeBASE taxon names onto names in other databases. While this is one step towards making it easier to query TreeBASE, what I'd really like is to link the data in TreeBASE to sources such as GenBank and specimen databases. Part of the challenge of doing this (and of doing it more generally, such as taking a NEXUS file from the web and using that) is that the names people use in a NEXUS data file are often not the actual taxonomic names. So, if I take a table from a paper that lists GenBank accession numbers, voucher specimens, etc., I'm left with the problem of matching two sets of names: those in the data file and those in the table.

For example, consider the TreeBASE taxon Apomys_sp._B_145699. Using a script I grabbed the sequences in this study and constructed a table listing the name and specimen voucher for each sequence. The name corresponding to the TreeBASE taxon is Apomys "sibuyan b" FMNH 145699. Clearly a similar string, but not the same.

The approach I've taken is to compare strings using the longest common subsequence algorithm. Below are the two strings; their longest common subsequence is "Apomys" followed by "s" and "145699":

Apomys_sp._B_145699

Apomys "sibuyan b" FMNH 145699

The length of this subsequence is used to compute a measure of distance between the two strings. If len1 is the length of the first string, len2 is the length of the second string, and lcs is the length of the longest common subsequence, then

d = (len1 + len2) - 2 × lcs

We can normalise this by dividing by len1 + len2, so that d ranges from 0 (identical) to 1.0 (no similarity).
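
In Python, the standard dynamic programming algorithm for the longest common subsequence and the normalised distance look like this (a straightforward sketch, quadratic in the string lengths):

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[len(a)][len(b)]

def lcs_distance(a, b):
    """Normalised distance: 0.0 for identical strings, 1.0 for no overlap."""
    d = (len(a) + len(b)) - 2 * lcs_length(a, b)
    return d / (len(a) + len(b))

print(lcs_distance("Apomys_sp._B_145699", 'Apomys "sibuyan b" FMNH 145699'))
# about 0.47: similar, but far from identical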

So, now that we have a measure of how similar two strings are, I still need to find the set of matchings between file and table names. This can be modelled as a maximum weight bipartite matching problem: given a bipartite graph with weighted edges, we want to find the matching with the highest total weight. A "matching" is a set of edges in which each node is connected to at most one other node. For example, given this graph:


a maximum weight matching is:



In this example, four pairs of nodes are matched and one node is "orphaned". Applying this to the data matching problem, I take the list of names from the NEXUS file and the list of names from the paper (or supplementary data file, etc.) and compute a maximum weight matching. Because I'm looking for the maximum weight matching, the edge weights need to reflect similarity rather than distance, so I use 1.0 minus the normalised d.

So, the algorithm consists of taking the two lists of names (taxa from the data set and names from a table), computing the distance between all pairs of names, then obtaining the maximum weight bipartite matching. Because for large sets of names the n × m bipartite graph becomes huge, and because in practice most matches are poor, for each node in the graph I only draw the edges corresponding to the five most similar pairs of strings.
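
Putting the pieces together, here's a sketch of the matching step using SciPy's assignment solver in place of a hand-rolled bipartite matcher (it builds the full weight matrix rather than pruning to the five best edges per node, which is fine for small lists; lcs_distance is the function sketched above, and the second pair of names is made up for illustration):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_names(file_names, table_names):
    """Maximum weight bipartite matching between two lists of names,
    with edge weight = 1.0 - normalised LCS distance."""
    weights = np.zeros((len(file_names), len(table_names)))
    for i, a in enumerate(file_names):
        for j, b in enumerate(table_names):
            weights[i, j] = 1.0 - lcs_distance(a, b)
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return [(file_names[i], table_names[j]) for i, j in zip(rows, cols)]

print(match_names(
    ["Apomys_sp._B_145699", "Apomys_sp._A_137713"],
    ['Apomys "sibuyan b" FMNH 145699', 'Apomys "sibuyan a" FMNH 137713'],
))
# [('Apomys_sp._B_145699', 'Apomys "sibuyan b" FMNH 145699'),
#  ('Apomys_sp._A_137713', 'Apomys "sibuyan a" FMNH 137713')]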