iPhylo: Taxonomy on a hard disk

Roderic D. M. Page

Friday, June 05, 2009

This post is likely to seem somewhat off the wall, given the rush to get everything into the cloud, but it's Friday, so let's give it a whirl.

One idea I've been toying with is dispensing with relational databases, wikis, etc. and just storing taxonomic data using files and folders on a disk. There are several reasons for this:

The file system naturally enforces hierarchy

There are existing systems for putting files and folders under version control (e.g., CVS, Subversion, git)

Native text and image editors handily beat web-based ones

Some file systems have great tools for searching on metadata (e.g., "smart folders" and Spotlight on Mac OS X)

Some of the visualisations that we would like for classifications (such as treemaps) already exist in very polished form for viewing file systems

By way of background, I've been prompted to think along these lines by David Shorthouse's observation that we could place a taxonomic hierarchy under version control (e.g., GitHub) and deal with changes/multiple versions that way. I've also been inspired by tools such as CouchDB, which is a schema-less database that one can talk to directly via HTTP. This reflects a trend where people are starting to exploit the untapped power in some basic, well-known technologies (such as HTTP), avoiding the need for lots of middleware in the process (why write stuff in Java/Ruby/PHP, etc. when HTTP GET, PUT, POST, etc. cover the bases?). Another inspiration is Dropbox, which enables replication of files across multiple machines and the web. The web interface to Dropbox is very clean, and essentially mirrors the local folder structure.

So, in some ways this probably sounds silly, and closely resembles the naive way many of us started making digital versions of taxonomies, and it will have many database people rolling their eyes and muttering about "data consistency" and "queries". But a key thing to remember is that the file system is a database that resides under a graphical user interface, and it maintains some forms of consistency that classical relational databases are poor at handling. For example, file systems enforce hierarchical consistency (if I move a folder to another folder, all the files and folders below that folder move as well). Of course, we can program this with a relational database, but our track record in doing this is pretty miserable. I've found inconsistencies in versions of ITIS (haven't checked recently), and last year's Catalogue of Life database had all sorts of orphans lurking in the tree table.
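To make the point about hierarchical consistency concrete, here is a minimal Python sketch. The folder names (FamilyA, FamilyB, GenusX, species1) and the helpers `list_tree` and `reclassify` are hypothetical, made up purely for illustration. Moving one folder carries everything beneath it along, so nothing can be left pointing at a parent that no longer exists.

```python
import shutil
import tempfile
from pathlib import Path


def list_tree(root: Path) -> list:
    """Return every path below root, relative to root, in sorted order."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


def reclassify(root: Path, taxon: str, new_parent: str) -> None:
    """Move the folder root/taxon into the folder root/new_parent.

    Both arguments are paths relative to root. Everything below the
    moved folder travels with it; no child records need updating.
    """
    destination = root / new_parent
    destination.mkdir(parents=True, exist_ok=True)
    # shutil.move places the source inside an existing destination folder.
    shutil.move(str(root / taxon), str(destination))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical classification: FamilyA/GenusX/species1
        species = root / "FamilyA" / "GenusX" / "species1"
        species.mkdir(parents=True)
        (species / "publishedIn").write_text("some reference\n", encoding="utf-8")

        reclassify(root, "FamilyA/GenusX", "FamilyB")
        for path in list_tree(root):
            print(path)
```

In a parent-ID table the equivalent operation is a single update too, but nothing stops a row from pointing at a parent that has been deleted, which is how orphans arise.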

Then there's the GUI. If I write a taxonomic database in the classical way, I need to write code to talk to the database, edit records, support user authentication, data versioning, etc. If I use the file system, I get this pretty much for free. Authentication? It's called the login screen. Versioning? I put it in a public repository like Google Code or GitHub, and that takes care of that (plus I get online authentication for free). Editing? Well, I can drag and drop items onto folders, and I can open them in native editors.

What I envisage is replicating a taxonomic hierarchy on disk, and representing key-value pairs of attributes (such as taxon name authorship, bibliographic details) as text files where the name of the file is the key (e.g., publishedIn) and the content of the file is the value (e.g., doi:10.1590/S0101-81752005000300004). I could add images and PDFs, and the neat thing is that they have lots of useful metadata embedded inside (where, arguably, it belongs).
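As a rough sketch of the key-value idea, the following Python reads and writes attributes as files. The helper names `write_attribute` and `read_attributes`, and the folder `ExampleGenus/examplespecies`, are hypothetical. It also assumes that attribute files have no file extension, so that images and PDFs sitting in the same folder are left alone.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_attribute(taxon_dir: Path, key: str, value: str) -> None:
    """Store one attribute: the file name is the key, the content is the value."""
    taxon_dir.mkdir(parents=True, exist_ok=True)
    (taxon_dir / key).write_text(value + "\n", encoding="utf-8")


def read_attributes(taxon_dir: Path) -> Dict[str, str]:
    """Read the attribute files in a taxon folder back into a dictionary.

    Assumption: attribute files have no extension, so files such as
    photo.jpg or paper.pdf are skipped rather than read as text.
    """
    return {
        entry.name: entry.read_text(encoding="utf-8").strip()
        for entry in sorted(taxon_dir.iterdir())
        if entry.is_file() and entry.suffix == ""
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical taxon folder, for illustration only.
        taxon = Path(tmp) / "ExampleGenus" / "examplespecies"
        write_attribute(taxon, "publishedIn", "doi:10.1590/S0101-81752005000300004")
        print(read_attributes(taxon))
```

Because each value is an ordinary text file, it can be edited in any text editor and diffed line by line under version control.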

I'm also toying with the idea of using symbolic links (Windows users, look away now) to represent relationships such as basionym links to original names.

This is all a bit half-baked at present, but it seems worth pursuing. One could argue that having a full taxonomic hierarchy is overkill (and raises the issue of which one to use), but binomial names are themselves hierarchical (species epithet nested inside genus name), so we need some degree of hierarchy anyway. I like the idea that copying a folder called "behreae" in the folder "Pinnixa" and placing the copy under "Austinixa", then within Austinixa/behreae adding a symbolic link to Pinnixa/behreae, pretty much takes care of synonymy. I also like the idea that one could download an entire taxonomy and, using just the native tools on your computer, edit and annotate it, then merge changes with a remote copy. It makes managing the data little different from writing code.
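The copy-and-link step for Pinnixa/behreae and Austinixa/behreae could be sketched in Python as follows. The function name `move_to_genus` and the link name `basionym` are my assumptions here, not a settled convention, and creating symbolic links may need extra privileges on Windows.

```python
import os
import shutil
import tempfile
from pathlib import Path


def move_to_genus(root: Path, epithet: str, old_genus: str, new_genus: str) -> Path:
    """Copy root/old_genus/epithet to root/new_genus/epithet and link back.

    The copy receives a symbolic link (hypothetically named "basionym")
    that points at the original folder, recording the synonymy.
    """
    old_dir = root / old_genus / epithet
    new_dir = root / new_genus / epithet
    new_dir.parent.mkdir(parents=True, exist_ok=True)
    # symlinks=True copies existing links as links rather than following them.
    shutil.copytree(old_dir, new_dir, symlinks=True)
    # A relative target keeps the link valid if the whole tree is relocated.
    target = os.path.relpath(old_dir, start=new_dir)
    (new_dir / "basionym").symlink_to(target, target_is_directory=True)
    return new_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "Pinnixa" / "behreae").mkdir(parents=True)
        new_dir = move_to_genus(root, "behreae", "Pinnixa", "Austinixa")
        link = new_dir / "basionym"
        print(link.name, "->", os.readlink(link))
```

Using a relative link target is a design choice: it survives a checkout of the tree into a different location, whereas an absolute path would not.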

In practice we'll want to add some things. It would be nice to have a web interface for browsing, but this could be as trivial as having a script that reads the contents of a folder, displays folders as HTML links, and lists the files (keys) and their contents (values) in the web page.
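A script along those lines might look something like this Python sketch. The function name `render_folder` is hypothetical, and as a simplifying assumption only files without an extension are treated as keys. It returns an HTML fragment; wiring it up to a web server is left out.

```python
import html
import tempfile
from pathlib import Path
from urllib.parse import quote


def render_folder(folder: Path) -> str:
    """Render one taxon folder as HTML: subfolders as links, files as key-value pairs."""
    entries = sorted(folder.iterdir(), key=lambda p: p.name)
    subfolders = [p for p in entries if p.is_dir()]
    # Assumption: attribute files have no extension (images, PDFs are skipped).
    attributes = [p for p in entries if p.is_file() and p.suffix == ""]

    lines = ["<h1>" + html.escape(folder.name) + "</h1>", "<ul>"]
    for sub in subfolders:
        href = quote(sub.name) + "/"
        lines.append('<li><a href="' + href + '">' + html.escape(sub.name) + "</a></li>")
    lines.append("</ul>")

    lines.append("<dl>")
    for attr in attributes:
        value = attr.read_text(encoding="utf-8").strip()
        lines.append("<dt>" + html.escape(attr.name) + "</dt><dd>" + html.escape(value) + "</dd>")
    lines.append("</dl>")
    return "\n".join(lines)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        genus = Path(tmp) / "Pinnixa"
        (genus / "behreae").mkdir(parents=True)
        (genus / "rank").write_text("genus\n", encoding="utf-8")
        print(render_folder(genus))
```

Note that `is_dir()` follows symbolic links, so a link to another taxon folder would show up as a browsable link as well.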

Perhaps this is a little silly, but I like the idea of having data on my machine that is trivially easy to edit. I also like the idea of getting functionality for free, rather than having to invent it from scratch.