Monday, February 02, 2009

Wiki modelling - Part 2

Modelling taxa is a bit trickier. I've sketched my ideas for distinguishing name strings and taxonomic names earlier. That's the easy stuff. What about "taxonomic concepts" and "OTUs"? As a first pass, I'm looking at linking taxon names to classifications via GUIDs. If a taxon appears in a classification then the GUID of the corresponding node in the classification is an attribute of the taxon name, and each classification GUID (representing a node in a classification) corresponds to a page in the Wiki.



I claim no originality for this; it's very close to how uBio models classification concepts.



The trick here is going to be ensuring that I can do sensible queries, such as linking a node in a classification to alternative names.
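As a rough illustration of the linking described above, here is a minimal sketch in Python. The class names, field names, GUID strings and taxon names are all hypothetical, invented for this example; this is not uBio's schema, nor the wiki's actual data model. It only shows that if each taxon name carries the GUIDs of the classification nodes it appears as, then finding alternative names for a node is a simple lookup.

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationNode:
    """One node in one classification; each would get its own wiki page."""
    guid: str            # hypothetical GUID format
    classification: str  # which classification the node belongs to
    label: str           # name used for the node in that classification


@dataclass
class TaxonName:
    """A taxonomic name carrying the GUIDs of the nodes it appears as."""
    name: str
    node_guids: set = field(default_factory=set)


def names_for_node(guid, names):
    """Return every taxon name linked to the given classification node."""
    return sorted(n.name for n in names if guid in n.node_guids)


# Made-up example data: two names attached to the same node.
node = ClassificationNode("urn:example:node:1", "Classification A", "Aus bus")
names = [
    TaxonName("Aus bus", {"urn:example:node:1"}),
    TaxonName("Xus bus", {"urn:example:node:1"}),  # an alternative name
    TaxonName("Aus cus", {"urn:example:node:2"}),
]

alternatives = names_for_node(node.guid, names)
```

Here `alternatives` holds both names attached to the node, so the wiki page for that node could list them side by side.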

The other entity that I need to think carefully about is the OTU (Operational Taxonomic Unit). By OTUs I mean the taxa that appear in phylogenetic trees. In the TbMap project I mapped TreeBASE taxa to names in external databases, but noted that TreeBASE taxa are better thought of as OTUs:

...many taxon names in TreeBASE are best thought of as Operational Taxonomic Units (OTUs) rather than taxonomic names. They identify a set of observations for a particular specimen, set of specimens, or a taxon. For instance, "Eleutherodactylus crassidigitus FMNH257676 Panama" (TaxonID T51971) refers to a 1200 base pair stretch of mitochondrial DNA (AY273113) obtained from Field Museum Natural History specimen FMNH 257676, which has been identified as Eleutherodactylus crassidigitus. [see doi:10.1186/1471-2105-8-158]

Taxa in phylogenetic trees may be single sequences, multiple sequences (from one or more specimens), or aggregates of information from multiple taxa. The challenge is to model these in the simplest way that reflects this, but also makes queries feasible. What I'm aiming for is for the user to click on a node in a phylogeny, and be taken to a page that best corresponds to the entity in the tree, but at the same time enable queries that will list all phylogenies that contain a given taxon.
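To make the two goals concrete, here is one possible sketch in Python. Everything in it is an assumption for illustration: the `OTU` fields, the rule in `page_for` for choosing the "best" page, and the tree identifiers `tree1` and `tree2` are hypothetical. Only the example label, sequence accession and specimen code come from the TreeBASE example quoted above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OTU:
    """A tip in a tree: a label plus whatever evidence lies behind it."""
    label: str
    taxon: str             # taxon name the OTU has been mapped to
    sequences: tuple = ()  # sequence accessions, if any
    specimens: tuple = ()  # specimen codes, if any


def page_for(otu):
    """Pick the most specific page for an OTU (a hypothetical rule)."""
    if len(otu.sequences) == 1:
        return "sequence:" + otu.sequences[0]
    if len(otu.specimens) == 1:
        return "specimen:" + otu.specimens[0]
    return "taxon:" + otu.taxon


def index_by_taxon(trees):
    """Map each taxon name to the set of trees containing it."""
    index = defaultdict(set)
    for tree_id, otus in trees.items():
        for otu in otus:
            index[otu.taxon].add(tree_id)
    return index


single = OTU(
    label="Eleutherodactylus crassidigitus FMNH257676 Panama",
    taxon="Eleutherodactylus crassidigitus",
    sequences=("AY273113",),
    specimens=("FMNH 257676",),
)
# An OTU that stands for the taxon as a whole, with no single sequence.
aggregate = OTU(
    label="Eleutherodactylus crassidigitus",
    taxon="Eleutherodactylus crassidigitus",
)

trees = {"tree1": [single], "tree2": [aggregate]}  # hypothetical tree ids
index = index_by_taxon(trees)
```

Clicking on the first OTU would lead to the sequence page, clicking on the second would fall back to the taxon page, and `index` still finds both trees for the one taxon name.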

3 comments:

Mike Keesey said...

I had a similar dilemma with taxa and taxonomic units in my current project. Recently I've decided to simply say that a "taxon" is some kind of set and I don't care what the elements are (organisms, species, genes, populations, whatever). Also, I'm considering taxonomic names, OTU/HTU labels, specimen identifiers, and character descriptions to all be the same thing: ways of signifying taxa. That cut out a lot of chaff in my class schema (here if you're interested). Of course, what's chaff for my purposes may not be for yours.

Roderic Page said...

Mike,

Thanks for the links. My own (still poorly formed) view is that taxonomic names are tags, which are applied to a range of entities (such as specimens, sequences, observations, sets of taxa, etc.). Some of these entities are quite different things (e.g., it makes sense for a specimen to have a point location). Ultimately I'm more interested in the underlying data and links -- the names themselves are convenient labels.

The thing which struck me when I first started playing with relational databases some years ago is the realisation that many things I thought should be stored were actually better thought of as queries (or, more precisely, query results). I tend to view taxa in the same way -- results of queries. It's been a long day, so I'm not sure if this is making sense.

Mike Keesey said...

"I tend to view taxa in the same way -- results of queries. It's been a long day, so I'm not sure if this is making sense"

It makes sense to me from an operational viewpoint. Taxa are sets, and what's the equivalent of a set when it comes to databases? Query results.

Not sure if it's quite what you mean, but I have some posts on SQL queries yielding information about taxa here and here.

By "point location" do you mean geographical coordinates? If so, I don't see why all of those entities couldn't have coordinates. If not ... what's a point location?

Anyway, I have to admit that even in my schema, where one class represents every kind of taxon signifier, from clade to kingdom to species to specimen to character state, I do still find some need to associate them with semi-ad hoc categories (nomen, specimen, OTU, etc.). Oh well, no schema's perfect.