Monday, February 02, 2009

Wiki modelling - Part 3

I rather skirted around the notion of "taxonomic concepts" in the previous post, partly because it's easy to end up trying to have a concept for each utterance ever made by a taxonomist, and that doesn't seem, er, scalable. So, I have a more limited view of a taxonomic concept, namely a name attached to some data. For example, I think the NCBI Taxonomy provides useful taxonomic concepts, in that names are explicitly linked to data, such as sequences:

Having data means we can make inferences that have some basis, other than trying to figure out what a taxonomist "meant".

However, things start to get a little messy once I try to extract more information out of NCBI GenBank. Some time ago I pointed out the potential utility of host association records in GenBank. In some (many?) cases the host taxa won't be in GenBank, so the link will be between a DNA sequence and a taxon name. This is, of course, a simplification. It would be nice to model things more accurately. For example, a parasite will typically be obtained from a host organism, so it might be nice if, say, we had voucher specimen codes for both parasite and host, and could model the link as one between organisms (or samples of/from organisms). However, this is unlikely to be feasible in most cases, hence we have sequences linked to names:
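As a rough illustration of this simplified model, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class names `TaxonConcept` and `SequenceRecord`, the `host_links` helper, and the placeholder accessions and names are hypothetical and do not come from GenBank or from any existing library. The point is only that a taxonomic concept is a name with data attached, while a host is recorded as a bare name on the sequence record, because the host taxon may not exist in the database.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TaxonConcept:
    """A name attached to data (here, a list of sequence accessions)."""
    name: str
    taxon_id: Optional[int] = None  # database taxon identifier, if one exists
    sequences: List[str] = field(default_factory=list)


@dataclass
class SequenceRecord:
    """A sequence linked to its source taxon and, optionally, a host name."""
    accession: str
    organism: TaxonConcept
    # The host is stored as a plain name rather than a TaxonConcept,
    # because the host taxon may have no entry in the sequence database.
    host_name: Optional[str] = None


def host_links(records: List[SequenceRecord]) -> List[Tuple[str, str, str]]:
    """Return (accession, organism name, host name) for records that name a host."""
    return [
        (r.accession, r.organism.name, r.host_name)
        for r in records
        if r.host_name
    ]


# Placeholder data, invented purely for illustration.
parasite = TaxonConcept(name="Parasite sp.", taxon_id=1, sequences=["SEQ0001"])
free_living = TaxonConcept(name="Other sp.", taxon_id=2, sequences=["SEQ0002"])

records = [
    SequenceRecord("SEQ0001", parasite, host_name="Host sp."),
    SequenceRecord("SEQ0002", free_living),  # no host recorded
]

links = host_links(records)
print(links)
```

If voucher specimen codes were available for both parasite and host, the `host_name` field could be replaced by a link between two specimen objects, which is the more accurate model described above.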