Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188. Written content on this site is licensed under a Creative Commons Attribution 4.0 International license.
In the wiki examples I've been developing I've been trying to model names using the TDWG LSID vocabularies, particularly TaxonName. Roger Hyam has obviously put a huge amount of work into developing these, and they handle just about everything I need. However, I think that there's one thing missing, namely a way to express the logical relationship between the parts of a multinomial taxonomic name.
For example, consider the fish Chromis circumaurea Pyle, Earle, and Greene, 2008, described by Rich Pyle and colleagues (TED have recently posted a great video of Rich talking about discovering new species of fish). Chromis circumaurea is a species in the genus Chromis, and in the TaxonName vocabulary I can represent this relationship using the term "genusPart", which specifies the name of the genus. In a wiki page this could be a link to a page called "Chromis".
But, which "Chromis"? There are at least three:
Chromis Hübner 1819
Chromis Lacepède 1802
Chromis Cuvier, 1814
Only one of these is the fish (Chromis Cuvier, 1814). Cases of the same name being used for different organisms (homonymy) are not uncommon, so linking to strings isn't adequate to express the relationship between the two parts of the name Chromis circumaurea.
I'd alluded to this issue in my first major foray into RDF and taxonomic names (Taxonomic names, metadata, and the Semantic Web), where I proposed using the Dublin Core term "isPartOf" to link the specific epithet to the genus part. In this case, the link would be between URIs for the names Chromis circumaurea Pyle, Earle, and Greene, 2008 and Chromis Cuvier, 1814.
It's a small point, but without some means to link components of a name we're going to struggle to sensibly answer questions such as listing all the species in a given genus (or, perhaps more correctly, all the species names that have been published in a given genus).
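To make the point concrete, here is a minimal Python sketch of linking names by identifier rather than by string. The URIs are made up for illustration (they are not real LSIDs), and the links are held in a plain list rather than a real RDF store; only the Dublin Core "isPartOf" term comes from the proposal above.

```python
# Dublin Core term used to link a species name to the genus name it belongs to.
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

# Hypothetical identifiers for the names discussed above (not real LSIDs).
NAMES = {
    "urn:example:name:chromis-hubner-1819": "Chromis Hübner 1819",
    "urn:example:name:chromis-lacepede-1802": "Chromis Lacepède 1802",
    "urn:example:name:chromis-cuvier-1814": "Chromis Cuvier, 1814",
    "urn:example:name:chromis-circumaurea":
        "Chromis circumaurea Pyle, Earle, and Greene, 2008",
}

# (subject, predicate, object) statements linking names by identifier.
TRIPLES = [
    (
        "urn:example:name:chromis-circumaurea",
        DCTERMS_IS_PART_OF,
        "urn:example:name:chromis-cuvier-1814",
    ),
]


def species_names_in_genus(genus_uri, triples=TRIPLES):
    """Return identifiers of all names linked to the given genus name."""
    return [
        subject
        for subject, predicate, obj in triples
        if predicate == DCTERMS_IS_PART_OF and obj == genus_uri
    ]


for uri in species_names_in_genus("urn:example:name:chromis-cuvier-1814"):
    print(NAMES[uri])
```

Because the link points at an identifier, asking the same question of either of the other two "Chromis" names returns nothing, which is exactly the distinction a bare string cannot make.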
As part of my Quixotic attempt to construct a wiki of taxonomic names, I'm building a database of names and links. My current plan is to seed this with the NCBI taxonomy. What I want to do is flesh out the NCBI taxonomy with authorities and links to the original literature. At the moment the NCBI taxonomy is almost "nude", lacking links to the literature behind the names. As the magnificently bearded Geoffrey Bilder notes in an interview with Martin Fenner:
One way in which researchers assess the trustworthiness of content is by determining how it sits within the scholarly record. Does it provide evidence for its assertions in citations? Do other people cite it?
Given how important the NCBI taxonomy is, I think it would be a great improvement if each name could be linked to the original taxonomic publication. A first step to this is to find the taxonomic authority, the name of the author (or authors) of the name.
One potential source is uBio, which provides web services for retrieving information on names. Hence, an obvious approach is to map NCBI names to uBio names. However, if I use uBio's SOAP service I typically get multiple records for the same name. Some of these are due to homonyms (e.g., the same name used for a plant and an animal), but many are the same name with variations on the taxonomic authority. Much of this variation arises because uBio aggregates information from a wide range of databases, and each database differs in how it records the taxonomic authority.
For example, for the name "Diplura" (which I've discussed earlier) we get these names and authorities:
Diplura (Greene MS.) Allman 1864
Diplura Borner, 1904
Diplura C. L. Koch 1850
Diplura G. J. Hollenberg, 1969
Diplura Hollenb.
Diplura Jerdon 1864
Diplura Koch 1850
Diplura Koch 1851
Diplura Rambur 1866
Diplura Simon 1892
Before asking which of these names corresponds to "Diplura" in NCBI, I'd like to cluster these names into sets by merging names that are "the same." This resembles the problem of equivalent author names. The approach I'm using is to build a graph linking taxonomic authorities that are more similar than some threshold, then finding the components of that graph. For example, here is the graph for "Diplura": The nodes in the graph are the taxonomic authorities, "cleaned" by making all the text lower case, and stripping any punctuation. The edges are labelled by the length of the longest common substring shared by the nodes that edge links (I ignore substrings less than four characters long). This graph groups the variations on Diplura Koch (a spider), and Diplura Hollenberg (a brown alga, see doi:10.1111/j.1529-8817.1969.tb02617.x).
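A rough Python sketch of this clustering is given below. It is not the code behind the graph above. The similarity rule is an assumption: two authorities are linked whenever the longest common substring of their cleaned forms is at least four characters long. The sketch also drops digits when cleaning (a further assumption), so that two unrelated authors who share a year are not linked by the year alone. The input is the authority only, without the genus name.

```python
from itertools import combinations


def clean(authority):
    """Lower-case the text and replace every non-letter with a space.

    Dropping digits (years) is an assumption of this sketch.
    """
    letters = "".join(ch if ch.isalpha() else " " for ch in authority.lower())
    return " ".join(letters.split())


def longest_common_substring(a, b):
    """Length of the longest common substring, by dynamic programming."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def cluster(authorities, min_length=4):
    """Group authorities into the connected components of the similarity graph."""
    cleaned = [clean(a) for a in authorities]
    parent = list(range(len(authorities)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (merge two components) for every sufficiently similar pair.
    for i, j in combinations(range(len(authorities)), 2):
        if longest_common_substring(cleaned[i], cleaned[j]) >= min_length:
            parent[find(i)] = find(j)

    groups = {}
    for i, authority in enumerate(authorities):
        groups.setdefault(find(i), []).append(authority)
    return list(groups.values())


if __name__ == "__main__":
    diplura = [
        "(Greene MS.) Allman 1864", "Borner, 1904", "C. L. Koch 1850",
        "G. J. Hollenberg, 1969", "Hollenb.", "Jerdon 1864", "Koch 1850",
        "Koch 1851", "Rambur 1866", "Simon 1892",
    ]
    for group in cluster(diplura):
        print(group)
```

On the "Diplura" list this puts the three Koch variants in one component and the two Hollenberg variants in another, leaving the remaining authorities as singletons. A threshold this crude will merge distinct authors who share a long surname fragment, so treat it as a starting point.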
Not surprisingly, perhaps, the linkouts from the NCBI taxonomy for Diplura are a mess, with the algal genus (taxon:371965) linking to both plant and animal databases, and the insect class (taxon:29997) linking to a mix of plants and animals. Not all of the animals are insects.
I'm still playing with the underlying code, but I might try and build a web service that returns name clusters (and perhaps the graph as well).
It will be a busy day as I'm also talking at the British Library in the evening (6pm - 8:30pm), for which Sarah Kemmitt has produced a flyer, and set up a discussion forum on Nature Network. With all this effort going into the artwork, I'd better actually come up with something useful to say.
Reading a recent TAXACOM thread (Species Pages - purpose) my sense is that some people are arguing that "species pages" would be time consuming to create, aren't much good for taxonomists (to quote Mike Dallwitz "In brief, to make simplified and attractive information about taxa easily available to casual users?"), and nobody gets credit for making them. In short, "they're not for me, I don't get credit for making them, so why bother?"
Others (e.g., Doug Yanega) see species pages -- properly constructed -- to be a research tool. If we extend this to its logical conclusion, we could envisage these pages being the primary source of information on taxa. Indeed, new taxa could be described in this way. In short, "this is the future of taxonomic publication".
One obvious way to realise species pages sensu Doug Yanega is using a wiki, but then there are those that are horrified by the prospect of just "anyone" being able to edit that content. In short, "wikis are not for serious people, the ignorant might mess up my stuff". Others have had a more positive experience.
I realise this doesn't do justice to these positions, but to make things a little more concrete, I've put together a demo based on a wiki I'm constructing. The aim of this wiki is to link together taxonomic names, specimens, images, classifications, publications, phylogeny, and people in one place. It's a bit like a wiki version of my Elsevier Challenge entry.
This is some way off being ready for prime time, but I thought it might be useful to show the sort of thing that can be done.
As a starting point, http://itaxon.org/wikidev/Chromis_circumaurea is a page about Chromis circumaurea, one of the fish Rich Pyle et al. recently described in Zootaxa. This page contains a map and some specimen images, and an abbreviated description copied from the Zootaxa article. The images and the map are generated automatically by the wiki, based on the links it has to the specimens, e.g.:
What I hope this crude example demonstrates is a framework where we can support all the kinds of objects we care about, and easily create links between them that can generate useful information. For example, the page for Chromis circumaurea doesn't explicitly list the images shown, they are there because of the links between Chromis circumaurea, the specimens, and the images of those specimens. The same applies to the map. What this means is that very little information needs be entered, it's mostly a matter of joining the dots.
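As a minimal sketch of "joining the dots", assume each specimen records the taxon it was identified as and each image records the specimen it depicts. The specimen codes, file names, and coordinates below are placeholders, not real records, and this is plain Python rather than the wiki's own query language.

```python
# Hypothetical records: codes, file names, and coordinates are placeholders.
SPECIMENS = {
    "SPECIMEN-A": {"identified_as": "Chromis circumaurea", "lat_lon": (10.0, 140.0)},
    "SPECIMEN-B": {"identified_as": "Chromis circumaurea", "lat_lon": (10.5, 140.5)},
    "SPECIMEN-C": {"identified_as": "Another species", "lat_lon": (0.0, 0.0)},
}
IMAGES = {
    "image-1.jpg": {"depicts": "SPECIMEN-A"},
    "image-2.jpg": {"depicts": "SPECIMEN-C"},
}


def specimens_of(taxon):
    """Specimen codes identified as the given taxon."""
    return sorted(
        code for code, record in SPECIMENS.items()
        if record["identified_as"] == taxon
    )


def images_of(taxon):
    """Images reached by following taxon -> specimen -> image links."""
    wanted = set(specimens_of(taxon))
    return sorted(
        name for name, record in IMAGES.items() if record["depicts"] in wanted
    )


def map_points(taxon):
    """Coordinates for the taxon's map, taken from its specimens."""
    return [SPECIMENS[code]["lat_lon"] for code in specimens_of(taxon)]


print(images_of("Chromis circumaurea"))
print(map_points("Chromis circumaurea"))
```

Nothing on the taxon's own record lists images or localities; both are derived from the links, so adding a new specimen or image updates the page without anyone editing it.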
Note that these wiki pages already have more information than either iSpecies or EOL.
This example has been assembled by hand, but much of the data required can be entered automatically (e.g., for sequences, specimens, publications, etc.), and tools such as text mining or XML markup (e.g., TaxonX) could be easily exploited. I also realise that as it stands the demo has very limited information about the organism itself, but I don't see this as intractable.
If we could build things like this (and I believe we can, with a lot less effort than might be thought), the question becomes: is this the kind of "species page" that would be useful?
Another issue I'm trying to get my head around is how to deal with labels in phylogenies. These can be any number of things, such as GenBank sequences, specimen codes, taxon names, abbreviations of taxon names, laboratory codes, etc. Here's my quick attempt to model these: This sketches various levels of indirection to go from a label in a tree to a taxon name. The label may be a short form of a taxon name (one redirect to the name), it may contain a specimen code (redirect to the specimen code, then link to the name), or it may be a GenBank sequence (redirect to the accession number, then via the source to the taxon name with the corresponding NCBI taxon id). There are other cases to consider, such as synonyms, but I'll try to deal with these later. At this stage I'm looking at how to make it simple to query for all phylogenies that contain a given taxon.
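The levels of indirection can be sketched in a few lines of Python. This is only an illustration of the idea, with redirects and links collapsed into a single lookup table. The label, specimen code, and accession number come from the TbMap paper (doi:10.1186/1471-2105-8-158); the abbreviated label and the tree identifiers are hypothetical.

```python
# Each key is a page; the value is the page it redirects or links to.
LINKS = {
    # tree label -> specimen code
    "Eleutherodactylus crassidigitus FMNH257676 Panama": "FMNH 257676",
    # specimen -> taxon name it was identified as
    "FMNH 257676": "Eleutherodactylus crassidigitus",
    # GenBank accession -> taxon name of its source
    "AY273113": "Eleutherodactylus crassidigitus",
    # hypothetical abbreviated label -> taxon name
    "E. crassidigitus": "Eleutherodactylus crassidigitus",
}

# Hypothetical trees, each listed by the labels on its tips.
TREES = {
    "tree-1": ["Eleutherodactylus crassidigitus FMNH257676 Panama", "Other label"],
    "tree-2": ["AY273113"],
    "tree-3": ["Unrelated label"],
}


def resolve(label):
    """Follow links from a tree label until no further link exists."""
    seen = set()
    page = label
    while page in LINKS and page not in seen:  # 'seen' guards against loops
        seen.add(page)
        page = LINKS[page]
    return page


def phylogenies_containing(taxon_name):
    """Identifiers of all trees with at least one label resolving to the name."""
    return sorted(
        tree_id
        for tree_id, labels in TREES.items()
        if any(resolve(label) == taxon_name for label in labels)
    )


print(phylogenies_containing("Eleutherodactylus crassidigitus"))
```

A label with no entry in the table resolves to itself, which is a reasonable fallback for labels nobody has mapped yet.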
I rather skirted around the notion of "taxonomic concepts" in the previous post, partly because it's easy to end up trying to have a concept for each utterance ever made by a taxonomist, and that doesn't seem, er, scalable. So, I have a more limited view of a taxonomic concept, namely a name attached to some data. For example, I think the NCBI Taxonomy provides useful taxonomic concepts, in that names are explicitly linked to data, such as sequences: Having data means we can make inferences that have some basis, other than trying to figure out what a taxonomist "meant".
However, things start to get a little messy once I try and extract more information out of NCBI GenBank. Some time ago I pointed out the potential utility of host association records in GenBank. In some (many?) cases the host taxa won't be in GenBank, so the link will be between DNA sequence and taxon name. This is, of course, a simplification. It would be nice to model things more accurately. For example, a parasite will typically be obtained from a host organism, so it might be nice if, say, we had voucher specimen codes for both parasite and host, and could model the link as one between organisms (or samples of/from organisms). However, this is unlikely to be feasible in most cases, hence we have sequences linked to names:
Modelling taxa is a bit trickier. I've sketched my ideas for distinguishing name strings and taxonomic names earlier. That's the easy stuff. What about "taxonomic concepts" and "OTUs"? As a first pass, I'm looking at linking taxon names to classifications via GUIDs. If a taxon appears in a classification then the GUID of the corresponding node in the classification is an attribute of the taxon name, and each classification GUID (representing a node in a classification) corresponds to a page in the Wiki.
The trick here is going to be ensuring that I can do sensible queries, such as linking a node in a classification to alternative names.
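A minimal sketch of that kind of query, assuming each taxon name page stores the GUIDs of the classification nodes it appears under. The NCBI identifier and the two name strings are the algal "Diplura" from the clustering example; the second GUID is a made-up node in a hypothetical second classification.

```python
# Taxon name page -> GUIDs of the classification nodes it is attached to.
NAME_TO_NODES = {
    "Diplura G. J. Hollenberg, 1969": {"taxon:371965", "example-classification:node-1"},
    "Diplura Hollenb.": {"taxon:371965"},
    "Diplura Koch 1850": set(),  # not yet placed in any classification
}


def names_for_node(node_guid):
    """All name pages that carry the given classification GUID."""
    return sorted(
        name for name, nodes in NAME_TO_NODES.items() if node_guid in nodes
    )


def nodes_for_name(name):
    """All classification nodes a name page is attached to."""
    return sorted(NAME_TO_NODES.get(name, set()))


print(names_for_node("taxon:371965"))
print(nodes_for_name("Diplura G. J. Hollenberg, 1969"))
```

Storing the GUID on the name page keeps data entry simple, at the cost of a scan (or an index) when going from a node back to its names.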
The other entity that I need to think carefully about are OTUs (Operational Taxonomic Units). By OTUs I mean the taxa that appear in phylogenetic trees. In the TbMap project I mapped TreeBASE taxa to names in external databases, but noted that TreeBASE taxa are better thought of as OTUs:
...many taxon names in TreeBASE are best thought of as Operational Taxonomy Units (OTUs) rather than taxonomic names. They identify a set of observations for a particular specimen, set of specimens, or a taxon. For instance, "Eleutherodactylus crassidigitus FMNH257676 Panama" (TaxonID T51971) refers to a 1200 base pair stretch of mitochondrial DNA (AY273113) obtained from Field Museum Natural History specimen FMNH 257676, which has been identified as Eleutherodactylus crassidigitus. [see doi:10.1186/1471-2105-8-158]
Taxa in phylogenetic trees may be single sequences, multiple sequences (from one or more specimens), or aggregates of information from multiple taxa. The challenge is to model these in the simplest way that reflects this, but also makes queries feasible. What I'm aiming for is for the user to click on a node in a phylogeny, and be taken to a page that best corresponds to the entity in the tree, but at the same time enable queries that will list all phylogenies that contain a given taxon.
Time to make some notes. I've been playing with using Semantic MediaWiki to create a database of taxonomic names, literature, specimens, sequences, and phylogenies. One challenge is to come up with simple ways to model these entities, in a way that makes both data entry and querying as simple as possible. Some things are straightforward. For example, a publication can be modelled like this: OK, I've ignored the attributes. The diagram simply shows the use of MediaWiki REDIRECT to enable the use of standard publication GUIDs as Wiki page names (see earlier posts for more details, and a hack to deal with problem characters in DOIs). One benefit of GUID REDIRECTs is that I can refer to publications using GUIDs, and the wiki user will be taken to the article page without any fuss.
Likewise, we can model a journal like this: Again, GUIDs are REDIRECT pages. This means an article page can have the ISSN of the publication it appears in as one of its attributes, and we can then use ISSNs in our queries.
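As a small illustration, the redirect pages themselves are trivial to generate. The sketch below builds the wikitext for each GUID page using MediaWiki's standard "#REDIRECT [[...]]" syntax. The page titles and the ISSN are placeholders, the "doi:" and "issn:" title prefixes are assumptions about naming, and the sketch makes no attempt to handle problem characters in DOIs.

```python
def redirect_wikitext(target_title):
    """Wikitext for a MediaWiki page that redirects to target_title."""
    return f"#REDIRECT [[{target_title}]]"


def guid_redirects(target_title, guids):
    """Map each GUID page title to the wikitext that redirects it to the target."""
    return {guid: redirect_wikitext(target_title) for guid in guids}


pages = {}
# An article page reachable by its DOI (the article title is a placeholder).
pages.update(guid_redirects("Example article", ["doi:10.1186/1471-2105-8-158"]))
# A journal page reachable by its ISSN (placeholder ISSN).
pages.update(guid_redirects("Example journal", ["issn:0000-0000"]))

for title, text in pages.items():
    print(title, "->", text)
```

The same helper would also serve for pointing alternative spellings of an author's name at a single author page.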
People are a bit trickier, given the absence of GUIDs (or the desire to keep obvious ones, such as email addresses, private) (see doi:10.1371/journal.pcbi.1000247 for some background). I plan to have a single page for each author, and have alternative spellings link to that page:
This is one motivation for my work on equivalent author names. By finding clusters of equivalent names it would be possible to pre-populate the wiki with author names from bibliographic databases, whilst minimising the number of duplicate pages for the same author.
In conjunction with the TV show, the Wellcome Trust has launched the Interactive Tree of Life, a Flash-based view of the tree of life. There's also a blog about the project. Here's a demo of the tree:
The tree looks very nice, and a lot of work has gone into it, but I am somewhat underwhelmed. The tree itself is tiny, and does a poor job of conveying the relative diversity of life (e.g., no plants, bacteria, few arthropods, etc.). It displays the tree on a 2D plane, and the user can move relative to that plane. I'm not convinced this is the best way to display large trees. Something modelled on Perceptive Pixel's demo might be more useful. I blogged about this last year, but the video host service has disappeared. You can see the tree display 50 seconds in to the video below:
Out of curiosity I grabbed the code from the web site (a 1.5Gb file) and had a quick look. The bulk of the files are media, such as images, movies, and 3D Maya models. There's some nice stuff here. The actual tree itself is there in New Hampshire eXtended format. Here it is displayed in TreeView X: