Wednesday, November 26, 2008

Sequencing Carmen Electra


One byproduct of playing with the Challenge Demo is that I come across some rather surprising results. For example, the rather staidly titled "Cryptic speciation and paraphyly in the cosmopolitan bryozoan Electra pilosa—Impact of the Tethys closing on species evolution" (doi:10.1016/j.ympev.2007.07.016) starts to look a whole lot more interesting given the taxon treemap (right).

The girl is Carmen Electra, which is understandable given the Yahoo image search was for "Electra" (a genus of bryozoan). However, what are the wild men (and women) doing at the top? Turns out this is the result of searching for the genus Homo. But why, you ask, does a paper on bryozoans have human sequences? Well, looks like the table in the paper has incorrect GenBank accession numbers. The sequences AJ711044-50 should, I'm guessing, be AJ971044-50.

Ironically, although it was Carmen Electra's photo that initially made me wonder what was going on, it's really the hairy folks above her image that signal something is wrong. I've come across at least one other example of a paper citing an incorrect sequence, so it might be time to automate this checking. Or, what is probably going to be more fun, looking at treemaps for obviously wrong images and trying to figure out why.
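
As a rough idea of what automating the check might look like, here is a minimal Python sketch. It is not the code behind the Challenge Demo. It assumes the accessions cited in a paper have already been resolved to organism names and stored in a plain dictionary (the `organism_for` argument is a stand-in for whatever lookup is actually used). The function names `expand_accession_range` and `find_suspect_accessions` are made up for this illustration.

```python
import re


def expand_accession_range(text):
    """Expand a range such as 'AJ711044-50' into individual accessions.

    The part after the hyphen is treated as replacing the last digits
    of the starting number, so 'AJ711044-50' ends at AJ711050.
    Anything that does not look like a range is returned unchanged.
    """
    text = text.strip()
    match = re.fullmatch(r"([A-Z]+)(\d+)-(\d+)", text)
    if match is None:
        return [text]
    prefix, start, end_suffix = match.groups()
    if len(end_suffix) < len(start):
        end = start[: len(start) - len(end_suffix)] + end_suffix
    else:
        end = end_suffix
    width = len(start)
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]


def find_suspect_accessions(accessions, organism_for, expected_genus):
    """Return (accession, organism) pairs whose genus is not the expected one.

    organism_for is a hypothetical dict mapping accession -> organism name;
    accessions missing from it are skipped rather than flagged.
    """
    suspects = []
    for accession in accessions:
        organism = organism_for.get(accession)
        if organism is None:
            continue
        genus = organism.split()[0]
        if genus != expected_genus:
            suspects.append((accession, organism))
    return suspects
```

For the bryozoan paper, one would expand "AJ711044-50", look up each accession, and flag anything whose genus isn't Electra. A stricter version might compare against every taxon named in the paper rather than a single genus.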

3 comments:

Javier de la Torre said...

There are two issues in your post:

1) Incorrect GenBank accession numbers
2) Using Yahoo, Flickr or Google to find pictures for scientific names is not ideal.

Regarding the second I have the same problem when trying to generate taxonomic trees with images. I developed a little project to ask people to help me identify pictures that are actually good for representing a taxon. You can see it at:
http://biodivertido.blogspot.com/2008/10/identifying-good-images-on-google-cache.html

If you want, I can give you access to the API on top of this. You might get better results there, but it is still very immature.

I will probably work on this topic over the next few months, as I find it crucial for good taxonomic browsers in the future.

Can I also ask how you are processing the data? Are you using any map reduce algorithm?

The submission for the challenge demo looks great. I would maybe have displayed the geospatial data as a HeatMap. If the demo is about "inspiring" that could help :)

Rod Page said...

Javier,

Your tool looks nice. It would be useful to have it accept images from other sources, especially Flickr, which has images with machine tags that (presumably) are reasonably accurately identified. It would also be useful to be able to specify the names being searched for. The other issue, which I have rather glossed over, is licensing. I think there's an underlying assumption that it's OK to display thumbnails.

Re map reduce, no, nothing so fancy. I'm basically resolving identifiers, converting the resulting data to JSON, finding any identifiers within that and fleshing them out, then importing the result. Basically a linear harvest, albeit with an intermediate cache so that I can go and fix any major errors in either my harvesting, or the underlying data. I tend to do things in the simplest, most obvious way.

Matt said...

Science, so crazy! Carmen, and who can forget Ward "Terrain Cycle" Wheeler (http://iphylo.org/~rpage/challenge/www/uri/685114f223a79cb7998fc5fe5e66f161#H), Ms. Belly Dancing Rana (http://iphylo.org/~rpage/challenge/www/uri/8c7f93f2e1884d1ff8dbf3a5357288bf), or Kate "Perfect Shave" Darling (http://iphylo.org/~rpage/challenge/www/uri/d20021033b061cde9d72c67ef475dc28) and my favourite professional hockey player/bird systematist E.S. "John" Tavares (http://iphylo.org/~rpage/challenge/www/uri/1d42b733741f1892bc7daee688e6c0aa). To use a Page euphemism, guessing about identity still "sucks", and this is all without people purposely trying to bait the system. Nice start though! ;)