Tuesday, January 29, 2008

Metacrap

Time for a rant. I spend a lot of time fussing with records from sources such as GenBank and DiGIR providers, trying to extract strings that might be identifiers, with a view to linking sequences to specimens (and thus to localities), sequences to publications, publications to GUIDs, etc. The goal is to be able to link all these resources together, so that I can go seamlessly from a phylogeny in TreeBASE to a locality on a map, and vice versa, or from a phylogeny for a parasite to that of its host, for example.

Time and again I come across extraordinarily frustrating messes. More and more I see why Cory Doctorow wrote Metacrap. According to this essay, the problems with metadata are:
  1. People lie
  2. People are lazy
  3. People are stupid
  4. Mission: Impossible -- know thyself
  5. Schemas aren't neutral
  6. Metrics influence results
  7. There's more than one way to describe something

Amen. I know I've moaned about this before (e.g., Damn DiGIR, and AMNH Dspace and Openurl), but it's just amazing what people put into databases.

Co-ordinates

My current favourite is the junk that populates GenBank source records. This is where people provide information about the source of their sequence (e.g., the organism it came from, any voucher specimens, and the geographic locality of the source). All manner of junk ends up here.

You'd think that geographic location was a straightforward thing to record, but no. Consider sequence AY281248. The country field for this record is:
Australia: Gubbata, NSW (GPS: 33 38' 07'', 146  33' 12'')

Spot the problem? Leaving aside the fact that somebody has to parse this and extract the latitude and longitude (hmm, which one is which, can I rely on latitude coming first, ah the second must be longitude because latitude is never > 90°), the co-ordinates are not in Australia.



Last time I checked, Australia was in the southern hemisphere, and not drifting off the coast of Japan. Now, latitude 33°38' 07''S is in Australia:



Seems like a small deal to leave off "S" (or a minus sign), but changing hemispheres is a big deal. Then there are strings such as
Nicaragua: Rio San Juan, Near Isla de Diamante 
(ca. 15 km SE El Castillo on Rio San Juan), 10deg56'N 84deg18'W

from DQ502492. Sigh.
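To make the parsing problem concrete, here is a rough Python sketch of pulling a coordinate pair out of free-text strings like the two above. The regular expression and the function name `parse_pair` are my own illustration, not anything GenBank provides, and the "latitude comes first unless the value exceeds 90" rule is exactly the guess described earlier. Note that when no hemisphere letter is present the sign is simply unknown, so the function reports that rather than pretending.

```python
import re

# Degrees, minutes, optional seconds, optional hemisphere letter.
# Accepts both "33 38' 07''" and "10deg56'N". Illustrative only.
DMS = re.compile(
    r"""(?P<deg>\d{1,3})\s*(?:deg|°|\s)\s*
        (?P<min>\d{1,2}(?:\.\d+)?)\s*'\s*
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*(?:''|")\s*)?
        (?P<hem>[NSEW]\b)?""",
    re.VERBOSE,
)


def parse_pair(text):
    """Return (lat, lon, hemisphere_known) or None if no pair is found."""
    found = []
    for m in DMS.finditer(text):
        value = float(m["deg"]) + float(m["min"]) / 60 + float(m["sec"] or 0) / 3600
        found.append((value, m["hem"]))
    if len(found) != 2:
        return None
    (a, hem_a), (b, hem_b) = found
    if hem_a and hem_b:
        # Put the N/S value first, whatever order the author used.
        if hem_a in "EW":
            (a, hem_a), (b, hem_b) = (b, hem_b), (a, hem_a)
        if hem_a not in "NS" or hem_b not in "EW":
            return None
        return (-a if hem_a == "S" else a, -b if hem_b == "W" else b, True)
    # No hemisphere letters: a value above 90 can only be a longitude.
    if a > 90:
        a, b = b, a
    return (a, b, False)  # signs are a guess; the caller must check them


australia = parse_pair("Australia: Gubbata, NSW (GPS: 33 38' 07'', 146  33' 12'')")
nicaragua = parse_pair(
    "Nicaragua: Rio San Juan, Near Isla de Diamante "
    "(ca. 15 km SE El Castillo on Rio San Juan), 10deg56'N 84deg18'W"
)
```

For the AY281248 string the third element of the result is `False`, which is the cue to compare the point against the stated country before trusting it.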

Now, you might say that this isn't GenBank's fault, because the country field wasn't supposed to have latitude and longitude information; that is the role of the lat_lon field. The format for this field is defined as:
degrees latitude and longitude in format "d[d.dd] N|S d[dd.dd] W|E"

Fine, but people don't always follow this format. For example, DQ226041 has
/lat_lon="6 28.06'N; 58 37.16'W"

Not hugely different, but the ";" needs to be removed before parsing. What's worse, I can't assume people will follow GenBank guidelines about how to store data in these fields.
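A tolerant parser for this field might look something like the sketch below. It tries the documented decimal-degree format first and then falls back to the degrees-and-decimal-minutes variant seen in DQ226041. Both patterns and the name `parse_lat_lon` are assumptions of mine, and the fallback covers only this one deviation; other records will no doubt need more.

```python
import re

# The documented format: d[d.dd] N|S d[dd.dd] W|E
STRICT = re.compile(r"^(\d{1,2}(?:\.\d+)?) ([NS]) (\d{1,3}(?:\.\d+)?) ([WE])$")

# Degrees plus decimal minutes, with an optional ";" between the two halves,
# as in DQ226041. This pattern is a guess at one observed variant.
LOOSE = re.compile(
    r"^(\d{1,2}) (\d{1,2}(?:\.\d+)?)'\s*([NS])\s*;?\s*"
    r"(\d{1,3}) (\d{1,2}(?:\.\d+)?)'\s*([WE])$"
)


def parse_lat_lon(value):
    """Return (lat, lon) in signed decimal degrees, or None if unparseable."""
    value = value.strip()
    m = STRICT.match(value)
    if m:
        lat, ns, lon, ew = m.groups()
        lat, lon = float(lat), float(lon)
    else:
        m = LOOSE.match(value)
        if not m:
            return None
        d1, m1, ns, d2, m2, ew = m.groups()
        lat = int(d1) + float(m1) / 60
        lon = int(d2) + float(m2) / 60
    # Reject values that cannot be coordinates at all.
    if lat > 90 or lon > 180:
        return None
    return (-lat if ns == "S" else lat, -lon if ew == "W" else lon)


from_record = parse_lat_lon("6 28.06'N; 58 37.16'W")  # value from DQ226041
conforming = parse_lat_lon("6.47 N 58.62 W")  # made-up value in the documented format
```

Returning `None` for anything unrecognised keeps the junk visible instead of silently turning it into a point on a map.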

Specimen codes

What I'm really after are specimen codes, as these can potentially be linked to digital records served by natural history collections. But once again, all manner of variations end up in GenBank records. For example, specimens in the Field Museum can be referred to as Field Museum of Natural History 167358 (DQ023431) or as FMNH 167358 (AY324464), and there are other variations I've come across as well. Furthermore, FMNH 167358 isn't enough by itself to retrieve a record from the Field Museum; I also need to know the address of the DiGIR provider, and whether the specimen is a mammal or some other vertebrate (museum specimen codes are often not unique across the institution). In my opinion the single biggest failure of specimen digitisation efforts is the lack of a globally unique, resolvable identifier. Sure, it's coming, but in the meantime we have chaos.
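The first step in coping with this is to collapse the variants into one form. The sketch below does that for the two Field Museum spellings mentioned above. The alias table and the function name `normalise_specimen_code` are hypothetical, and a real table would need an entry for every spelling of every institution, which is rather the point of the complaint.

```python
import re

# Hypothetical alias table. Only the two Field Museum spellings come from
# the GenBank records discussed here; everything else would have to be added.
INSTITUTION_ALIASES = {
    "field museum of natural history": "FMNH",
    "fmnh": "FMNH",
}

# An institution name or acronym followed by a catalogue number.
CODE = re.compile(r"^\s*(?P<inst>.+?)\s+(?P<num>\d+)\s*$")


def normalise_specimen_code(text):
    """Return 'ACRONYM number' for a known institution, else None."""
    m = CODE.match(text)
    if not m:
        return None
    acronym = INSTITUTION_ALIASES.get(m["inst"].lower())
    if acronym is None:
        return None
    return f"{acronym} {m['num']}"


long_form = normalise_specimen_code("Field Museum of Natural History 167358")
short_form = normalise_specimen_code("FMNH 167358")
```

Both calls yield the same string, so the two sequences can be recognised as pointing at the same specimen. This still says nothing about which DiGIR provider or which collection to query; that has to come from elsewhere.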

Tools like GBIF show us what can be done when we aggregate lots of specimen records, but I'd argue the real fun starts when we link resources together, and we can only do this if we have stable, shared identifiers. As an example, Steppan et al. (2003) (doi:10.1111/j.1095-8312.2003.00274.x) published a study of Apomys biogeography. If I get a sequence from this study from GenBank (e.g., AY324464 mentioned above) I have a GenBank record with very few links. There's no PubMed record, for example. If I use an OpenURL resolver (such as the one I wrote for bioGUID) I can retrieve a DOI from the citation Biol. J. Linn. Soc. Lond. 80 (4), 699-715 (2003). This gives me a GUID for the paper, as well as a way to link to the text. The specimen code FMNH 167358 can be used to retrieve details from the Field Museum (once I've parsed the GenBank taxonomy string to discover that Apomys datae is a mammal). This gives me the latitude and longitude of the locality from which the specimen was collected, so I can put it on a map:



So, I've added information to the GenBank record. But this isn't really the goal. What I want to do is link these elements together. For example, here's a graph showing the link between the publication, the sequence, and the specimen:



Basically, this is the immediate neighbourhood of the sequence AY324464. If I get the neighbourhood of the specimen:



Note that this includes another sequence, DQ023431 (the "tag" refers to the uBio record for Apomys datae -- I treat taxonomic names as tags, but that's another story). I could then navigate to DQ023431, and then on to the publication that refers to that sequence, and so on. Very quickly we get a web of data. This is, of course, the vision of the Semantic Web and related projects, such as Linking Open Data. Wonderful idea, it's just such hard work getting there...
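The "neighbourhood" idea can be sketched as a small undirected graph. The links below are only the ones described in this post, and the prefixes (`genbank:`, `specimen:`, `tag:`) and function names are a labelling scheme I've made up for the example, not how the data are actually stored.

```python
from collections import defaultdict

# Undirected links between identifiers.
links = defaultdict(set)


def link(a, b):
    """Record a two-way link between two identifiers."""
    links[a].add(b)
    links[b].add(a)


link("genbank:AY324464", "doi:10.1111/j.1095-8312.2003.00274.x")
link("genbank:AY324464", "specimen:FMNH 167358")
link("genbank:DQ023431", "specimen:FMNH 167358")
link("specimen:FMNH 167358", "tag:Apomys datae")


def neighbourhood(node):
    """Everything directly linked to node."""
    return sorted(links.get(node, ()))


def within_hops(start, hops):
    """Everything reachable from start in at most the given number of links."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        frontier = {n for node in frontier for n in links.get(node, ())} - seen
        seen |= frontier
    return seen - {start}


specimen_neighbours = neighbourhood("specimen:FMNH 167358")
two_hops = within_hops("genbank:AY324464", 2)
```

Starting from AY324464, one hop reaches the paper and the specimen, and a second hop reaches DQ023431 and the taxon tag. None of this works unless both sequence records resolve to the same specimen identifier, which is why the messy strings matter so much.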

2 comments:

Kevin said...

Hahaha i totally agree with u on the coords part being frustrating... as they say garbage in garbage out..
