Saturday, May 06, 2006

Ants, RDF, and triple stores


Background
To explore the promise of RDF and triple stores we need some large, interesting data sets. Ants are cool, there is a lot of data available online (e.g., AntWeb at the California Academy of Sciences, Antbase at the American Museum of Natural History, New York, the Hymenoptera Name Server at Ohio State University, Chris Schmidt's ponerine.org, and Ant News), and they get good press (for example, the "Google ant").

Specimens


First, we start with a Google Earth file for ants, obtained from AntWeb on Monday, April 24th, 2006. AntWeb gives the link as http://www.antweb.org/antweb.kmz, which is a compressed KML file. However, this file merely gives the location of the actual data file, http://www.antweb.org/doc.kmz. Grab that file, expand it, and you get 27 MB of XML listing 50,550 ant specimens and 1,511 associated images.
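A KMZ file is just a zip archive with a KML document inside, so the "expand it" step can be sketched in a few lines of Python. This is only an illustration, not the tool I actually used: the helper names are made up, and the demo builds a tiny archive in memory rather than downloading the real file.

```python
import io
import zipfile


def extract_kml(kmz_bytes: bytes) -> str:
    """Return the text of the first .kml member of a KMZ (zip) archive."""
    with zipfile.ZipFile(io.BytesIO(kmz_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".kml"):
                return archive.read(name).decode("utf-8")
    raise ValueError("no .kml file found in archive")


def make_demo_kmz() -> bytes:
    """Build a tiny in-memory KMZ so the sketch runs without network access."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "doc.kml",
            "<kml><Placemark><name>demo</name></Placemark></kml>",
        )
    return buffer.getvalue()


kml_text = extract_kml(make_demo_kmz())
print(kml_text.count("<Placemark>"), "placemark(s) found")
```

For the real data you would pass the bytes of the downloaded doc.kmz to extract_kml instead of the demo archive.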

We use the Google Earth file because it gives us a dump of AntWeb, and does it in a reasonably easy-to-handle format (XML). I wrote a C++ program to parse the KML file and dump the information to RDF. One limitation is that my program dies on the recent KML files because they have very long lines. Searching for <Placemark> and replacing it with \r<Placemark> in TextWrangler fixed the problem.
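The same search-and-replace can be scripted. A minimal sketch, assuming the whole file fits in memory and using "\n" as the line break (TextWrangler's \r does the same job in that editor); the function name is hypothetical:

```python
def break_before_placemarks(kml_text: str) -> str:
    """Insert a line break before every <Placemark> start tag."""
    return kml_text.replace("<Placemark>", "\n<Placemark>")


# Made-up sample with everything on one long line.
sample = (
    "<Folder><Placemark><name>a</name></Placemark>"
    "<Placemark><name>b</name></Placemark></Folder>"
)
fixed = break_before_placemarks(sample)
print(len(fixed.splitlines()), "lines after the fix")
```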

In order to keep things as simple and as generic as possible, I use Dublin Core metadata terms wherever possible, and the basic geo (WGS84 lat/long) vocabulary for geographical coordinates. The URI for the specimen is the URL (no LSIDs just yet).
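To give a feel for the conversion, here is a rough Python sketch (the real converter is the C++ program mentioned above). It assumes each Placemark carries the specimen code in <name> and a <Point> with <coordinates>, which may not match the AntWeb file exactly; the sample record and its coordinates are invented. The specimen URL pattern is the one used in the image example further down.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SPECIMEN_URL = "http://www.antweb.org/specimen.do?name="


def local(tag: str) -> str:
    """Strip any XML namespace so the KML version does not matter."""
    return tag.rsplit("}", 1)[-1]


def placemark_to_rdf(placemark: ET.Element) -> str:
    """Turn one Placemark into an rdf:Description using DC and geo terms."""
    fields = {local(el.tag): (el.text or "").strip() for el in placemark.iter()}
    code = fields["name"]  # assumption: <name> holds the specimen code
    # KML lists coordinates as longitude,latitude[,altitude].
    lon, lat = [part.strip() for part in fields["coordinates"].split(",")[:2]]
    return (
        f"<rdf:Description rdf:about={quoteattr(SPECIMEN_URL + code)}>\n"
        f"<dc:type>specimen</dc:type>\n"
        f"<dc:title>{escape(code)}</dc:title>\n"
        f"<geo:lat>{escape(lat)}</geo:lat>\n"
        f"<geo:long>{escape(lon)}</geo:long>\n"
        f"</rdf:Description>"
    )


# Invented sample record, for illustration only.
SAMPLE = """<kml xmlns="http://earth.google.com/kml/2.0"><Placemark>
<name>casent0000001</name>
<Point><coordinates>49.5,-13.25,0</coordinates></Point>
</Placemark></kml>"""

root = ET.fromstring(SAMPLE)
records = [placemark_to_rdf(el) for el in root.iter() if local(el.tag) == "Placemark"]
print(records[0])
```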

In addition to the RDF, I generate two text dumps for further processing.

Images
As noted at iSpecies, we can automate the extraction of metadata from images using Exif tags. There is a vocabulary for describing Exif data in RDF, which I've adopted. However, I don't use all the tags, nor do I use IFD, which frankly I don't understand.

So, the basic idea is to have a Perl script that:

  1. Takes a list of AntWeb images (more precisely, the URLs for the images)

  2. Fetches each image in turn using LWP and writes it to a temporary folder

  3. Uses Image::EXIF to extract Exif tags

  4. Generates RDF
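The steps above can be sketched in Python (the real script is Perl, using LWP and Image::EXIF). To keep the sketch runnable without network access or an Exif library, the fetch and extract steps are passed in as functions; fake_fetch and fake_extract are stand-ins I made up, and only the RDF-generation step is spelled out.

```python
from typing import Callable, Dict, Iterable, List
from xml.sax.saxutils import escape, quoteattr


def image_to_rdf(url: str, tags: Dict[str, str]) -> str:
    """Step 4: write one rdf:Description with an exif: element per tag."""
    lines = [
        f"<rdf:Description rdf:about={quoteattr(url)}>",
        "<dc:type>image</dc:type>",
    ]
    for name, value in sorted(tags.items()):
        lines.append(f"<exif:{name}>{escape(str(value))}</exif:{name}>")
    lines.append("</rdf:Description>")
    return "\n".join(lines)


def process_images(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    extract_exif: Callable[[bytes], Dict[str, str]],
) -> List[str]:
    """Steps 1 to 4: take URLs, fetch each image, read its tags, emit RDF."""
    return [image_to_rdf(url, extract_exif(fetch(url))) for url in urls]


def fake_fetch(url: str) -> bytes:
    """Stand-in for the download step; returns no real image data."""
    return b""


def fake_extract(data: bytes) -> Dict[str, str]:
    """Stand-in for the Exif reader; returns two tags from the example below."""
    return {"imageWidth": "112", "imageHeight": "64"}


URL = "http://www.antweb.org/images/casent0005842/casent0005842_p_1_low.jpg"
rdf = process_images([URL], fake_fetch, fake_extract)[0]
print(rdf)
```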


Some AntWeb specific things include linking the image to the specimen, and linking to a Creative Commons license.
Here is an example:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:exif="http://www.w3.org/2003/12/exif/ns#">
<!-- Opening element restored for well-formedness; the exif namespace URI is assumed -->
<rdf:Description rdf:about="http://www.antweb.org/images/casent0005842/casent0005842_p_1_low.jpg" >
<dc:subject rdf:resource="http://www.antweb.org/specimen.do?name=casent0005842" />
<dc:type>image</dc:type>
<dc:publisher rdf:resource="http://www.antweb.org/"/>
<dc:format>image/jpeg</dc:format>
<exif:resolutionUnit>inches</exif:resolutionUnit>
<exif:yResolution>337.75</exif:yResolution>
<exif:imageHeight>64</exif:imageHeight>
<exif:imageWidth>112</exif:imageWidth>
<exif:xResolution>337.75</exif:xResolution>
</rdf:Description>
</rdf:RDF>



This RDF is generated from the image to the right. What is interesting about the Exif metadata is that it isn't generated from the AntWeb database itself, but from the images. Hence, unlike the Google Earth file, we are adding value rather than simply reformatting an existing resource.

Of course, there are some gotchas. Some images look like this ("image not available"), and the Exif tag Copyright has a stray null character (\0x00) appended at the end, which breaks the RDF. I fixed this using Zap Gremlins in TextWrangler.
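If you would rather not clean up by hand, the stray characters can be stripped before the RDF is written. A minimal sketch, assuming it is enough to drop the C0 control characters that XML 1.0 does not allow (tab, line feed and carriage return are kept); the function name is borrowed from the TextWrangler command, not from any library, and the sample string is invented:

```python
import re

# C0 control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_INVALID_XML_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def zap_gremlins(text: str) -> str:
    """Remove control characters that would make the XML ill-formed."""
    return _INVALID_XML_CONTROLS.sub("", text)


cleaned = zap_gremlins("example copyright notice\x00")
print(repr(cleaned))
```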

Names
There is no single authoritative list of scientific names. I'm going to use the biggest (and best), uBio, specifically their findIT SOAP service. It might make more sense to use the Hymenoptera Name Server, but uBio serves RDF, and gets most of the ant names anyway, as the Hymenoptera Name Server feeds names into ITIS, and these in turn end up in uBio. The result of this mapping is a <dc:subject> tag for each specimen that links, using rdf:resource, to a uBio LSID. When we make the mapping, we write the uBio namebank ids to a separate file, which we then process to get the RDF for each name.
The script reads a list of specimens and taxon names, calls uBio's findIT SOAP service, and if it gets a direct match, writes some RDF linking the specimen URI to the uBio LSID. It also stores the uBio id in memory, and dumps these into a file for processing in the next step.
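In outline, the mapping step looks something like the Python sketch below. The SOAP call is replaced by a lookup function passed in by the caller; fake_findit, the namebank id 12345 and the LSID prefix are all assumptions for illustration, not the real uBio interface.

```python
from typing import Callable, Iterable, List, Optional, Tuple
from xml.sax.saxutils import quoteattr

UBIO_LSID_PREFIX = "urn:lsid:ubio.org:namebank:"  # assumed LSID pattern


def map_names(
    specimens: Iterable[Tuple[str, str]],
    lookup: Callable[[str], Optional[str]],
) -> Tuple[List[str], List[str]]:
    """Link each specimen URI to a name LSID when the lookup finds a match."""
    rdf_blocks: List[str] = []
    namebank_ids: List[str] = []
    for uri, name in specimens:
        namebank_id = lookup(name)
        if namebank_id is None:
            continue  # no direct match, so write nothing for this specimen
        rdf_blocks.append(
            f"<rdf:Description rdf:about={quoteattr(uri)}>\n"
            f"<dc:subject rdf:resource={quoteattr(UBIO_LSID_PREFIX + namebank_id)}/>\n"
            f"</rdf:Description>"
        )
        if namebank_id not in namebank_ids:
            namebank_ids.append(namebank_id)  # kept for the metadata step
    return rdf_blocks, namebank_ids


def fake_findit(name: str) -> Optional[str]:
    """Stand-in for the name service; the id below is made up."""
    return {"Melissotarsus insularis": "12345"}.get(name)


SPECIMENS = [
    ("http://www.antweb.org/specimen.do?name=casent0107665-d01", "Melissotarsus insularis"),
    ("http://www.antweb.org/specimen.do?name=casent0000001", "No such name"),
]
rdf_blocks, namebank_ids = map_names(SPECIMENS, fake_findit)
print(rdf_blocks[0])
print(namebank_ids)
```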

uBio metadata

Having mapped ant names to uBio, we can then go to uBio and use their LSID authority to retrieve metadata for each name in, you guessed it, RDF. We could resolve LSIDs, but for speed I'll "cheat" and append the uBio namebank ID to http://names.ubio.org/authority/metadata.php?lsid=.
So, armed with a Perl script we read the list of uBio ids, fetch the metadata for each one and dump it into a directory. I then run another Perl script that scans a directory for ".rdf" files and puts them in the triple store.
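A rough Python equivalent of the two small pieces involved, building the metadata URL and scanning a folder for ".rdf" files, might look like this. The function names and the ids are hypothetical, and the demo writes placeholder files to a temporary directory instead of fetching anything.

```python
import tempfile
from pathlib import Path
from typing import List

METADATA_URL = "http://names.ubio.org/authority/metadata.php?lsid="


def metadata_url(identifier: str) -> str:
    """Append the identifier to the metadata URL, as described above."""
    return METADATA_URL + identifier


def rdf_files(directory: str) -> List[Path]:
    """Return the .rdf files in a directory, in a stable order."""
    return sorted(Path(directory).glob("*.rdf"))


with tempfile.TemporaryDirectory() as folder:
    for namebank_id in ["12345", "67890"]:  # made-up ids
        (Path(folder) / f"{namebank_id}.rdf").write_text("<rdf:RDF/>")
    (Path(folder) / "notes.txt").write_text("ignored")
    found = [path.name for path in rdf_files(folder)]

print(metadata_url("12345"))
print(found)
```

Each file found this way would then be handed to whatever loader the triple store provides.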

NCBI

Sequences
I retrieved all ant sequences from GenBank by searching the taxonomy browser for Formicidae, downloading all the sequence GIs, then running a Perl script that retrieved the XML record for each sequence and populated a MySQL database. I then found all sequences whose specimen voucher field matches CASENT%:


SELECT DISTINCT dbxref.id FROM
specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE "CASENT%") AND (dbxref.namespace = "GI")
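To see how the joins fit together, here is a toy version run against an in-memory SQLite database from Python. The table layout is guessed from the column names in the query, not copied from my MySQL schema, and the rows (including the GI numbers) are invented. The namespace is passed as a parameter so the same query serves for GI and PUBMED cross-references.

```python
import sqlite3

# Assumed minimal schema, inferred from the columns the query uses.
SCHEMA = """
CREATE TABLE specimen (source_id INTEGER, code TEXT);
CREATE TABLE source (source_id INTEGER, seq_id INTEGER);
CREATE TABLE sequence_dbxref (sequence_id INTEGER, dbxref_id INTEGER);
CREATE TABLE dbxref (dbxref_id INTEGER, namespace TEXT, id TEXT);
"""

QUERY = """
SELECT DISTINCT dbxref.id FROM
specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE 'CASENT%') AND (dbxref.namespace = ?)
ORDER BY dbxref.id
"""


def linked_ids(conn: sqlite3.Connection, namespace: str) -> list:
    """Return cross-reference ids in one namespace for CASENT vouchers."""
    return [row[0] for row in conn.execute(QUERY, (namespace,))]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Invented rows: one CASENT voucher with two cross-references, one other voucher.
conn.executemany("INSERT INTO specimen VALUES (?, ?)", [(1, "CASENT0107665-D01"), (2, "OTHER0001")])
conn.executemany("INSERT INTO source VALUES (?, ?)", [(1, 10), (2, 20)])
conn.executemany("INSERT INTO sequence_dbxref VALUES (?, ?)", [(10, 100), (10, 101), (20, 102)])
conn.executemany(
    "INSERT INTO dbxref VALUES (?, ?, ?)",
    [(100, "GI", "111"), (101, "PUBMED", "16214741"), (102, "GI", "222")],
)

print(linked_ids(conn, "GI"))
print(linked_ids(conn, "PUBMED"))
```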

Next, we fetch these records from NCBI. This seems redundant as we have the information already in a local MySQL database, but I want to use a simple script that takes a GI and outputs RDF so that anybody can do this.

Names
In much the same way, I grabbed the TaxIds for ants with sequences, and grabbed RDF for each name.

PubMed
For PubMed records I wrote a simple Perl script that, given a list of PubMed identifiers, retrieves the XML record from NCBI and converts it to RDF using an XSLT style sheet. The script also gets the identifiers for any sequence linked to that PubMed record using ELink, and uses the <dcterms:references> tag to model the relationship. For the ant project I only use PubMed ids for papers that include sequences that have CASENT specimens:

SELECT DISTINCT dbxref.id FROM
specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE "CASENT%") AND (dbxref.namespace = "PUBMED")


Turns out there are only three such papers:

16601190

16214741

15336679


FORMIS
We could add bibliographic data from FORMIS, which can be searched online here, and downloaded as EndNote files. This would be "fun" to convert to RDF.

PubMed Central
This search finds all papers on Formicidae in PubMed Central, which we could use as an easy source of XML data, in some cases with full text and citation links.

Triple store
The beauty of a triple store is that we can import all these RDF documents into a single store and query them. It doesn't matter that we have information about images in one file, information about specimens in another, and information about names in yet another file. If we use URIs consistently, it all comes together. This is data integration made easy.
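A toy illustration of the idea in Python, with a plain set of (subject, predicate, object) tuples standing in for the triple store. The three "documents" are tiny hand-made examples, and the uBio LSID is invented; the point is only that a shared specimen URI ties the files together once they are merged.

```python
DC = "http://purl.org/dc/elements/1.1/"
IMAGE = "http://www.antweb.org/images/casent0005842/casent0005842_p_1_low.jpg"
SPECIMEN = "http://www.antweb.org/specimen.do?name=casent0005842"
NAME = "urn:lsid:ubio.org:namebank:12345"  # made-up LSID

# Three separate "files", each describing a different kind of thing.
images_doc = {(IMAGE, DC + "subject", SPECIMEN), (IMAGE, DC + "type", "image")}
specimens_doc = {(SPECIMEN, DC + "type", "specimen")}
names_doc = {(SPECIMEN, DC + "subject", NAME)}

# Importing is just a union: the store does not care where a triple came from.
store = set()
for document in (images_doc, specimens_doc, names_doc):
    store |= document


def mentioning(triples, uri):
    """Return every triple that has the URI as subject or object."""
    return sorted(t for t in triples if uri in (t[0], t[2]))


linked = mentioning(store, SPECIMEN)
for triple in linked:
    print(triple)
```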

Query

This RDQL query finds all images for Melissotarsus insularis:


SELECT ?image WHERE
(?taxon, <dc:subject>, "Melissotarsus insularis")
(?specimen, <dc:subject>, ?taxon)
(?image, <dc:subject>, ?specimen)
(?image, <dc:type>,"image")
USING dc FOR <http://purl.org/dc/elements/1.1/>


which returns two images.

OK, now for something a little more fun. The Smith et al. barcoding paper that surveyed ants in Madagascar has PubMed id 16214741 (this paper also has the identifier doi:10.1098/rstb.2005.1714). Given this id (recast as an LSID urn:lsid:ncbi.nlm.nih.gov.lsid.zoology.gla.ac.uk:pubmed:16214741), we can find the geographic localities the authors sampled from using this query:


SELECT ?lat, ?long WHERE
(?nuc, <dcterms:isReferencedBy>, <urn:lsid:ncbi.nlm.nih.gov.lsid.zoology.gla.ac.uk:pubmed:16214741>)
(?nuc, <dc:source>, ?specimen)
(?specimen, <geo:lat>, ?lat)
(?specimen, <geo:long>, ?long)
USING dc FOR <http://purl.org/dc/elements/1.1/>
dcterms FOR <http://purl.org/dc/terms/>
geo FOR <http://www.w3.org/2003/01/geo/wgs84_pos#>


which gives four localities:


?lat ?long
"-13.263333" "49.603333"
"-13.464444" "48.551666"
"-13.211666" "49.556667"
"-14.4366665" "49.775"


We can also search our triple store using other identifiers, such as DOIs:


SELECT ?lat, ?long WHERE
(?pubmed, <dc:identifier>, <doi:10.1098/rstb.2005.1714>)
(?nuc, <dcterms:isReferencedBy>, ?pubmed)
(?nuc, <dc:source>, ?specimen)
(?specimen, <geo:lat>, ?lat)
(?specimen, <geo:long>, ?long)
USING dc FOR <http://purl.org/dc/elements/1.1/>
dcterms FOR <http://purl.org/dc/terms/>
geo FOR <http://www.w3.org/2003/01/geo/wgs84_pos#>


This is the same query as above, but it uses the DOI for the barcoding paper.
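To make the join behind these queries concrete, here is a small Python sketch that evaluates a list of triple patterns against a set of triples, binding variables (terms starting with "?") as it goes. It is a naive nested-loop matcher written for illustration, not how any real RDQL engine works. The sequence and specimen identifiers are invented; one locality is taken from the results above, and a second, unlinked specimen is added as a distractor.

```python
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
PAPER = "urn:lsid:ncbi.nlm.nih.gov.lsid.zoology.gla.ac.uk:pubmed:16214741"
DOI = "doi:10.1098/rstb.2005.1714"

# Hypothetical identifiers for one sequence and two specimens.
NUC, SPEC_A, SPEC_B = "urn:example:nuc1", "urn:example:specimenA", "urn:example:specimenB"

STORE = {
    (PAPER, DC + "identifier", DOI),
    (NUC, DCTERMS + "isReferencedBy", PAPER),
    (NUC, DC + "source", SPEC_A),
    (SPEC_A, GEO + "lat", "-13.263333"),
    (SPEC_A, GEO + "long", "49.603333"),
    (SPEC_B, GEO + "lat", "0.0"),   # distractor: not linked to any sequence
    (SPEC_B, GEO + "long", "0.0"),
}


def query(store, patterns):
    """Return all variable bindings that satisfy every pattern."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in store:
                candidate = dict(binding)
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # Bind the variable, or check an existing binding.
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        break
                else:
                    extended.append(candidate)
        solutions = extended
    return solutions


PATTERNS = [
    ("?pubmed", DC + "identifier", DOI),
    ("?nuc", DCTERMS + "isReferencedBy", "?pubmed"),
    ("?nuc", DC + "source", "?specimen"),
    ("?specimen", GEO + "lat", "?lat"),
    ("?specimen", GEO + "long", "?long"),
]
localities = sorted({(b["?lat"], b["?long"]) for b in query(STORE, PATTERNS)})
print(localities)
```

Note that the dc:source pattern is what ties ?specimen to the sequences from the paper; without it, every specimen with coordinates would match.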

New inferences

One thing I noticed early on is that there are specimens that have been barcoded and which are labelled in GenBank as unidentified (i.e., they have names like "Melissotarsus sp. BLF m1"), but the same specimen has a proper name in AntWeb (e.g., casent0107665-d01 is Melissotarsus insularis). Assuming the identification is correct (a big if), we can then use this information to add value to GenBank. For example, a search of GenBank for sequences for Melissotarsus insularis finds nothing, but GenBank does have sequences for this taxon, albeit under the name "Melissotarsus sp. BLF m1".

This query searches the triple store for specimens that are named differently in AntWeb and GenBank. Often both names are not proper names, but represent different ways of saying "we don't know what this is". But in some cases, the specimen does have a proper name attached to it:


SELECT ?specimen, ?ident, ?name WHERE
(?specimen, <dc:type>, "specimen")
(?specimen, <dc:subject>, ?ident)
(?nuc, <dc:source>, ?specimen)
(?nuc, <dc:subject>, ?taxid)
(?taxid, <dc:type>, "Scientific Name")
(?taxid, <dc:title>, ?name)
AND ?ident ne ?name
USING dc FOR <http://purl.org/dc/elements/1.1/>
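The logic of this query can be written out by hand in Python, which may make the chain of links easier to follow. This is a sketch over a toy set of triples: the specimen and names for the first record come from the example above, while the sequence and taxon identifiers, and the second (matching) record, are invented.

```python
DC = "http://purl.org/dc/elements/1.1/"
SPEC_1 = "http://www.antweb.org/specimen.do?name=casent0107665-d01"
SPEC_2 = "urn:example:specimen2"  # invented specimen whose names agree

STORE = {
    (SPEC_1, DC + "type", "specimen"),
    (SPEC_1, DC + "subject", "Melissotarsus insularis"),
    ("urn:example:nuc1", DC + "source", SPEC_1),
    ("urn:example:nuc1", DC + "subject", "urn:example:taxid1"),
    ("urn:example:taxid1", DC + "type", "Scientific Name"),
    ("urn:example:taxid1", DC + "title", "Melissotarsus sp. BLF m1"),
    (SPEC_2, DC + "type", "specimen"),
    (SPEC_2, DC + "subject", "Example name"),
    ("urn:example:nuc2", DC + "source", SPEC_2),
    ("urn:example:nuc2", DC + "subject", "urn:example:taxid2"),
    ("urn:example:taxid2", DC + "type", "Scientific Name"),
    ("urn:example:taxid2", DC + "title", "Example name"),
}


def objects(store, subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for (s, p, o) in store if s == subject and p == predicate]


def subjects(store, predicate, obj):
    """All subjects of triples with the given predicate and object."""
    return [s for (s, p, o) in store if p == predicate and o == obj]


def mismatches(store):
    """Specimens whose own name differs from the name on a linked sequence."""
    rows = set()
    for specimen in subjects(store, DC + "type", "specimen"):
        for ident in objects(store, specimen, DC + "subject"):
            for nuc in subjects(store, DC + "source", specimen):
                for taxid in objects(store, nuc, DC + "subject"):
                    if "Scientific Name" not in objects(store, taxid, DC + "type"):
                        continue
                    for name in objects(store, taxid, DC + "title"):
                        if ident != name:
                            rows.add((specimen, ident, name))
    return sorted(rows)


for row in mismatches(STORE):
    print(row)
```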



