Saturday, May 06, 2006

Ants, RDF, and triple stores


Background
In order to explore the promise of RDF and triple stores we need some large, interesting data sets. Ants are cool, there is a lot of data available online (e.g., AntWeb at the California Academy of Sciences, Antbase at the American Museum of Natural History in New York, the Hymenoptera Name Server at Ohio State University, Chris Schmidt's ponerine.org, and Ant News), and they get good press (for example, the "Google ant").

Specimens


We start with a Google Earth file for ants, obtained from AntWeb on Monday, April 24th, 2006. AntWeb gives the link as http://www.antweb.org/antweb.kmz, which is a compressed KML file. However, this file merely gives the location of the actual data file, http://www.antweb.org/doc.kmz. Grab that file, expand it, and you get 27 MB of XML listing 50,550 ant specimens and 1,511 associated images.

We use the Google Earth file because it gives us a dump of AntWeb in a reasonably easy to handle format (XML). I wrote a C++ program to parse the KML file and dump the information to RDF. One limitation is that my program dies on the recent KML files because they have very long lines. Searching for <Placemark> and replacing it with \r<Placemark> in TextWrangler fixed the problem.
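For those unfamiliar with KML, each specimen is a standard Placemark element, something like this (a simplified fragment with invented values, not copied from the real file). Note that KML gives coordinates as longitude,latitude, so the parser has to swap them when writing out latitude and longitude.

<Placemark>
<name>casent0005842</name>
<Point>
<coordinates>49.5,-13.2,0</coordinates>
</Point>
</Placemark>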

In order to keep things as simple and as generic as possible, I use Dublin Core metadata terms wherever possible, and the basic geo (WGS84 lat/long) vocabulary for geographical coordinates. The URI for the specimen is the URL (no LSIDs just yet).
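To make this concrete, a specimen record ends up looking something like this (just a sketch; the values are invented and the real dump may use a slightly different mix of properties):

<rdf:Description rdf:about="http://www.antweb.org/specimen.do?name=casent0005842">
<dc:type>specimen</dc:type>
<dc:subject>Melissotarsus insularis</dc:subject>
<dc:publisher rdf:resource="http://www.antweb.org/"/>
<geo:lat>-13.2</geo:lat>
<geo:long>49.5</geo:long>
</rdf:Description>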

In addition to the RDF, I generate two text dumps for further processing.

Images
As noted at iSpecies, we can automate the extraction of metadata from images using Exif tags. There is a vocabulary for describing Exif data in RDF, which I've adopted. However, I don't use all the tags, nor do I use IFD, which frankly I don't understand.

So, the basic idea is to have a Perl script (sketched below) that:

  1. Takes a list of AntWeb images (more precisely, the URLs for the images)

  2. Fetches each image in turn using LWP and writes them to a temporary folder

  3. Uses Image::EXIF to extract Exif tags

  4. Generates RDF
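A minimal sketch of such a script (not the real code; the input file name, the temporary folder, and the use of Image::EXIF's get_all_info method are my assumptions):

use strict;
use warnings;
use LWP::Simple qw(getstore is_success);
use Image::EXIF;
use File::Basename;

my $tmpdir = 'images';
mkdir $tmpdir unless -d $tmpdir;

# 1. one image URL per line (assumed input format)
open my $list, '<', 'image_urls.txt' or die $!;
while (my $url = <$list>) {
    chomp $url;
    next unless $url;

    # 2. fetch the image with LWP and write it to the temporary folder
    my $file = "$tmpdir/" . basename($url);
    next unless is_success( getstore($url, $file) );

    # 3. extract the Exif tags
    my $exif = Image::EXIF->new($file);
    my $info = $exif->get_all_info() or next;

    # 4. the real script writes these out as exif: properties in RDF;
    #    here we just list the tags to see what is available
    for my $section (keys %$info) {
        for my $tag (keys %{ $info->{$section} }) {
            print "$url\t$tag\t$info->{$section}{$tag}\n";
        }
    }
}
close $list;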


Some AntWeb-specific things include linking the image to the specimen, and linking to a Creative Commons license.
Here is an example:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:exif="http://www.w3.org/2003/12/exif/ns#">
<rdf:Description rdf:about="http://www.antweb.org/images/casent0005842/casent0005842_p_1_low.jpg" >
<dc:subject rdf:resource="http://www.antweb.org/specimen.do?name=casent0005842" />
<dc:type>image</dc:type>
<dc:publisher rdf:resource="http://www.antweb.org/"/>
<dc:format>image/jpeg</dc:format>
<exif:resolutionUnit>inches</exif:resolutionUnit>
<exif:yResolution>337.75</exif:yResolution>
<exif:imageHeight>64</exif:imageHeight>
<exif:imageWidth>112</exif:imageWidth>
<exif:xResolution>337.75</exif:xResolution>
</rdf:Description>
</rdf:RDF>



This RDF is generated from the image to the right. What is interesting about the Exif metadata is that it isn't generated from the AntWeb database itself, but from the images. Hence, unlike the Google Earth file, we are adding value rather than simply reformatting an existing resource.

Of course, there are some gotchas. Some images look like this ("image not available"), and the Exif tag Copyright has a stray null character (0x00) appended to the end, which breaks the RDF. I fixed this with Zap Gremlins in TextWrangler.

Names
There is no single authoritative list of scientific names. I'm going to use the biggest (and best), uBio, specifically their findIT SOAP service. It might make more sense to use the Hymenoptera Name Server, but uBio serves RDF, and gets most of the ant names anyway because the Hymenoptera Name Server feeds names into ITIS, and those names in turn end up in uBio. The result of this mapping is a <dc:subject> tag for each specimen that links, via rdf:resource, to a uBio LSID. When we make the mapping, we write the uBio NameBank ids to a separate file, which we then process to get the RDF for each name.
The script reads a list of specimens and taxon names, calls uBio's findIT SOAP service, and if it gets a direct match, writes some RDF linking the specimen URI to the uBio LSID. It also stores the uBio id in memory, and dumps these into a file for processing in the next step.
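The RDF this produces is minimal: one statement per matched specimen, something like this (the specimen and NameBank id here are just for illustration, not a real match from the run):

<rdf:Description rdf:about="http://www.antweb.org/specimen.do?name=casent0005842">
<dc:subject rdf:resource="urn:lsid:ubio.org:namebank:1234567"/>
</rdf:Description>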

uBio metadata

Having mapped ant names to uBio, we can then go to uBio and use their LSID authority to retrieve metadata for each name in, you guessed it, RDF. We could resolve LSIDs, but for speed I'll "cheat" and append the uBio namebank ID to http://names.ubio.org/authority/metadata.php?lsid=.
So, armed with a Perl script, we read the list of uBio ids, fetch the metadata for each one, and dump it into a directory. I then run another Perl script that scans a directory for ".rdf" files and puts them in the triple store.
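The fetch step can be as simple as this sketch (the input file name and output folder are assumptions, not those of the actual script):

use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

my $base = 'http://names.ubio.org/authority/metadata.php?lsid=';
mkdir 'ubio' unless -d 'ubio';

# one NameBank id per line (assumed input format)
open my $ids, '<', 'ubio_ids.txt' or die $!;
while (my $id = <$ids>) {
    chomp $id;
    next unless $id =~ /^\d+$/;
    # fetch the RDF metadata for this name and save it as a .rdf file,
    # ready to be loaded into the triple store
    my $status = getstore( $base . $id, "ubio/$id.rdf" );
    warn "failed to fetch $id\n" unless is_success($status);
}
close $ids;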

NCBI

Sequences
I retrieved all ant sequences from GenBank by searching the taxonomy browser for Formicidae, downloading all the sequence GIs, then running a Perl script that retrieved the XML record for each sequence and populated a MySQL database. I then found all sequences whose specimen voucher field matches CASENT%:


SELECT DISTINCT dbxref.id FROM
specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE "CASENT%") AND (dbxref.namespace = "GI")

Next, we fetch these records from NCBI. This seems redundant, as we already have the information in a local MySQL database, but I want to use a simple script that takes a GI and outputs RDF, so that anybody can do this.
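The fetching part of such a script is straightforward using NCBI's EUtils (this is just a sketch; the RDF conversion step is left out, and the output file name is only for illustration):

use strict;
use warnings;
use LWP::Simple qw(get);

my $gi = shift @ARGV or die "Usage: $0 <gi>\n";

# ask EUtils for the GenBank record as XML
my $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
        . "?db=nucleotide&id=$gi&retmode=xml";
my $xml = get($url) or die "could not fetch GI $gi\n";

# the real script converts this XML to RDF; here we just save it
open my $out, '>', "$gi.xml" or die $!;
print $out $xml;
close $out;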

Names
In much the same way, I grabbed the TaxIds for ants with sequences, and retrieved RDF for each name.

PubMed
For PubMed records I wrote a simple Perl script that, given a list of PubMed identifiers, retrieves the XML record from NCBI and converts it to RDF using an XSLT style sheet. The script also gets the identifiers of any sequences linked to that PubMed record using the ELink service, and uses the <dcterms:references> tag to model the relationship (see the snippet after the list of papers below). For the ant project I only use PubMed ids for papers that include sequences that have CASENT specimens:

SELECT DISTINCT dbxref.id FROM
specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE "CASENT%") AND (dbxref.namespace = "PUBMED")


Turns out there are only three such papers:

16601190

16214741

15336679
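In RDF, this relationship is just a <dcterms:references> statement from the paper to each of its sequences, along these lines (the sequence URI below is a placeholder, not a real identifier):

<rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov.lsid.zoology.gla.ac.uk:pubmed:16214741">
<dcterms:references rdf:resource="http://example.org/nucleotide/12345678"/>
</rdf:Description>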


FORMIS
We could add bibliographic data from FORMIS, which can be searched online here, and downloaded as EndNote files. This would be "fun" to convert to RDF.

PubMed Central
This search finds all papers on Formicidae in PubMed Central, which we could use as an easy source of XML data, in some cases with full text and citation links.

Triple store
The beauty of a triple store is that we can import all these RDF documents into a single store and query them. It doesn't matter that we have information about images in one file, information about specimens in another, and information about names in yet another file. If we use URIs consistently, it all comes together. This is data integration made easy.

Query

This RDQL query finds all images for Melissotarsus insularis:


SELECT ?image WHERE
(?taxon, <dc:subject>, "Melissotarsus insularis")
(?specimen, <dc:subject>, ?taxon)
(?image, <dc:subject>, ?specimen)
(?image, <dc:type>,"image")
USING dc FOR <http://purl.org/dc/elements/1.1/>


This query returns two images.

OK, now for something a little more fun. The Smith et al. barcoding paper that surveyed ants in Madagascar has PubMed id 16214741 (this paper also has the identifier doi:10.1098/rstb.2005.1714). Given this id (recast as an LSID, urn:lsid:ncbi.nlm.nih.gov.lsid.zoology.gla.ac.uk:pubmed:16214741), we can find the geographic localities the authors sampled from using this query:


SELECT ?lat, ?long WHERE
(?nuc, <dcterms:isReferencedBy>, <urn:lsid:ncbi.nlm.nih.gov.lsid.zoology.gla.ac.uk:pubmed:16214741>)
(?nuc, <dc:source>, ?specimen)
(?specimen, <geo:lat>, ?lat)
(?specimen, <geo:long>, ?long)
USING dc FOR <http://purl.org/dc/elements/1.1/>
dcterms FOR <http://purl.org/dc/terms/>
geo FOR <http://www.w3.org/2003/01/geo/wgs84_pos#>


which gives four localities:


?lat ?long
"-13.263333" "49.603333"
"-13.464444" "48.551666"
"-13.211666" "49.556667"
"-14.4366665" "49.775"


We can also search our triple store using other identifiers, such as DOIs:


SELECT ?lat, ?long WHERE
(?pubmed, <dc:identifier>, <doi:10.1098/rstb.2005.1714>)
(?nuc, <dcterms:isReferencedBy>, ?pubmed)
(?nuc, <dc:source>, ?specimen)
(?specimen, <geo:lat>, ?lat)
(?specimen, <geo:long>, ?long)
USING dc FOR <http://purl.org/dc/elements/1.1/>
dcterms FOR <http://purl.org/dc/terms/>
geo FOR <http://www.w3.org/2003/01/geo/wgs84_pos#>


This is the same query as above, but uses the DOI for the barcoding paper instead of the PubMed LSID.

New inferences

One thing I noticed early on is that there are specimens that have been barcoded and are labelled in GenBank as unidentified (i.e., they have names like "Melissotarsus sp. BLF m1"), but the same specimen has a proper name in AntWeb (e.g., casent0107665-d01 is Melissotarsus insularis). Assuming the identification is correct (a big if), we can use this information to add value to GenBank. For example, a search of GenBank for sequences of Melissotarsus insularis finds nothing, yet GenBank does have sequences for this taxon, albeit under the name "Melissotarsus sp. BLF m1".

This query searches the triple store for specimens that are named differently in AntWeb and GenBank. Often neither name is a proper name; they simply represent different ways of saying "we don't know what this is". But in some cases the specimen does have a proper name attached to it:


SELECT ?specimen, ?ident, ?name WHERE
(?specimen, <dc:type>, "specimen")
(?specimen, <dc:subject>, ?ident)
(?nuc, <dc:source>, ?specimen)
(?nuc, <dc:subject>, ?taxid)
(?taxid, <dc:type>, "Scientific Name")
(?taxid, <dc:title>, ?name)
AND ?ident ne ?name
USING dc FOR <http://purl.org/dc/elements/1.1/>



Currently playing in iTunes: One by Mary J Blige & Bono
Currently playing in iTunes: Crazy (Single Version) by Gnarls Barkley
