Monday, April 20, 2009

Accessing specimens using TAPIR or, why do we make this so hard?

OK, second rant of the day. One of my favourite online specimen databases is AntWeb. For a while it hasn't been possible to harvest data from this database using the venerable DiGIR protocol, due to various issues at the California Academy of Sciences. Well, now it's back, and "accessible" using TAPIR (TAPIR - TDWG Access Protocol for Information Retrieval). Accessible, that is, if you like horrifically over-engineered, poorly documented standards. OK, a lot of work has gone into TAPIR, there's lots of great code on SourceForge, and there's lots of documentation, but I've really struggled to get the most basic tasks done.

For example, let's imagine I want to retrieve the information on the ant specimen CASENT0100367 (note how trivial this is via a web browser, just append the specimen name to http://www.antweb.org/specimen.do?name=). After much clenching of teeth struggling with the TAPIR documentation and the TAPIR client software, I finally found an email by Markus Döring that gave me the clue. If I'm going to construct a URL to retrieve this specimen record, I need to include the URL of an XML document that serves as a template for the query. Since one doesn't exist, I have to create it and make it accessible to the TAPIR server (i.e., the AntWeb TAPIR server needs to access it, so I have to place this XML document on my web server). The template (shown below) lives at http://bioguid.info/tapir/dwc_catalog_number.xml:

<?xml version="1.0" encoding="UTF-8"?>
<searchTemplate xmlns="http://rs.tdwg.org/tapir/1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xsi:schemaLocation="http://rs.tdwg.org/tapir/1.0
http://rs.tdwg.org/tapir/1.0/schema/tapir.xsd
http://www.w3.org/2001/XMLSchema
http://www.w3.org/2001/XMLSchema.xsd">
<label>Scientific name in query</label>
<documentation>Query for a Scientific Name. Based on http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml, found in email by Markus Döring http://lists.tdwg.org/pipermail/tdwg-tapir/2008-April/000493.html</documentation>
<externalOutputModel location="http://rs.tdwg.org/tapir/cs/dwc/1.4/model/dw_core_geo_cur.xml"/>
<filter>
<equals>
<concept id="http://rs.tdwg.org/dwc/dwcore/CatalogNumber" />
<parameter name="name"/>
</equals>
</filter>
</searchTemplate>
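
To make the moving parts of the template explicit, here is a minimal Python sketch (standard library only) that parses an abbreviated copy of it and pulls out the two things that matter: the concept being filtered on, and the name of the URL parameter that supplies the value. The function name `describe_filter` and the trimmed `TEMPLATE` string (schema locations, label and documentation omitted) are illustrative assumptions, not part of TAPIR or of any TAPIR client.

```python
import xml.etree.ElementTree as ET

# Namespace used by TAPIR 1.0 documents (taken from the template above).
TAPIR_NS = {"t": "http://rs.tdwg.org/tapir/1.0"}

# Abbreviated copy of the template: only the output model and the filter.
TEMPLATE = """<searchTemplate xmlns="http://rs.tdwg.org/tapir/1.0">
  <externalOutputModel location="http://rs.tdwg.org/tapir/cs/dwc/1.4/model/dw_core_geo_cur.xml"/>
  <filter>
    <equals>
      <concept id="http://rs.tdwg.org/dwc/dwcore/CatalogNumber"/>
      <parameter name="name"/>
    </equals>
  </filter>
</searchTemplate>"""


def describe_filter(template_xml):
    """Return (concept id, parameter name) from the template's <equals> filter."""
    root = ET.fromstring(template_xml)
    equals = root.find("t:filter/t:equals", TAPIR_NS)
    concept = equals.find("t:concept", TAPIR_NS).get("id")
    parameter = equals.find("t:parameter", TAPIR_NS).get("name")
    return concept, parameter


concept, parameter = describe_filter(TEMPLATE)
print(f"URL parameter '{parameter}' is matched against concept {concept}")
```

In other words, the template says "whatever arrives in the `name` parameter must equal the Darwin Core CatalogNumber", and the output model says what shape the answer should take.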

Now I can write my query: http://www.antweb.org/tapirlink/www/tapir.php/antweb
?op=search
&start=0
&limit=1
&template=http://bioguid.info/tapir/dwc_catalog_number.xml
&name=casent0100367

So, the AntWeb server is going to read this query, and call my web server to get the query template to figure out what I actually want. Am I the only person who thinks that this is insane? Can anybody imagine going through these hoops to access a GenBank record, or a PubMed record?
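
For what it's worth, assembling that query in code is the easy part. The following is a rough Python sketch using only the standard library; the function name `specimen_query_url` is hypothetical, the endpoint and template URLs are the ones quoted above, and it only builds the URL (it makes no network request and assumes nothing about what the server returns).

```python
from urllib.parse import urlencode

# Endpoint and template location as quoted in the post.
ENDPOINT = "http://www.antweb.org/tapirlink/www/tapir.php/antweb"
TEMPLATE_URL = "http://bioguid.info/tapir/dwc_catalog_number.xml"


def specimen_query_url(catalog_number, endpoint=ENDPOINT, template_url=TEMPLATE_URL):
    """Build a template-based TAPIR search URL for a single catalog number."""
    params = {
        "op": "search",            # TAPIR search operation
        "start": 0,                # first record to return
        "limit": 1,                # we only want one specimen
        "template": template_url,  # query template the server must fetch
        "name": catalog_number,    # value for the template's <parameter name="name"/>
    }
    # urlencode percent-encodes the template URL so it survives as a single value.
    return endpoint + "?" + urlencode(params)


print(specimen_query_url("casent0100367"))
```

Note that the template URL has to be percent-encoded when it is passed as a parameter value, which `urlencode` takes care of.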

Perhaps it's me, and my obsession with linking individual data records (rather than harvesting lots of records, or federated search). But it strikes me that harvesting is a simple task and not many people will be doing it (at least, not on the scale of GBIF), and federated search is a non-starter as our community can't keep data providers online to save themselves.

In many ways I think TAPIR (and DiGIR before it) missed what for me is the most basic use case, namely I have a specimen identifier and I want to get the record for that specimen. These services make it much harder than it needs to be. It's a symptom of our field's inability to deliver simple tools that do basic tasks well, rather than overly general and highly complex tools that are poorly documented. Of course, retrieving individual records would be easy if we had resolvable GUIDs for specimens, but we've singularly failed to deliver that, so we are stuck with very clunky tools. There's got to be a better way...

2 comments:

Unknown said...

Rod, if you follow the pure KVP/REST style approach you don't have to pass it a query template but can query it directly. You have to supply an "output model", i.e. define the structure that you want it to return. Just try:
http://www.antweb.org/tapirlink/www/tapir.php/antweb?op=s&start=0&limit=10&m=http://rs.tdwg.org/tapir/cs/dwc/1.4/model/dw_core_geo_cur.xml&ScientificName=case
If you just need to know the existing names or any other mapped property (concept in TAPIR speak) you can use inventories, which are extremely simple as KVP because they don't need any output model. So the following gives you all names in AntWeb:
http://www.antweb.org/tapirlink/www/tapir.php/antweb?op=i&start=0&limit=10&c=http://rs.tdwg.org/dwc/dwcore/ScientificName
If your provider maps aliases to the long qualified concept names you can get even shorter URLs, if that's what you are after. For example, you can query the Publishing Toolkit with just this:
http://ipt.gbif.org/tapir/11?op=i&c=ScientificName
I think you know that TAPIR and its precursors DiGIR and BioCASe were designed for distributed querying, not harvesting. That's why we now propose much simpler formats, namely the Darwin Core Archive, for harvesting and indexing.


Also TAPIR wanted to be backwards compatible and introduced the REST/KVP style API in addition to the XML based requests. And it was meant to be generic and doesn't come with default biodiversity formats - which probably would have been a good idea.

Roderic Page said...

Markus, it just reminds me so much of OpenURL, a vastly over-engineered spec that is hell to do anything with. I'm sure there would be a way to improve this so that basic queries don't require the amount of hassle I went through. Admittedly I didn't spend much time reading the documentation, but really, I shouldn't have to...