iPhylo: Accessing specimens using TAPIR or, why do we make this so hard?

Roderic D. M. Page

Monday, April 20, 2009

Accessing specimens using TAPIR or, why do we make this so hard?

OK, second rant of the day. One of my favourite online specimen databases is AntWeb. For a while the ability to harvest data from this database using the venerable DiGIR protocol hasn't been possible, due to various issues at the California Academy of Sciences. Well, now it's back, and "accessible" using TAPIR (TAPIR - TDWG Access Protocol for Information Retrieval). Accessible, that is, if you like horrifically over-engineered, poorly documented standards. OK, at lot of work has gone into TAPIR, there's lots of great code on SourceForge, and there's lots of documentation, but I've really struggled to get the most basic tasks done.

For example, let's imagine I and want to retrieve the information on the ant specimen CASENT0100367 (note how trivial this is via a web browser, just append the specimen name to http://www.antweb.org/specimen.do?name=). After much clenching of teeth struggling with the TAPIR documentation and the TAPIR client software, I finally found an email by Markus Döring that gave me the clue. If I'm going to construct a URL to retrieve this specimen record, I need to include the URL of an XML document that serves as a template for the query. Since one doesn't exist, I have to create it and make it accessible to the TAPIR server (i.e., the AntWeb TAPIR server needs to access it, so I have to place this XML document on my web server). The template (shown below) lives at http://bioguid.info/tapir/dwc_catalog_number.xml:


<?xml version="1.0" encoding="UTF-8"?>
<searchTemplate xmlns="http://rs.tdwg.org/tapir/1.0" 
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                xmlns:xs="http://www.w3.org/2001/XMLSchema" 
                xsi:schemaLocation="http://rs.tdwg.org/tapir/1.0
                                    http://rs.tdwg.org/tapir/1.0/schema/tapir.xsd
                                    http://www.w3.org/2001/XMLSchema
                                    http://www.w3.org/2001/XMLSchema.xsd">
  <label>Scientific name in query</label>
  <documentation>Query for a Scientific Name. Based on http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml, found in email by Markus Döring http://lists.tdwg.org/pipermail/tdwg-tapir/2008-April/000493.html</documentation>
  <externalOutputModel location="http://rs.tdwg.org/tapir/cs/dwc/1.4/model/dw_core_geo_cur.xml"/>
  <filter>
    <equals>
        <concept id="http://rs.tdwg.org/dwc/dwcore/CatalogNumber" />
        <parameter name="name"/>
    </equals>
  </filter>
</searchTemplate>

Now I can write my query: http://www.antweb.org/tapirlink/www/tapir.php/antweb
op=search
&start=0
&limit=1
&template=http://bioguid.info/tapir/dwc_catalog_number.xml
&name=casent0100367
So, the AntWeb server is going to read this query, and call my web server to get the query template to figure out what I actually want. Am I the only person who thinks that this is insane? Can anybody imagine going through these hoops to access a GenBank record, or a PubMed record?

Perhaps it's me, and my obsession with linking individual data records (rather than harvested lots of records, or federated search). But it strikes me that harvesting is a simple task and not many people will be doing it (at least, not on the scale of GBIF), and federated search is a non-starter as our community can't keep data providers online to save themselves.

In many ways I think TAPIR (and DiGIR before it) missed what for me is the most basic use case, namely I have a specimen identifier and I want to get the record for that specimen. These services make it much harder than it needs to be. It's a symptom of our field's inability to deliver simple tools that do basic tasks well, rather than overly general and highly complex tools that are poorly documented. Of course, retrieving individual records woud be easy if we have resolvable GUIDs for specimens, but we've singularly failed to deliver that, so we are stuck with very clunky tools. There's got to be a better way...