Google Refine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use
Freebase reconciliation services, but you can also add external services. Inspired by this I've started to implement services to reconcile taxonomic names.
The services I've implemented so far are:
- EOL http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
- NCBI taxonomy http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ncbi.php
- uBio FindIT http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php
- WORMS http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_worms.php
- GBIF http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_gbif.php
- Global Names Index http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_globalnames.php
To use these you need to add the URLs above to Google Refine (see example below). The EOL, NCBI and WORMS do a basic name lookup. The uBio FindIT service extracts a taxonomic name from a string, and can be viewed as a "taxonomic name cleaner".
How to use reconciliation servicesStart a Google Refine session. Save the names below to a text file and open it as a new project.
Names
Achatina fulica (giant African snail)
Acromyrmex octospinosus ST040116-01
Alepocephalus bairdii (Baird's smooth-head)
Alaska Sea otter (Enhydra lutris kenyoni)
Toxoplasma gondii
Leucoagaricus gongylophorus
Pinnotheres
Themisto gaudichaudii
Hyperiidae
You should see something like this:

Click on the column header
Names and choose
Reconcile →
Start reconciling.

A dialog will popup asking you to select a service.

If you've already added a service it will be in the list on the left. If not, click the
Add Standard Services... button at the bottom left and paste in the URL (in this case
http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php).
Once the service has loaded click on
Start Reconciling. Once it has finished you should see most of the names linked to uBio (click on a name to check this):

Sometimes there may be more than one possible match, in which case these will be listed in the cell. Once you have reconciled the data you may want to do something with the reconciliation. For example, if you want to get the ids for the names you've just matched you can create a new column based on the reconciliation. Click on the
Names column header and choose
Edit column →
Add column based on this column.... A dialog box will be displayed:

In the box labelled
Expression enter
cell.recon.match.id and give the column a name (e.g., "NamebankID"). You will now have a column of uBio NamebankIDs for the names:

You could also get the names uBio extracted by creating a column based on the values of
cell.recon.match.name. To compare this with the original values, click on the
Names column header and choose
Reconcile →
Actions →
Clear reconciliation data. Now you can see the original input names, and the string uBio extracted from each name:

These are some very simple ideas for using Google Refine with taxonomic name services. Obvious extensions would to use services that provide an "accepted name", or services that support approximate string matching so you could catch spelling mistakes (most of the services I've implemented here have some degree of support for these features).
Development notesThe code for these services is in
Github (undocumented as yet, that's on the to do list). I had a few hiccups getting these services to work. There is detailed documentation at
http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi, but this seems a little out of step with what actually happens. Based on the documentation I thought Google Refine called a reconciliation service using HTTP GET, but in fact it uses POST. Google Refine always called my reconciliation service using "Multiple Query Mode", which meant supporting this mode wasn't optional. Once these issues were sorted out (turning on the Java console as per
David Huynh's tip helped) things work pretty well.