Showing posts with label Google Refine. Show all posts

Wednesday, April 17, 2013

Reconciling author names using Open Refine and VIAF

In an earlier post I discussed using Open Refine (formerly Google Refine) to clean and reconcile taxon names. I've now added a service that reconciles author names using the Virtual International Authority File (VIAF) API. With this service we can match authors to VIAF identifiers (you may have noticed these appearing on people's pages in Wikipedia; for example, Mary J. Rathbun's Wikipedia page lists her VIAF as 61796012).

To use the service follow the instructions in the earlier post but add the service:

http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php

This service is fairly crude. In particular, I make no attempt to score the matches that VIAF returns, because this would require parsing and normalising author names. This could be added if needed. If you want some example names to try, here are some taxonomists:


George A Boulenger
G A Boulenger
Wilhelm Michaelsen
W Michaelsen
Colin Campbell Sanborn
Suzanne Hand
Philip Hershkovitz
Yehudah Leopold Werner
W B Spencer
Norman Platnick
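
The scoring step that the service currently skips could look something like the sketch below. It is only an illustration of "parsing and normalising author names", not code from the service: the `normalise` and `score` functions, the 0.5 weighting, and the candidate heading in the usage lines are all assumptions made for this example, and the heading is written in authority-file style rather than copied from real VIAF output.

```python
import re
import unicodedata


def normalise(name):
    """Return lower-case, accent-free name tokens in given-names-first order."""
    # Authority headings are often "Surname, Given names, dates".
    if "," in name:
        surname, given = name.split(",", 1)
        name = given + " " + surname
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep alphabetic tokens only, which drops punctuation and dates.
    return re.findall(r"[a-z]+", no_accents.lower())


def score(query, candidate):
    """Score in [0, 1]: 0 if surnames differ, else 0.5 plus credit for given names."""
    q, c = normalise(query), normalise(candidate)
    if not q or not c or q[-1] != c[-1]:
        return 0.0
    q_given, c_given = q[:-1], c[:-1]
    if not q_given:
        return 0.5  # surname only: weak evidence (arbitrary choice)
    matched = 0
    for qg, cg in zip(q_given, c_given):
        is_initial = len(qg) == 1 or len(cg) == 1
        if qg == cg or (is_initial and qg[0] == cg[0]):
            matched += 1
    return 0.5 + 0.5 * matched / len(q_given)


# Illustrative heading, not real VIAF output.
heading = "Boulenger, George Albert, 1858-1937"
full_name_score = score("George A Boulenger", heading)
initials_score = score("G A Boulenger", heading)
wrong_person_score = score("W Michaelsen", heading)
```

Treating an initial as a match for any given name with the same first letter is deliberately lenient, so "G A Boulenger" and "George A Boulenger" score the same against a full heading. A real implementation would probably also want to use dates where available.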

Monday, February 06, 2012

Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data

Google Refine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use Freebase reconciliation services, but you can also add external services. Inspired by this I've started to implement services to reconcile taxonomic names.

The services I've implemented so far are:
  • EOL http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
  • NCBI taxonomy http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ncbi.php
  • uBio FindIT http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php
  • WORMS http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_worms.php
  • GBIF http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_gbif.php
  • Global Names Index http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_globalnames.php


To use these you need to add the URLs above to Google Refine (see example below). The EOL, NCBI and WORMS services do a basic name lookup. The uBio FindIT service extracts a taxonomic name from a string, and can be viewed as a "taxonomic name cleaner".
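
For anyone curious what Refine sends to one of these URLs, here is a rough sketch of building the request by hand. It assumes the behaviour described in the development notes at the end of this post (an HTTP POST carrying a `queries` form parameter in "Multiple Query Mode"); the `build_request` helper and the `q0`, `q1` keys are names chosen for this example. The request is only constructed, not sent, since the service may not be available.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SERVICE = "http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php"


def build_request(names, service_url=SERVICE):
    """Build (but do not send) a multiple-query reconciliation request."""
    # One entry per name, keyed q0, q1, ... so replies can be paired with inputs.
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    body = urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    # Supplying a request body makes urllib use POST rather than GET.
    return Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_request(["Toxoplasma gondii", "Pinnotheres"])
# urllib.request.urlopen(request) would send it and return the JSON reply.
```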

How to use reconciliation services

Start a Google Refine session. Save the names below to a text file and open it as a new project.

Names
Achatina fulica (giant African snail)
Acromyrmex octospinosus ST040116-01
Alepocephalus bairdii (Baird's smooth-head)
Alaska Sea otter (Enhydra lutris kenyoni)
Toxoplasma gondii
Leucoagaricus gongylophorus
Pinnotheres
Themisto gaudichaudii
Hyperiidae


You should see something like this:
Refine1

Click on the column header Names and choose Reconcile → Start reconciling.

Refine2

A dialog will pop up asking you to select a service.

Refine3

If you've already added a service it will be in the list on the left. If not, click the Add Standard Services... button at the bottom left and paste in the URL (in this case http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php).

Once the service has loaded click on Start Reconciling. Once it has finished you should see most of the names linked to uBio (click on a name to check this):

Refine4

Sometimes there may be more than one possible match, in which case the candidates will be listed in the cell. Once you have reconciled the data you may want to do something with the reconciliation. For example, if you want to get the ids for the names you've just matched you can create a new column based on the reconciliation. Click on the Names column header and choose Edit column → Add column based on this column.... A dialog box will be displayed:

Refine6

In the box labelled Expression enter cell.recon.match.id and give the column a name (e.g., "NamebankID"). You will now have a column of uBio NamebankIDs for the names:

Refine7

You could also get the names uBio extracted by creating a column based on the values of cell.recon.match.name. To compare this with the original values, click on the Names column header and choose Reconcile → Actions → Clear reconciliation data. Now you can see the original input names, and the string uBio extracted from each name:

Refine8

These are some very simple ideas for using Google Refine with taxonomic name services. Obvious extensions would be to use services that provide an "accepted name", or services that support approximate string matching so you could catch spelling mistakes (most of the services I've implemented here have some degree of support for these features).

Development notes
The code for these services is on GitHub (undocumented as yet; that's on the to-do list). I had a few hiccups getting these services to work. There is detailed documentation at http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi, but this seems a little out of step with what actually happens. Based on the documentation I thought Google Refine called a reconciliation service using HTTP GET, but in fact it uses POST. Google Refine also always called my reconciliation service using "Multiple Query Mode", which meant supporting this mode wasn't optional. Once these issues were sorted out (turning on the Java console as per David Huynh's tip helped), things worked pretty well.
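
To make the "Multiple Query Mode" point concrete, here is a minimal sketch of the server side in Python (the actual services are PHP, and this is not their code). It assumes the POST body is form-encoded with a `queries` parameter holding a JSON object of keyed queries, and that the reply is a JSON object with a `result` list under each of the same keys. The `reconcile_batch` and `find` names, the toy lookup table, and the fixed score are invented for this example; the one identifier in the table is the VIAF id for Mary J. Rathbun quoted in the VIAF post.

```python
import json
from urllib.parse import parse_qs, urlencode

# Toy lookup table standing in for a real name database (hypothetical).
KNOWN = {"Mary J. Rathbun": "61796012"}


def find(name):
    """Return a list of (id, name) hits for an exact name match."""
    return [(KNOWN[name], name)] if name in KNOWN else []


def reconcile_batch(post_body, find_hits):
    """Answer a multiple-query POST body with one result list per query key."""
    form = parse_qs(post_body)
    # Raises KeyError if "queries" is absent; a fuller service would
    # also handle the single "query" parameter.
    queries = json.loads(form["queries"][0])
    response = {}
    for key, query in queries.items():
        hits = find_hits(query["query"])
        response[key] = {
            "result": [
                {
                    "id": hit_id,
                    "name": hit_name,
                    "type": [],
                    "score": 1.0,  # placeholder: no real scoring here
                    "match": len(hits) == 1,  # auto-match only if unambiguous
                }
                for hit_id, hit_name in hits
            ]
        }
    return json.dumps(response)


# Simulate the body Refine would POST for two cells.
body = urlencode({"queries": json.dumps({
    "q0": {"query": "Mary J. Rathbun"},
    "q1": {"query": "Unknown Author"},
})})
reply = json.loads(reconcile_batch(body, find))
```

The key detail is that every key in the incoming `queries` object gets an entry in the reply, even when there are no hits, so that Refine can pair each result list with the cell it came from.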