Thursday, December 23, 2010

BHL and OCR

Some quick notes on OCR. Revisiting my DjVu viewer experiments, it really struck me how "dirty" the OCR text is. It's readable, but if we were to display the OCR text rather than the images, it would be a little off-putting. For example, in the paper "A new fat little frog (Leptodactylidae: Eleutherodactylus) from lofty Andean grasslands of southern Ecuador" (http://biostor.org/reference/229) there are 15 different variations of the frog genus Eleutherodactylus:

  • Eleutherodactylus
  • Eleutheroclactylus
  • Eleuthewdactyliis
  • Eleiitherodactylus
  • Eleuthewdactylus
  • Eleuthewdactylus
  • Eleutherodactyliis
  • Eleutherockictylus
  • Eleutlierodactylus
  • Eleuthewdactyhts
  • Eleiithewdactylus
  • Eleutherodactyhis
  • Eleiithemdactylus
  • Eleuthemdactylus
  • Eleuthewdactyhis

Of course, this is a recognised problem. Wei et al., in "Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL)" (hdl:2142/14919), found that 35% of names in BHL OCR contained at least one wrong character. They compared the performance of two taxonomic name finding tools on BHL OCR (uBio's taxonFinder and FAT), neither of which did terribly well. Wei et al. found that the type of page can influence the success of these algorithms, and suggested that automatically classifying pages into different categories would improve performance.

Personally, it seems to me that this is not the way forward. It's pretty obvious looking at the versions of "Eleutherodactylus" above that there are recognisable patterns in the OCR errors (e.g., "u" becoming "ii", "ro" becoming "w", etc.). After reading Peter Norvig's elegant little essay How to Write a Spelling Corrector, I suspect the way to improve the finding of taxonomic names is to build a "spelling corrector" for names. Central to this would be building a probabilistic model of the different OCR errors (such as "u" → "ii"), and using that model to create a set of candidate taxonomic names the OCR string might actually be (the equivalent of Google's "did you mean", which is the subject of Norvig's essay). I had hoped to avoid doing this by using an existing tool, such as Tony Rees' TAXAMATCH, but it's a website rather than a web service, and it is just too slow.
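To make the idea concrete, here is a minimal Python sketch of such a corrector. It is an illustration only, not code from this project: the confusion table OCR_ERRORS, its probabilities, and the function names candidates and correct are all hypothetical. The error pairs are read off the list of misspellings above, but the numbers are made up. Unlike Norvig's corrector, the sketch ignores how common each name is and scores a candidate only by the OCR errors needed to explain the string.

```python
from math import log

# Hypothetical confusion table: (true text, OCR text) -> assumed probability.
# The pairs are taken from the misspellings listed above; the numbers are invented.
OCR_ERRORS = {
    ("u", "ii"): 0.05,
    ("ro", "w"): 0.05,
    ("d", "cl"): 0.02,
    ("h", "li"): 0.02,
}


def candidates(ocr, max_edits=2):
    """Return {string: log-probability} for every string reachable by
    undoing at most max_edits OCR errors from the table."""
    best = {ocr: 0.0}
    frontier = {ocr: 0.0}
    for _ in range(max_edits):
        next_frontier = {}
        for s, score in frontier.items():
            for (true, seen), p in OCR_ERRORS.items():
                start = s.find(seen)
                while start != -1:
                    # Undo one error: replace the OCR text with the true text.
                    fixed = s[:start] + true + s[start + len(seen):]
                    new_score = score + log(p)
                    if new_score > best.get(fixed, float("-inf")):
                        best[fixed] = new_score
                        next_frontier[fixed] = new_score
                    start = s.find(seen, start + 1)
        frontier = next_frontier
    return best


def correct(ocr, known_names, max_edits=2):
    """Return the most probable known name for an OCR string, or None."""
    found = {s: sc for s, sc in candidates(ocr, max_edits).items()
             if s in known_names}
    return max(found, key=found.get) if found else None
```

With this table, correct("Eleuthewdactyliis", {"Eleutherodactylus"}) returns "Eleutherodactylus" by undoing "w" → "ro" and "ii" → "u". A real version would need a proper list of names to check against and probabilities estimated from data rather than guessed.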

I've started doing some background reading on the topic of spelling correction and OCR, and I've created a group on Mendeley called OCR - Optical Character Recognition to bring these papers together. I'm also fussing with some simple code to find misspellings of a given taxonomic name in BHL text and use the Needleman–Wunsch sequence alignment algorithm to align those misspellings to the correct name. From the alignments I can extract the various OCR errors, building a matrix of the probabilities of the various transformations of the original text into OCR text.
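The alignment step might look something like the following Python sketch. Again, this is an assumption-laden illustration rather than the actual code: the scoring values (match 1, mismatch -1, gap -1) and the function names align and ocr_errors are hypothetical choices. It aligns the correct name with an OCR string using Needleman–Wunsch, then merges each run of non-matching columns into a single (true text, OCR text) pair, so that "u" read as "ii" is counted as one error rather than a substitution plus an insertion.

```python
from collections import Counter


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of a and b.
    Returns a list of (char_a, char_b) columns; "" marks a gap."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    columns = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            columns.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((a[i - 1], ""))
            i -= 1
        else:
            columns.append(("", b[j - 1]))
            j -= 1
    columns.reverse()
    return columns


def ocr_errors(true, ocr):
    """Count (true text, OCR text) pairs, merging adjacent mismatched columns."""
    errors = Counter()
    t_chunk = o_chunk = ""
    for t, o in align(true, ocr):
        if t == o:
            if t_chunk or o_chunk:
                errors[(t_chunk, o_chunk)] += 1
                t_chunk = o_chunk = ""
        else:
            t_chunk += t
            o_chunk += o
    if t_chunk or o_chunk:
        errors[(t_chunk, o_chunk)] += 1
    return errors
```

For example, ocr_errors("Eleutherodactylus", "Eleuthewdactyliis") yields one count each for ("ro", "w") and ("u", "ii"). Summing these counters over many misspellings, and dividing by how often each piece of true text occurs, would give rough estimates for the probability matrix.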

One use for this spelling correction would be in an interactive BHL viewer. In addition to showing the taxonomic names that uBio's taxonFinder has located in the text, we could flag strings that could be misspelt taxonomic names (such as "Eleutherockictylus") and provide an easy way for the user to either accept or reject that name. If we are going to invite people to help clean up BHL text, it would be nice to provide hints as to what the correct answer might be.