Sunday, January 18, 2009

Equivalent author names

One problem I've encountered in building a bibliographic database is the different ways author names are written. For example, for papers I've authored my name may be written as "Roderic D. M. Page" or "R. D. M. Page". Googling about this problem I came across Dror Feitelson's paper "On identifying name equivalences in digital libraries". Feitelson addresses the issue of matching first names:
The services provided by digital libraries can be much improved by correctly identifying variants of the same name. For example, this will allow for better retrieval of all the works by a certain author. We focus on variants caused by abbreviations of first names, and show that significant achievements are possible by simple lexical analysis and comparison of names. This is done in two steps: first a pairwise matching of names is performed, and then these are used to find cliques of equivalent names. However, these steps can each be performed in a variety of ways. We therefore conduct an experimental analysis using two real datasets to find which approaches actually work well in practice. Interestingly, this depends on the size of the repository, as larger repositories may have many more similar names.
Feitelson's solution is to construct a graph of similarity between first names, then find weighted cliques grouping equivalent names. For example, given the first names "Ace D. E.", "A. D.", "Abe F. G.", "Abe Bob C.", "A. B. C.", and "Abe B", we create the graph below where the edges are weighted by similarity between the names:

In this example, the names "Abe Bob C", "A B C", and "Abe B" are equivalent, as are "Ace D E" and "A D", leaving "Abe F G" by itself.
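To make the idea concrete, here is a minimal Python sketch of the two steps (pairwise matching, then grouping into cliques). It is not Feitelson's actual algorithm or the code behind the service: the scoring rule (1 for an identical token, 0.5 for an initial matching a full name, 0 for any conflict) and the greedy merge are assumptions chosen for illustration, and the function names `tokens`, `similarity`, and `group_names` are hypothetical.

```python
from itertools import combinations


def tokens(name):
    """Split a first-name string into lower-case tokens, dropping periods."""
    return [t.strip(".").lower() for t in name.split() if t.strip(".")]


def similarity(a, b):
    """Score two first-name strings in [0, 1]; 0 means they conflict.

    Assumed rule (illustrative only): tokens are compared position by
    position; identical tokens score 1, an initial matching a longer
    token scores 0.5, and any other difference makes the pair incompatible.
    """
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    score = 0.0
    for x, y in zip(ta, tb):
        if x == y:
            score += 1.0
        elif (len(x) == 1 or len(y) == 1) and x[0] == y[0]:
            score += 0.5
        else:
            return 0.0
    return score / max(len(ta), len(tb))


def group_names(names):
    """Greedily merge names into groups that are cliques in the match graph."""
    names = list(dict.fromkeys(names))  # drop duplicates, keep order
    weight = {}
    for a, b in combinations(names, 2):
        s = similarity(a, b)
        if s > 0:
            weight[frozenset((a, b))] = s

    cluster_of = {n: frozenset([n]) for n in names}
    # Visit the strongest edges first; merge two clusters only if every
    # cross pair is connected, so each cluster stays a clique.
    for pair, _ in sorted(weight.items(), key=lambda kv: -kv[1]):
        a, b = tuple(pair)
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue
        if all(frozenset((x, y)) in weight for x in ca for y in cb):
            merged = ca | cb
            for n in merged:
                cluster_of[n] = merged

    groups = []
    for n in names:  # report groups in order of first appearance
        if cluster_of[n] not in groups:
            groups.append(cluster_of[n])
    return [sorted(g) for g in groups]


if __name__ == "__main__":
    example = ["Ace D. E.", "A. D.", "Abe F. G.", "Abe Bob C.", "A. B. C.", "Abe B"]
    for group in group_names(example):
        print(group)
```

The clique requirement is what stops a lone initial from gluing unrelated names together: "A." is compatible with both "Abe" and "Ace", but "Abe" and "Ace" conflict, so the three can never end up in one group.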

I've implemented Feitelson's weighted clique algorithm in a PHP script that calls a C++ program to do the clique analysis. Results can be returned in HTML or JSON. You can try the service at You can also call the service directly with an HTTP POST request to the URL with these parameters:

names (string): List of first names, separated by an end-of-line (\n) character
format (html or json): Format of the results
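As a rough illustration of calling the service from a script, the sketch below builds the POST request with Python's standard library. The two parameter names come from the list above; everything else is an assumption: `SERVICE_URL` is a made-up placeholder (substitute the real service URL), the helper names are hypothetical, and the response is assumed to be UTF-8 text.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical placeholder, NOT the real address of the service.
SERVICE_URL = "http://example.org/equivalent-names"


def build_request(first_names, fmt="json", url=SERVICE_URL):
    """Build (but do not send) a form-encoded POST request.

    `names` is the list of first names joined by newlines and
    `format` is either "html" or "json".
    """
    if fmt not in ("html", "json"):
        raise ValueError("format must be 'html' or 'json'")
    body = urlencode({"names": "\n".join(first_names), "format": fmt})
    return Request(url, data=body.encode("ascii"), method="POST")


def fetch(first_names, fmt="json", url=SERVICE_URL):
    """Send the request and return the response body as text (assumes UTF-8)."""
    with urlopen(build_request(first_names, fmt, url)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    request = build_request(["Mark", "M."])
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
```

Building the request separately from sending it makes it easy to inspect the encoded body (the newline between names becomes %0A) before anything goes over the network.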


Anonymous said...

Very nice! Not to spoil the achievement, but isn't the real problem the false positives? For example, there's a clinician with the same last name and first initial as mine who publishes about 10x more than I do :) For a given citation (almost all of which only have first initial and last name), how do you tell which of the two it is? The Feitelson algorithm clusters this into 3 authors, which is wrong in two ways.

Roderic Page said...

I suspect that the extent to which false positives are a problem depends on scope. Within PubMed it will be a bigger problem than within, say, just the phylogenetic literature. It might be nice to have a service that could quantify the likelihood that the same name refers to different people (perhaps based on patterns of coauthorship, dates, and journals).

Anonymous said...

You could also take a look at how they solved this problem in the music database,
They had to solve it both for artist names and for track names.

Mark Wilden said...

I'm getting back good results with JSON, but not with HTML.

curl -X POST -d "names=Mark%0AM." -d "format=json" ''

works fine, but

curl -X POST -d "names=Mark%0AM." -d "format=html" ''


well, I don't know how to post HTML on Blogger.

Roderic Page said...

Thanks for spotting this, something weird was going on with PHP's handling of the output from the clustering program. Fixed now.

Mark Wilden said...

Works great now!