The services provided by digital libraries can be much improved by correctly identifying variants of the same name. For example, this will allow for better retrieval of all the works by a certain author. We focus on variants caused by abbreviations of first names, and show that significant achievements are possible by simple lexical analysis and comparison of names. This is done in two steps: first a pairwise matching of names is performed, and then these are used to find cliques of equivalent names. However, these steps can each be performed in a variety of ways. We therefore conduct an experimental analysis using two real datasets to find which approaches actually work well in practice. Interestingly, this depends on the size of the repository, as larger repositories may have many more similar names.Feitelson's solution is to construct a graph of similarity between first names, then find weighted cliques grouping equivalent names. For example, given the first names "Ace D. E.", "A. D.", "Abe F. G.", "Abe Bob C.", "A. B. C.", and "Abe B", we create the graph below where the edges are weighted by similarity between the names:
In this example, the names "Abe Bob C", "A B C", and "Abe B" are equivalent, as are "Ace D E" and "A D", leaving "Abe F G" by itself.
I've implemented Feitelson's weighted clique algorithm in a PHP script that calls a C++ program that does the clique analysis. Results can be returned in HTML or JSON. You can try the service at http://bioguid.info/services/. You can also call the service directly by a HTTP POST request to the URL http://bioguid.info/services/equivalent.php with these parameters:
|names||string||List of first names, separated by end of line (\n) character|
|format||html or json||Format of the results|