Monday, September 28, 2009

Google Scholar metadata quality and Mendeley hype

Hot on the heels of Geoffrey Nunberg's essay about the train wreck that is Google Books metadata (see my earlier post) comes "Google Scholar’s Ghost Authors, Lost Authors, and Other Problems" by Péter Jacsó. It's a fairly scathing look at some of the problems with the quality of Google Scholar's metadata.

Now, Google Scholar isn't perfect, but it's come to play a key role in a variety of bibliographic tools, such as Mendeley and Papers. These tools do a delicate dance with Google, which, strictly speaking, doesn't want anybody scraping its content. There's no API, so Mendeley, Papers (and my own iSpecies) have to keep up with the HTML tweaks that Google introduces, pretend to be web browsers, fuss with cookies, and try to keep the rate of queries below the level at which the Google monster stirs and slaps them down.
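To give a flavour of the "keep the rate of queries down" part of that dance, here is a minimal sketch of a client-side throttle in Python. It is purely illustrative: the `Throttle` class, its parameter names, and the five-second interval in the usage note are my own assumptions, not how Mendeley, Papers, or iSpecies actually do it, and nothing here is tied to any particular site.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait().

    Hypothetical helper for illustration only. The clock and sleep
    functions are injectable so the behaviour can be checked without
    really sleeping.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self):
        """Pause if the previous call was too recent; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            # How much of the minimum interval is still outstanding?
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A client would create one `Throttle(5.0)` and call `wait()` before every request, so that bursts of lookups get spread out rather than arriving all at once.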

Jacsó's critique also misses the main point. Why do we have free (albeit closed) tools like Google Scholar in the first place? It's largely because scientists have ceded the field of citation analysis to commercial companies, such as Elsevier and Thomson Reuters. To echo Martin Kalfatovic's comment:
Over the years, we've (librarians and the user community) have allowed an important class of metadata - specifically the article level metadata - migrate to for profit entities.
Some visionaries, such as Robert Cameron in his A Universal Citation Database as a Catalyst for Reform in Scholarly Communication, argued for free, open citation databases, but this came to nought.

For me, this is the one thing the ridiculously over-hyped Mendeley could do that would merit the degree of media attention it is getting -- be the basis of an open citation database. It would need massive improvement to its metadata extraction algorithms, which currently suck (Google Scholar's, for all Jacsó's complaints, are much better), but it would generate something of lasting value.