Wednesday, March 13, 2013

Bibliographic metadata pollution

I spend a lot of time searching the web for bibliographic metadata and links to digitised versions of publications. Sometimes I search Google and get nothing, sometimes I get the article I'm after, but often I get something like this:


If I search for "Die cestoden der Vogel" in Google, I get masses of hits for the same thing from multiple sources (e.g., Google Books, Amazon, and other booksellers). For this query we can happily click through pages and pages of results that are all, in some sense, the same thing. Sometimes I get similar results when searching for an article: multiple hits from sites with metadata on that article, but few, if any, with an actual link to the article itself.

One byproduct of putting bibliographic metadata on the web is that we are starting to pollute web space with repetitions of the same (or closely similar) metadata. This makes searching for definitive metadata difficult, never mind actually finding the content itself. In some cases we can use tools such as Google Scholar, which clusters multiple versions of the same reference, but Google Scholar is often poor for the kind of literature I am after (e.g., older taxonomic publications).

As Alan Ruttenberg (@alanruttenberg) points out, books would seem to be a case where Google could extend its knowledge graph and cluster the books together (using ISBNs, title matching, etc.). But in the meantime, if you think simply pumping out bibliographic metadata is a good thing, spare a thought for those of us trying to wade through the metadata soup looking for the "good stuff".
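
To make the clustering idea a little more concrete, here is a minimal sketch in Python of grouping records that share an ISBN or a normalised title. It is only an illustration, not a description of how Google or Google Scholar actually cluster results. The record layout (dicts with `title` and an optional `isbn`) and the function names are assumptions made up for this example, and the title matching is deliberately crude (exact match after stripping case, accents and punctuation).

```python
import re
import unicodedata
from collections import defaultdict


def normalise_title(title):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_marks.lower()).strip()


def normalise_isbn(isbn):
    """Keep only digits and the X check character, dropping hyphens and spaces."""
    return re.sub(r"[^0-9X]", "", isbn.upper())


def cluster_records(records):
    """Group records that share an ISBN or a normalised title.

    Hypothetical record format: {"title": str, "isbn": str or None, ...}.
    Uses a small union-find so that a record matching one cluster by title
    and another by ISBN merges the two clusters.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (kind, value) -> index of first record with that key
    for i, rec in enumerate(records):
        keys = [("title", normalise_title(rec.get("title") or ""))]
        if rec.get("isbn"):
            keys.append(("isbn", normalise_isbn(rec["isbn"])))
        for key in keys:
            if not key[1]:
                continue  # ignore empty titles or ISBNs
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    clusters = defaultdict(list)
    for i, rec in enumerate(records):
        clusters[find(i)].append(rec)
    return list(clusters.values())
```

Calling `cluster_records` on a list of hits from different sites returns a list of lists, one per cluster, so a search interface could show a single entry per book with the sources collapsed underneath. Real data would need fuzzier title matching and some care over different editions sharing a title.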