Monday, May 10, 2010

Next steps for BioStor: citation matching

Thinking about next steps for my BioStor project, one thing I keep coming back to is the problem of how to dramatically scale up the task of finding taxonomic literature online. While I personally find it oddly therapeutic to spend a little time copying and pasting citations into BioStor's OpenURL resolver and trying to find these references in BHL, we need something a little more powerful.

One approach is to harvest as many bibliographies as possible, and extract citations. These citations can come from online bibliographies, as well as lists of literature cited extracted from published papers. By default, these would be treated as strings. If we can parse them to extract metadata (such as title, journal, author, year), that's great, but this is often unreliable. We'd then cluster strings into sets that are similar. If any one of the strings in a cluster is associated with an identifier (such as a DOI), or if one of the strings in the cluster has been successfully parsed into its component metadata so we can find it using an OpenURL resolver, then we've identified the reference that all the strings in the cluster correspond to. Of course, we can seed the clusters with "known" citation strings. For citations for which we have DOIs/handles/PMIDs/BHL/BioStor URIs, we generate some standard citation strings and add these to the set of strings to be clustered.
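To make the idea concrete, here is a minimal sketch of clustering citation strings and letting an identifier on one string label the whole cluster. Everything in it is an assumption for illustration: the citations and the DOI are invented, and token-set Jaccard similarity with single-link merging is just one simple choice of similarity measure and clustering method, not something BioStor does.

```python
import re
from itertools import combinations

def tokens(citation):
    """Lowercased word/number tokens; a crude stand-in for real citation parsing."""
    return set(re.findall(r"[a-z0-9]+", citation.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(citations, threshold=0.6):
    """Single-link clustering: any pair with similarity >= threshold is merged."""
    parent = list(range(len(citations)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    toks = [tokens(c) for c in citations]
    for i, j in combinations(range(len(citations)), 2):
        if jaccard(toks[i], toks[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(citations)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def identify(clusters, known_ids):
    """Give each cluster the identifier of its members, if exactly one is known."""
    result = {}
    for members in clusters:
        ids = {known_ids[i] for i in members if i in known_ids}
        result[tuple(members)] = ids.pop() if len(ids) == 1 else None
    return result

# Invented example data: two renderings of one (made-up) paper, plus another paper.
citations = [
    "Smith, J. (1901) A new species of frog from Borneo. Journal of Examples 12: 34-56",
    "Smith J. 1901. A new species of frog from Borneo. J. Examples 12:34-56",
    "Jones, A. (1950) Notes on beetles of Peru. Example Bulletin 3: 1-10",
]
known_ids = {0: "doi:10.9999/example.1"}  # hypothetical "seed" identifier

clusters = cluster(citations)
labelled = identify(clusters, known_ids)
```

Here the second string has no identifier of its own, but it inherits the (made-up) DOI from the seeded string it clusters with. Comparing every pair of strings is quadratic, so this only works for small sets.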

We could then provide a simple tool for users to find a reference online: paste in a citation string, the tool would find the cluster of strings the user's string most closely resembles, then return the identifier (if any) for that cluster (and, of course, we could make this a web service to automate processing entire bibliographies at a time).
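A rough sketch of what such a lookup could look like, assuming the clusters have already been built and stored. The `CLUSTERS` structure, the `resolve` function, the `min_score` cutoff and the example data are all hypothetical, and token-set Jaccard similarity is again just a placeholder for whatever measure ends up working best.

```python
import re

def tokens(s):
    """Lowercased word/number tokens of a citation string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical pre-built clusters; the citations and the DOI are invented.
CLUSTERS = [
    {
        "identifier": "doi:10.9999/example.1",
        "strings": [
            "Smith, J. (1901) A new species of frog from Borneo. Journal of Examples 12: 34-56",
            "Smith J. 1901. A new species of frog from Borneo. J. Examples 12:34-56",
        ],
    },
    {
        "identifier": None,  # a cluster nobody has identified yet
        "strings": ["Jones, A. (1950) Notes on beetles of Peru. Example Bulletin 3: 1-10"],
    },
]

def resolve(query, clusters, min_score=0.5):
    """Return the identifier of the cluster most similar to the query, or None."""
    q = tokens(query)
    best_score, best = 0.0, None
    for c in clusters:
        # A cluster scores as well as its best-matching member string.
        score = max(jaccard(q, tokens(s)) for s in c["strings"])
        if score > best_score:
            best_score, best = score, c
    if best is None or best_score < min_score:
        return None  # nothing resembles the query closely enough
    return best["identifier"]

hit = resolve("Smith 1901 A new species of frog from Borneo J Examples 12 34-56", CLUSTERS)
miss = resolve("Completely unrelated text", CLUSTERS)
```

Note that `None` can mean either "no cluster matched" or "the best cluster has no identifier yet"; a real service would want to distinguish the two, and wrapping `resolve` in a loop over a whole bibliography gives the batch version.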

I've been collecting some references on citation matching (bookmarked on Connotea using the tag "matching") related to this problem. One I'd like to highlight is "Efficient clustering of high-dimensional data sets with application to reference matching" (doi:10.1145/347090.347123, PDF here). The idea is that a large set of citation strings (or, indeed, any strings) can first be quickly clustered into subsets ("canopies"), within which we search more thoroughly:
canopy.png
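As a sketch of the canopy step as I understand it from the paper: pick a point as a canopy centre, put every point within a loose threshold T1 of it into the canopy, remove every point within a tight threshold T2 from the list of candidate centres, and repeat until no candidates remain. Points can end up in more than one canopy. The assumptions here are mine: the cheap distance is 1 minus token-set Jaccard similarity, the centre is always the first remaining point rather than a random one, and the citations and thresholds are made up.

```python
import re

def tokens(s):
    """Lowercased word/number tokens of a citation string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def cheap_distance(a, b):
    """1 - Jaccard similarity of two token sets; a stand-in for any cheap metric."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)

def canopies(items, distance, t1, t2):
    """Group item indices into overlapping canopies using loose (t1) and tight (t2) thresholds."""
    if not (t1 >= t2 >= 0):
        raise ValueError("need t1 >= t2 >= 0")
    remaining = list(range(len(items)))  # candidate canopy centres
    result = []
    while remaining:
        center = remaining[0]  # deterministic choice; random selection is the other option
        # Everything within the loose threshold joins this canopy.
        canopy = [i for i in range(len(items))
                  if distance(items[center], items[i]) <= t1]
        result.append(canopy)
        # Everything within the tight threshold can no longer be a centre.
        remaining = [i for i in remaining
                     if distance(items[center], items[i]) > t2]
    return result

# Invented example data.
citations = [
    "Smith, J. (1901) A new species of frog from Borneo. Journal of Examples 12: 34-56",
    "Smith J. 1901. A new species of frog from Borneo. J. Examples 12:34-56",
    "Jones, A. (1950) Notes on beetles of Peru. Example Bulletin 3: 1-10",
]
token_sets = [tokens(c) for c in citations]
groups = canopies(token_sets, cheap_distance, t1=0.8, t2=0.5)
```

The more expensive comparison (edit distance, parsed metadata, and so on) then only needs to be run between strings that share a canopy, rather than between every pair.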
When I get the chance I need to explore some clustering methods in more detail. One that appeals is the MCL algorithm, which I came across a while ago by reading PG Tips: developments at Postgenomic (where it is used to cluster blog posts about the same article). Much to do...