iPhylo: citation matching

Roderic D. M. Page


Friday, June 06, 2014

Finding citations using full text search

Note to self on citation matching.

Looking for the paper "Fishes of the Marshall and Marianas islands. Vol. I. Families from Asymmetrontidae through Siganidae", I Googled it, adding "biostor" as a search term to see if I'd already added it to BioStor. The Google search:

https://www.google.co.uk/?gws_rd=ssl#q=Fishes+of+the+Marshall+and+Marianas+islands.+Vol.+I.+Families+from+Asymmetrontidae+through+Siganidae+biostor

found several hits in BioStor:

[Screenshot: Google search results]

What is interesting is that these hits are to the full text of references that cite the article I'm after, not the article itself. I'm sure many have had this experience, where you are searching for an obscure article and you keep finding papers that cite it, rather than the actual paper you're after. But this suggests another strategy for building the citation graph for an article. If you have a decent corpus of full text articles, search for the article (using, say, title, journal, and pagination) in the text of those articles and store the hits. Those are the references that cite the article (OK, not all, but some of them). This may be a more attractive way of building the citation graph than parsing citations in articles and trying to locate them. Indeed, it could be extended to help mark up those citations. Imagine grabbing blocks of text from near the end of an article, searching for those in a database of citations, and using close matches to flag the corresponding block as a citation.
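To make the idea concrete, here is a minimal sketch of searching a corpus for a title and keeping the hits. The `normalize` and `find_citing_articles` helpers and the tiny `corpus` are made up for illustration; a real corpus would sit behind a proper full text index rather than a Python dictionary, and the target article itself would need to be excluded from the hits.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def find_citing_articles(target_title, corpus):
    """Return ids of articles whose full text mentions the target title."""
    needle = normalize(target_title)
    return [
        article_id
        for article_id, full_text in corpus.items()
        if needle in normalize(full_text)
    ]


TITLE = ("Fishes of the Marshall and Marianas islands. Vol. I. "
         "Families from Asymmetrontidae through Siganidae")

# Hypothetical full text, invented for illustration only.
corpus = {
    "article-1": ("Literature cited ... Fishes of the Marshall and\n"
                  "Marianas Islands, Vol. I: families from "
                  "Asymmetrontidae through Siganidae ..."),
    "article-2": "A paper about something else entirely, with no matching reference.",
}

# Each hit is an edge in the citation graph: hit -> target article.
citing = find_citing_articles(TITLE, corpus)
print(citing)
```

Normalising both sides means line breaks, capitalisation, and punctuation differences in the citing text don't prevent a hit, though OCR errors inside words still would.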

Need to think about this a little more...

Update

@CameronNeylon @rdmpage Your take reminds me of http://t.co/lo3n1q4XeD I attended ICDMW 2011 where he had this paper.
— Tuija Sonkkila (@ttso) June 8, 2014

The paper is:

Polepeddi, L., Agrawal, A., & Choudhary, A. (n.d.). Poll: A Citation Text Based System for Identifying High-Impact Contributions of an Article. 2011 IEEE 11th International Conference on Data Mining Workshops. IEEE. doi:10.1109/icdmw.2011.136

Monday, October 22, 2012

Resolving free-form citations

We don't need no stinkin' parser- a guide to resolving free-form citations with #CrossRef Metadata Search -> goo.gl/f9a4e
— Geoffrey Bilder (@gbilder) October 18, 2012

CrossRef have released CrossRef Metadata Search, a nice tool that can take a free-form citation and return possible matches from CrossRef's database. If you get a match, CrossRef can take the DOI and format it for you in a variety of styles using DOI content negotiation.

If, like me, you spend a lot of time trying to find DOIs (and other identifiers) for articles by first parsing citations into their component parts, then this is good news. It's also good news for publishers that may balk at one of CrossRef's requirements for joining its club: if you want DOIs for your articles it's not enough to submit metadata for your article, you also need to submit the list of references that article cites, including their DOIs. This requirement enables CrossRef to offer their "cited by" service, but imposes a burden on smaller journals operating on a tight budget (e.g., Zootaxa). With CrossRef Metadata Search you can just send author-supplied citation strings from the manuscript and have a good chance of finding the corresponding DOI, if it exists.
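As a rough sketch of what "send the citation string, get a DOI, format it" could look like in code: the example below assumes the present-day Crossref REST API (`https://api.crossref.org/works` with the `query.bibliographic` parameter) rather than the 2012 Metadata Search tool itself, and the function names are mine. The request builders don't touch the network; only `best_doi` does.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the current Crossref REST API, not the 2012 search tool.
API = "https://api.crossref.org/works"


def search_request(citation, rows=3):
    """Build a request that searches Crossref for a free-form citation."""
    query = urllib.parse.urlencode({"query.bibliographic": citation, "rows": rows})
    return urllib.request.Request(f"{API}?{query}",
                                  headers={"Accept": "application/json"})


def format_request(doi, style="apa"):
    """Build a DOI content negotiation request for a formatted citation."""
    return urllib.request.Request(
        "https://doi.org/" + urllib.parse.quote(doi),
        headers={"Accept": f"text/x-bibliography; style={style}"},
    )


def best_doi(citation):
    """Network call: return the DOI of the top hit, or None if no hits."""
    with urllib.request.urlopen(search_request(citation, rows=1)) as response:
        items = json.load(response)["message"]["items"]
    return items[0]["DOI"] if items else None


request = search_request(
    "Polepeddi, L., Agrawal, A., & Choudhary, A. Poll: A Citation Text Based "
    "System for Identifying High-Impact Contributions of an Article"
)
print(request.full_url)
```

The top hit is not guaranteed to be the right article, so in practice you would want to check the returned metadata (or a relevance score) before accepting the DOI.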

Of course, the service only works if the article has a DOI, so it's not a complete solution to being able to parse bibliographic citations into their component parts. But it's a nice model, and I'm tempted to apply the same approach to my databases, such as BioStor or my ever growing Mendeley library (which is larger than the Mendeley desktop client can easily handle). A quick way to do this would be to use Cloudant which has cloud-based CouchDB coupled with a Lucene-based fulltext search engine. If I've time I may try and put a demo together.

Tuesday, September 13, 2011

Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).

Citation::Multi::Parser is a group in Computer and Information Science on Mendeley.

To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone: the citation might not exactly follow the rules, or it might contain errors (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.
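For illustration, here is the kind of regular expression I mean, written for just one citation style ("Authors (Year). Title. Journal Volume: Pages."). The pattern, the `parse_citation` helper, and the example citation are all invented for this sketch; it is not the code BioStor uses.

```python
import re

# One pattern per citation style; this one expects
# "Authors (Year). Title. Journal Volume: Start-End."
CITATION = re.compile(
    r"^(?P<authors>.+?)\s+\(?(?P<year>[12]\d{3})\)?\.?\s+"  # authors, then a 4-digit year
    r"(?P<title>.+?)\.\s+"                                   # title ends at the first ". "
    r"(?P<journal>[^.\d][^\d]*?),?\s+"                       # journal name, no digits
    r"(?P<volume>\d+)\s*(?:\(\d+\))?\s*[:,]\s*"              # volume, optional (issue)
    r"(?P<start>\d+)\s*[-–]\s*(?P<end>\d+)\.?$"              # page range
)


def parse_citation(text):
    """Return the citation's parts as a dict, or None if the rules don't fit."""
    match = CITATION.match(text.strip())
    return match.groupdict() if match else None


# Hypothetical citation, invented for illustration.
good = ("Smith, J. A. (2001). A revision of the genus Examplea. "
        "Journal of Hypothetical Zoology 12: 34-56.")
# Same citation with an OCR-style error: letter O instead of zero in the year.
bad = good.replace("2001", "2OO1")

print(parse_citation(good))
print(parse_citation(bad))
```

The second call returns `None`: a single OCR error in the year is enough to defeat the pattern, which is exactly the fragility described above.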

It occurs to me another way of doing this would be the following:

Assume, for argument's sake, we have a database of most of the references we are likely to encounter.
Using the most common citation styles, generate a set of possible citations for each reference.
Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.
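The three steps above can be sketched in a few lines. Everything here is an assumption made for illustration: the three style templates stand in for "the most common citation styles", the two-record `database` is invented, the 0.8 threshold is arbitrary, and Python's `difflib.SequenceMatcher` is just one convenient choice of approximate string matcher.

```python
from difflib import SequenceMatcher

# Stand-ins for common citation styles (illustrative, not exhaustive).
STYLES = [
    "{authors} ({year}). {title}. {journal}, {volume}, {pages}.",
    "{authors} {year}. {title}. {journal} {volume}: {pages}.",
    "{authors}, {year}. {title}, {journal}, vol. {volume}, pp. {pages}",
]

# Hypothetical reference database.
database = {
    "ref-1": {"authors": "Smith, J. A.", "year": 2001,
              "title": "A revision of the genus Examplea",
              "journal": "Journal of Hypothetical Zoology",
              "volume": 12, "pages": "34-56"},
    "ref-2": {"authors": "Jones, B.", "year": 1987,
              "title": "Notes on imaginary beetles",
              "journal": "Bulletin of the Imaginary Museum",
              "volume": 3, "pages": "1-10"},
}


def candidate_strings(ref):
    """Step 2: render one reference in every known style."""
    return [style.format(**ref) for style in STYLES]


def similarity(a, b):
    """Case-insensitive similarity in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_citation(query, database, threshold=0.8):
    """Step 3: return (reference id or None, best score) for a citation string."""
    best_id, best_score = None, 0.0
    for ref_id, ref in database.items():
        for candidate in candidate_strings(ref):
            score = similarity(query, candidate)
            if score > best_score:
                best_id, best_score = ref_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# A citation string with a missing space and an OCR-style typo.
query = ("Smith, J.A. 2001. A revlsion of the genus Examplea. "
         "Journal of Hypothetical Zoology 12: 34-56")
print(match_citation(query, database))
```

This brute-force loop compares the query against every candidate, which won't scale to a large database; a real version would need an index (n-grams, say) to shortlist candidates before scoring them.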

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.
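For microcitations the "universe" is small enough to enumerate: one string per page per format. Here's a hedged sketch, where the three `TEMPLATES`, the journal abbreviations, and the reference records are all invented, and `difflib.get_close_matches` does the string matching.

```python
from difflib import get_close_matches

# Assumed microcitation formats (illustrative only).
TEMPLATES = [
    "{abbrev} {volume}: {page}",
    "{abbrev}, {volume}, p. {page}",
    "{abbrev} {volume}, {page}",
]

# Hypothetical references with abbreviated journal names.
references = {
    "ref-1": {"abbrev": "J. Hyp. Zool.", "volume": 12,
              "first_page": 34, "last_page": 56},
    "ref-2": {"abbrev": "Bull. Imag. Mus.", "volume": 3,
              "first_page": 1, "last_page": 10},
}


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def build_index(references):
    """Map every expected microcitation string to (reference id, page)."""
    index = {}
    for ref_id, ref in references.items():
        for page in range(ref["first_page"], ref["last_page"] + 1):
            for template in TEMPLATES:
                key = normalize(template.format(
                    abbrev=ref["abbrev"], volume=ref["volume"], page=page))
                index[key] = (ref_id, page)
    return index


def resolve(microcitation, index, cutoff=0.8):
    """Return (reference id, page) for the closest expected string, or None."""
    hits = get_close_matches(normalize(microcitation), list(index),
                             n=1, cutoff=cutoff)
    return index[hits[0]] if hits else None


index = build_index(references)
print(resolve("J. Hyp. Zool. 12: 41", index))
print(resolve("J Hyp Zool 12:41", index))
```

Because neighbouring pages differ by only a character or two, a noisy microcitation can land on the wrong page within the right article, so the reference id is more trustworthy than the page when the match isn't exact.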

Might not be elegant, but I suspect it would be fast.