Monday, May 21, 2007

iTunes and citation metadata.

Stumbled across a really nice paper (Why can't I manage academic papers like MP3s?) while reading commentary on Tim O'Reilly's post about FreeBase.

In response to Danny Ayers's post, Tim O'Reilly wrote:
I think you miss my point. I wasn't using centralized vs. decentralized as the point of contrast. I was using pre-specified vs. post-specified.

Now, I know you are much closer to all the sem web discussions than I am, and I probably have mischaracterized them somewhat. But you have to ask why are they so widely mischaracterized? There's some fire to go with that smoke.

In a very different context, on a mailing list, Bill Janssen wrote something very apposite:

"Let me recommend the paper "Why can't I manage academic papers like MP3s?", (yes, I realize accepted standards say I should put that comma inside the quotation marks) by James Howison and Abby Goodrum, at The basic thesis is that our common document formats weren't designed for use with digital repositories, and that metadata standards are often designed for the use of librarians and publishers, who have different metadata concerns than end-users have."

That's the distinction between the Semantic Web and Web 2.0 that I was trying to get at.

Howison and Goodrum make some interesting points, especially about how easy it is to get (or create) metadata for a CD compared with handling academic literature. Charlie Rapple on All My Eye suggests that:
While the authors' [Howison and Goodrum] pains have to an extent been resolved since, by online reference management/bookmarking tools such as Connotea or CiteULike (which both launched later that year), and by the increase in XML as a format for online articles (which unites the full text and metadata in one file), their issues with full text availability remain.

I think the pain is still there, especially as Connotea relies on articles having a DOI (or some other URI or identifier). Many articles don't have DOIs. Furthermore, a paper will often have a DOI that is not printed on the article itself (either the hard copy or the PDF). This is obviously true for articles that were published before DOIs came into existence but have since been assigned them; however, it is also the case for some recent articles. This means we need to use metadata about the article to try to find a DOI. In contrast, programs like iTunes use databases such as Gracenote CDDB to retrieve metadata for a CD, where the CD's identity is computed from information on the CD itself (i.e., the track lengths). The identifier is computed from the object at hand.
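
To make "computed from the object at hand" concrete, here is a minimal sketch of the classic CDDB/freedb-style disc ID as it is commonly described: a digit-sum checksum over the track start times, the playing time in seconds, and the track count, packed into 32 bits. The function names and the sample offsets are my own illustration, and I am assuming offsets are given in CD frames (75 per second); treat it as a sketch rather than a reference implementation of what Gracenote does today.

```python
FRAMES_PER_SECOND = 75  # CD audio is addressed in frames, 75 per second


def digit_sum(n):
    """Sum of the decimal digits of n (e.g. 202 -> 4)."""
    return sum(int(d) for d in str(n))


def cddb_disc_id(track_offsets, leadout_offset):
    """Return a CDDB-style 8-hex-digit disc ID.

    track_offsets  -- start of each track, in frames (assumed input format)
    leadout_offset -- start of the lead-out, in frames
    """
    starts = [off // FRAMES_PER_SECOND for off in track_offsets]
    checksum = sum(digit_sum(s) for s in starts)
    # Playing time: lead-out minus the start of the first track, in seconds.
    total_seconds = leadout_offset // FRAMES_PER_SECOND - starts[0]
    disc_id = ((checksum % 0xFF) << 24) | (total_seconds << 8) | len(track_offsets)
    return f"{disc_id:08x}"


# Hypothetical two-track disc: tracks start at 2 s and 202 s, lead-out at 402 s.
print(cddb_disc_id([150, 15150], 30150))
```

Nothing here needs a label printed on the disc; everything comes from the table of contents, which is exactly what a PDF of a paper lacks.
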
This is one reason why I like SICIs (Serial Item and Contribution Identifiers; see my bookmarks for sici for some background). These can be computed from metadata about an individual article, often using just information printed with the article (although the ISSN might not be printed). This, coupled with the collaborative nature of CD databases such as CDDB and freedb (users supply missing metadata), makes them a useful illustration of how we might construct a database of taxonomic literature. Users could contribute metadata about papers, with identifiers computed from the papers themselves.
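
As a rough illustration of computing an identifier from the bibliographic details of a paper, the sketch below builds a simplified, SICI-like key: ISSN, year, volume, issue, first page, and a title code made from the initial letters of the first six title words. This is not a full SICI; it deliberately leaves out the control segment and check character that the standard defines, and the function names and bibliographic values are hypothetical.

```python
import re


def title_code(title, max_words=6):
    """Initial letters of the first few title words, upper-cased."""
    # Keep apostrophes inside words so "can't" counts as one word.
    words = re.findall(r"[A-Za-z0-9']+", title)
    return "".join(w[0].upper() for w in words[:max_words])


def sici_like_key(issn, year, volume, issue, first_page, title):
    """Build a simplified SICI-like key (no control segment, no check character)."""
    return f"{issn}({year}){volume}:{issue}<{first_page}:{title_code(title)}>"


# Made-up bibliographic details, used only to show the shape of the key.
print(sici_like_key("1234-5679", 2004, 12, 3, 45,
                    "Why can't I manage academic papers like MP3s?"))
```

Two users holding the same article and the same citation details would arrive at the same key independently, which is the property that makes a CDDB-style shared database workable.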


harijay said...

If I am not wrong, the mekentosj duo have put together an application that aggregates PDF files and makes them searchable through an iTunes-like interface. The application is available for Mac only and is called Papers ..check out
I don't know how they extract and handle the metadata from PDF files, though.

Roderic Page said...

Papers is very nice, but only serves to make the original point even more painful -- iTunes "knows" what CD I've inserted, or what song is in the AAC file I've added, whereas Papers can't tell which paper a PDF contains. Howison and Goodrum's original point is: why can't I just drag a PDF onto a program like Papers and have all the bibliographic metadata extracted automatically?

Anonymous said...

"why can't I just drag a PDF onto a program like Papers, and all the bibliographic metadata is automatically extracted?"

Because there's no incentive for publishers to add this metadata - it's not like you can choose to buy the PDF somewhere else with better metadata.

I wrote a script a while ago that went some way towards automating identification of the paper and adding XMP metadata, but never took it any further.

Anonymous said...

BTW, Papers does automatically extract metadata from PDFs, when it exists.