Monday, November 10, 2008

From bibliographic coupling to data coupling

Bibliographic coupling is a term coined by Kessler (doi:10.1002/asi.5090140103) in 1963 as a measure of similarity between documents. If two documents, A and B, cite a third, C, then A and B are coupled.
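
As a rough illustration of the definition above, here is a minimal Python sketch that finds coupled pairs from a table of who cites what. The document labels (A, B, C and so on) and the function name `bibliographic_coupling` are made up for the example; this is not Kessler's own procedure, just the "two documents share a reference" idea written down.

```python
from itertools import combinations


def bibliographic_coupling(citations):
    """Return {(doc1, doc2): shared references} for every coupled pair.

    `citations` maps a document to the set of documents it cites.
    """
    coupled = {}
    for a, b in combinations(sorted(citations), 2):
        shared = citations[a] & citations[b]
        if shared:  # at least one reference in common means the pair is coupled
            coupled[(a, b)] = shared
    return coupled


# Hypothetical toy data: A and B both cite C, F shares nothing with either.
citations = {
    "A": {"C", "D"},
    "B": {"C", "E"},
    "F": {"G"},
}

coupling = bibliographic_coupling(citations)
print(coupling)  # {('A', 'B'): {'C'}}
```

The number of shared references in each set can be read as a simple coupling strength.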

I'm interested in extending this to data, such as DNA sequences and specimens. In part this is because within the challenge dataset I'm finding cases where authors cite data, but not the paper publishing the data. For example, a paper may list all the DNA sequences it uses (thus citing the original data), but not the paper providing the data.

To make this concrete, the paper "Towards a phylogenetic framework for the evolution of shakes, rattles, and rolls in Myiarchus tyrant-flycatchers (Aves: Passeriformes: Tyrannidae)" (doi:10.1016/S1055-7903(03)00259-8) lists the sequences used, but does not cite the source of three of these, which is the Science paper "Nonequilibrium Diversity Dynamics of the Lesser Antillean Avifauna" (doi:10.1126/science.1065005). As a result, if I were reading "Nonequilibrium Diversity Dynamics of the Lesser Antillean Avifauna" and wanted to learn who had cited it, I would miss the fact that the paper "Towards a phylogenetic framework for the evolution of shakes, rattles, and rolls..." had used the data (and hence, in effect, "cited" the paper). In some cases, data citation may be more relevant than bibliographic citation because it relates to people using the data, which seems a more significant action than simply reading the paper.

Note that I'm not interested in the issue of credit as such. In the above example, the authors of the Science paper are also coauthors of the "shakes, rattles, and rolls" paper, and hence show commendable restraint in not citing themselves. I'm interested in the fate of the data. Who has used it? What have they done with it? Has anybody challenged the data (for example, suggesting a sequence was misidentified)? These are the things that a true "web of data" could tell us.


Robert Huber said...

Interesting! The problem here is that Ricklefs and Bermingham are wrongly cited: Reference 14 has a link which directs to the supplementary material and not to the original paper.
Btw.: When an author cites primary data, does he really also cite (mean) the paper it was published in? I think primary data can be cited itself. But of course the metadata of the data (Arrgghh) must include a link to the publication in which this data was first published.

Roderic Page said...


I would argue that primary data can be cited just like a publication, and we are already implicitly doing this when we list GenBank accession numbers, specimen codes, etc. What I'm aiming to do is extract this information.

If we have identifiers for publications and data, then it's pretty easy to link the two. We just need a database of those links, which is what I'm playing with for the Challenge.
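
To give a feel for what such a database of links might look like, here is a small sketch using Python's built-in sqlite3 module. Everything about the schema is an assumption made for illustration: the table name `link`, the `publishes`/`uses` relation labels, and the function `data_citations` are hypothetical, and the sequence identifiers (`SEQ_1` and so on) are placeholders rather than real GenBank accession numbers. Only the two DOIs come from the post.

```python
import sqlite3

SCIENCE = "doi:10.1126/science.1065005"
MYIARCHUS = "doi:10.1016/S1055-7903(03)00259-8"

conn = sqlite3.connect(":memory:")
# One row per (publication, data item) link; relation is 'publishes' or 'uses'.
conn.execute("CREATE TABLE link (publication TEXT, data_id TEXT, relation TEXT)")
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?)",
    [
        # Placeholder identifiers, not real accession numbers.
        (SCIENCE, "SEQ_1", "publishes"),
        (SCIENCE, "SEQ_2", "publishes"),
        (SCIENCE, "SEQ_3", "publishes"),
        (MYIARCHUS, "SEQ_1", "uses"),
        (MYIARCHUS, "SEQ_2", "uses"),
        (MYIARCHUS, "SEQ_3", "uses"),
        (MYIARCHUS, "SEQ_4", "publishes"),
    ],
)


def data_citations(conn, publication):
    """List (publication, count) for other papers using data this paper published."""
    rows = conn.execute(
        """
        SELECT u.publication, COUNT(DISTINCT u.data_id)
        FROM link AS p
        JOIN link AS u ON u.data_id = p.data_id
        WHERE p.publication = ?
          AND p.relation = 'publishes'
          AND u.relation = 'uses'
          AND u.publication != p.publication
        GROUP BY u.publication
        ORDER BY u.publication
        """,
        (publication,),
    )
    return rows.fetchall()


result = data_citations(conn, SCIENCE)
print(result)  # [('doi:10.1016/S1055-7903(03)00259-8', 3)]
```

With links stored this way, the question "who has used the data from this paper?" becomes a single self-join, even when the using paper never cites the source paper in its reference list.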