iPhylo: Problems with the DataCite Data Citation Corpus

Roderic D. M. Page

Tuesday, February 20, 2024

How to cite: Page, R. (2024). Problems with the DataCite Data Citation Corpus https://doi.org/10.59350/t80g1-xys37

DataCite have released the Data Citation Corpus, together with a dashboard that summarises the corpus. This is billed as:

A trusted central aggregate of all data citations to further our understanding of data usage and advance meaningful data metrics

The goal is to build a citation database between scholarly articles and data, such as datasets in repositories, sequences in GenBank, protein structures in PDB, etc. Access to the corpus can be obtained by submitting a form, then having a (very pleasant) conversation with DataCite about the nature of the corpus. This process feels clunky because it introduces friction. If you want people to explore this, why not make it a simple download?

I downloaded the corpus, which is nearly 7 GB of JSON formatted as a single array(!), thankfully with one citation per line, so it is reasonably easy to parse. (JSON Lines would be more convenient.)
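As a minimal sketch of what "reasonably easy to parse" can look like, the following reads such a file line by line and rewrites it as JSON Lines. It assumes each line holds exactly one JSON object, optionally followed by a comma, and that the array brackets sit on their own lines or are fused to the first and last records. The function names and the "id" key in the sample are illustrative and not taken from the corpus.

```python
import io
import json
from typing import Iterator, TextIO


def iter_citations(handle: TextIO) -> Iterator[dict]:
    """Yield one citation per line from a JSON array laid out one object per line."""
    for raw in handle:
        line = raw.strip().rstrip(",")
        # Skip blank lines and the brackets that open and close the array.
        if line in ("", "[", "]"):
            continue
        # Tolerate a bracket fused onto the first or last record.
        if line.startswith("[{"):
            line = line[1:]
        if line.endswith("}]"):
            line = line[:-1]
        yield json.loads(line)


def to_json_lines(source: TextIO, target: TextIO) -> int:
    """Rewrite the array as JSON Lines and return the number of records written."""
    count = 0
    for citation in iter_citations(source):
        target.write(json.dumps(citation) + "\n")
        count += 1
    return count


# Toy input standing in for the real file; "id" is a hypothetical key.
sample = io.StringIO('[\n{"id": "a"},\n{"id": "b"}\n]\n')
output = io.StringIO()
print(to_json_lines(sample, output), "records converted")
```

Because the file is read one line at a time, the whole 7 GB array never has to be held in memory.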

I loaded this into a SQLite database to make it easier to query, and I have some thoughts. Before outlining why I think the corpus has serious problems, I should emphasise that I’m a big fan of what DataCite are trying to do. Being able to track data usage to give credit to researchers and repositories (citations to data as well as papers), to track provenance of data (e.g., when a GenBank sequence turns out to be wrong, being able to find all the studies that used it), and to find additional links between papers beyond bibliographic links (e.g., when data is cited but not the original publication) are all good things. Obviously, lots of people have talked about this, but this is my blog so I’ll cite myself as an example 😉.

Page, R. Visualising a scientific article. Nat Prec (2008). https://doi.org/10.1038/npre.2008.2579.1

My main interest in the corpus is tracking citations of DNA sequences, which are often not linked to even the original publication in GenBank. I was hopeful the corpus could help in this work.

Ok, let’s now look at the actual corpus.

Data structure

Each citation comprises a JSON object, with a mix of external identifiers such as DOIs, and internal identifiers as UUIDs. The latter are numerous, and make the data file much bigger than it needs to be. For example, there are two sources of citation data, DataCite and the Chan Zuckerberg Initiative. These have sourceId values of 3644e65a-1696-4cdf-9868-64e7539598d2 and c66aafc0-cfd6-4bce-9235-661a4a7c6126, respectively. There are a little over 10 million citations in the corpus, so that’s a lot of bytes that could simply have been 1 or 2.
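To put a rough number on the overhead, here is a back-of-the-envelope calculation for the sourceId field alone. It assumes one byte per character, rounds the corpus down to 10 million citations, and ignores the quotes and key name around each value, so the real figure will be somewhat higher.

```python
UUID_CHARS = 36          # 32 hex digits plus 4 hyphens
CITATIONS = 10_000_000   # "a little over 10 million", rounded down


def bytes_used(chars_per_value: int, rows: int = CITATIONS) -> int:
    """Bytes needed to store one value of the given length on every row."""
    return chars_per_value * rows


uuid_bytes = bytes_used(UUID_CHARS)
short_bytes = bytes_used(1)  # a single digit such as 1 or 2
print(f"{uuid_bytes / 1e6:.0f} MB as UUIDs vs {short_bytes / 1e6:.0f} MB as single digits")
```

The same arithmetic applies to every other UUID-valued field in each record.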

More frustrating than the wasted space is the lack of any list of what each UUID means. I figured out that 3644e65a-1696-4cdf-9868-64e7539598d2 is DataCite only by looking at the data, knowing that CZI had contributed more records than DataCite. For other entities such as repositories and publishers, one has to go spelunking in the data to make reasonable guesses as to what the repositories are. Given that most citations seem to be to biomedical entities, why not use something such as the compact identifiers from Identifiers.org for each repository?
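The kind of spelunking described above can be sketched as a simple tally: count how often each sourceId value occurs, then attach a label once the counts make the identity clear. The two UUIDs and their labels come from this post, while the helper names and the toy records are illustrative only.

```python
from collections import Counter
from typing import Iterable

# Labels inferred in this post; the corpus itself ships no such lookup table.
SOURCE_LABELS = {
    "3644e65a-1696-4cdf-9868-64e7539598d2": "DataCite",
    "c66aafc0-cfd6-4bce-9235-661a4a7c6126": "Chan Zuckerberg Initiative",
}


def count_by_field(citations: Iterable[dict], field: str) -> Counter:
    """Tally the values of one field across all citation records."""
    return Counter(c.get(field, "(missing)") for c in citations)


def label_counts(counts: Counter, labels: dict) -> dict:
    """Swap known UUIDs for readable labels, most frequent first."""
    return {labels.get(key, key): n for key, n in counts.most_common()}


# Toy records standing in for the real corpus.
records = [
    {"sourceId": "c66aafc0-cfd6-4bce-9235-661a4a7c6126"},
    {"sourceId": "c66aafc0-cfd6-4bce-9235-661a4a7c6126"},
    {"sourceId": "3644e65a-1696-4cdf-9868-64e7539598d2"},
]
print(label_counts(count_by_field(records, "sourceId"), SOURCE_LABELS))
```

The same tally could be run on whichever fields hold the repository and publisher UUIDs, although guessing those labels still needs a look at the cited identifiers themselves.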

Dashboard

DataCite provides a dashboard to summarise key features of the corpus. There are a couple of aspects of the dashboard that I find frustrating.

Firstly, the “citation counts by subject” is misleading. A quick glance suggests that law and sociology are the subjects that most actively cite data. This would be surprising, especially given that much of the data generated by CZI comes from PubMed Central. Only 50,000 citations out of 10 million comprise articles with subject tags, so this chart is showing results for approximately 0.5% of the corpus. The chart includes the caveat “The visualization includes the top 20 subjects where metadata is available.” but omits to tell us that as a result the chart is irrelevant for >99% of the data.

The dashboard is interesting in what it says about the stakeholders of this project. We see counts of citations broken down by source (CZI or DataCite), and publisher, but not by repository. This suggests that repositories are second class citizens. Surely they deserve a panel on the dashboard? I suspect researchers are going to be more interested in what kinds of data are being cited than what academic publishers are in the corpus. For instance, 3.75 million (37.5%) citations are to sequences in GenBank, 1.7 million (17.5%) are to the Protein Data Bank (PDB), and 0.89 million (8.9%) are to SNPs.

Chan Zuckerberg Initiative and AI

The corpus is a collaboration between DataCite and the Chan Zuckerberg Initiative (CZI) and CZI are responsible for the bulk of the data. Unfortunately there is no description of how those citations were extracted from the source papers. Perhaps CZI used something like SciBERT which they employed in earlier work to extract citations to scientific software https://arxiv.org/abs/2209.00693? We don’t know. One reason this matters is that there are lots of cases where the citations are incorrect, and if we are going to figure out why, we need to know how they were obtained. At present it is simply a black box.

These are just a few examples of incorrect citations:

The mouse line Prdm^11tm1.1ahl is conflated with the PDB identifier 1ahl, see https://hyp.is/c2Xras_KEe6zEGcm97yBRw/journals.plos.org/plosone/article?id=10.1371/journal.pone.0134503
A museum specimen CR00240699 is mistakenly interpreted as a GenBank accession number, see https://hyp.is/CGTJcM_kEe674TfyvGLC0A/zookeys.pensoft.net/article/21580/download/pdf/287887
A grant number Y21026 is mistakenly interpreted as a GenBank accession number, see https://hyp.is/HpVXhs9PEe6D2UMxrIdqJw/bmjopen.bmj.com/content/12/9/e054887
The time period 24 hours (24hr) is conflated with a PDB record 24hr that doesn’t exist https://hyp.is/dNfqZs9SEe6U2nMOKHb-Pw/journal.waocp.org/article_89819_8835738205ecaaad36eebfa826a17779.pdf. There are a lot of these, such as 17^th, 2016, etc.

These are just a few examples I came across while pottering around with the corpus. I’ve not done any large-scale analysis, but one ZooKeys article I came across https://doi.org/10.3897/zookeys.739.21580 cites 32 entities, only four of which are correct.

I get that text mining is hard, but I would expect AI to do better than what we could achieve by simply matching dumb regular expressions. For example, surely a tool that claims any measure of intelligence would be able to recognise that this sentence lists grant numbers, not a GenBank accession number?

Funding This study was supported by Longhua Hospital Shanghai University of Traditional Chinese Medicine (grant number: Y21026), and Longhua Hospital Shanghai University of Traditional Chinese Medicine (YW.006.035)

As a fallback, we could also check that a given identifier is valid. For example, there is no sequence with the accession number Y21026. The set of possible identifiers is finite (if large), so why didn’t the corpus check whether each identifier extracted actually existed?
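A minimal sketch of that fallback, assuming a set of valid identifiers has already been obtained from the repository in question. The function name is illustrative, and the single entry in the known set below is a placeholder rather than a real accession number.

```python
from typing import Iterable


def split_by_validity(extracted: Iterable[str], known: set[str]) -> tuple[list[str], list[str]]:
    """Separate extracted identifiers into those found in the known set and the rest."""
    valid, invalid = [], []
    for identifier in extracted:
        # Normalise case and whitespace before the lookup.
        key = identifier.strip().upper()
        (valid if key in known else invalid).append(identifier)
    return valid, invalid


# Placeholder standing in for the full list of real accession numbers.
known_accessions = {"PLACEHOLDER1"}
valid, invalid = split_by_validity(["Y21026", "placeholder1"], known_accessions)
print("keep:", valid, "reject:", invalid)
```

A set lookup is constant time per identifier, so even a corpus of 10 million citations could be screened this way once the list of valid identifiers is in memory.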

Update: major errors found

I've created a GitHub repo to keep track of the errors I'm finding.

Protein Data Bank

The Protein Data Bank (PDB) is the second largest repository in the corpus with 1,729,783 citations. There are 177,220 distinct PDB identifiers cited. These identifiers should match the pattern /^[0-9][A-Za-z0-9]{3}$/, that is, a digit 0-9 followed by three alphanumeric characters. However, 31,612 (18%) do not. Examples include "//osf.io/6bvcq" and "//evs.nci.nih.gov/ftp1/CTCAE/CTCAE_4.03/Archive/CTCAE_4.0_2009-05-29_QuickReference_8.5x11.pdf". So the tools for finding PDB citations do not understand what a PDB identifier should look like.
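The pattern check is a one-liner. The sketch below uses Python's re.fullmatch with the same character classes, which is equivalent to anchoring the pattern with ^ and $ except that it also rejects a trailing newline. The function name is illustrative.

```python
import re

# A digit followed by three alphanumeric characters, as described above.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(value: str) -> bool:
    """True if the whole string has the shape of a PDB identifier."""
    return PDB_ID.fullmatch(value) is not None


for candidate in ["1ahl", "//osf.io/6bvcq", "24hr", "2016"]:
    print(candidate, looks_like_pdb_id(candidate))
```

Note that "24hr" and "2016" pass this test even though they are a time period and a year, so matching the pattern is necessary but nowhere near sufficient.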

Out of curiosity I downloaded all the existing PDB identifiers from https://files.wwpdb.org/pub/pdb/holdings/current_file_holdings.json.gz, which gave me 216,225 distinct PDB identifiers. Comparing actual PDB identifiers with ones included in the corpus I got 1,233,993 hits, which is 71% of the total in the corpus. Hence nearly half a million (495,790, a little under a third of the PDB citations) appear to be made up.
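The comparison can be sketched as follows. This assumes the holdings file, once decompressed, is a JSON object whose top-level keys are the PDB identifiers, which is worth confirming against the downloaded file. The function names are illustrative, and identifiers are lower-cased on both sides so that case differences do not count as misses.

```python
import gzip
import json
from typing import Iterable


def load_holdings(path: str) -> set[str]:
    """Read the gzipped holdings file and return its identifiers in lower case."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the top-level keys are the PDB identifiers.
    return {str(key).lower() for key in data}


def count_hits(cited: Iterable[str], holdings: set[str]) -> tuple[int, int]:
    """Return (citations whose identifier exists, total citations checked)."""
    hits = total = 0
    for identifier in cited:
        total += 1
        if identifier.strip().lower() in holdings:
            hits += 1
    return hits, total
```

Calling count_hits with one entry per citation, rather than per distinct identifier, gives a hit count that is directly comparable with the total number of PDB citations.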

Individual articles

Taxonomic revision of Stigmatomma Roger (Hymenoptera: Formicidae) in the Malagasy region

The paper https://doi.org/10.3897/BDJ.4.e8032 is credited with citing 126 entities, including 108 sequences and 14 PDB records. None of this is true. The supposed PDB records are figure numbers, e.g. “Fig. 116d” becomes PDB 116d, and the sequence accession numbers are specimen codes or field numbers.
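One cheap guard against this particular failure is to look at what comes immediately before a candidate identifier. The sketch below flags candidates that directly follow "Fig.", "Figs." or "Figure". It is a hypothetical heuristic, not a description of how the corpus was built, and it would miss figure numbers that appear later in a list such as "Figs 116a, 116d".

```python
import re

# A figure label followed by something shaped like a PDB identifier.
FIGURE_REF = re.compile(
    r"\b(?:Figs?\.?|Figures?)\s*([0-9][A-Za-z0-9]{3})\b",
    re.IGNORECASE,
)


def figure_like_candidates(text: str) -> set[str]:
    """Return candidate identifiers that are really figure numbers."""
    return {match.group(1).lower() for match in FIGURE_REF.finditer(text)}


sentence = "Mandibles as in Fig. 116d; compare with PDB entry 1ahl."
print(figure_like_candidates(sentence))
```

Any candidate returned by this function could be dropped before it is ever recorded as a citation.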

Nucleotide sequences

Sequence data is the single largest data type cited in the corpus, with 3.8 million citations. I ran a sample of the first 1,000 sequence accession numbers in the corpus against GenBank and in 486 cases GenBank didn't recognise the accession number as valid. So potentially half the sequence citations are wrong.
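For a sense of how much weight that sample can bear, here is the failure rate with a rough 95% interval using the normal approximation. The interval treats the sample as random, which the first 1,000 records of a file are not, so it is indicative at best.

```python
import math


def proportion_with_interval(failures: int, sample_size: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return the observed proportion and a normal-approximation interval around it."""
    p = failures / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)


p, low, high = proportion_with_interval(486, 1000)
print(f"{p:.1%} unrecognised, interval roughly {low:.1%} to {high:.1%}")
```

A random sample drawn from across the whole corpus would give a more trustworthy estimate than the first records in the file.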

Summary

I think the Data Citation Corpus is potentially a great resource, but if it is going to be “[a] trusted central aggregate of all data citations” then I think there are a few things it needs to do:

Make the data more easily accessible so that people can scrutinise it without having to jump through hoops
Tell us how the Chan Zuckerberg Initiative did the entity matching
Improve the entity matching
Add a quality control step that validates extracted identifiers
Expand the dashboard to give users a better sense of what data is being cited
