Thursday, August 07, 2025

Make Data Count Kaggle Competition

I’ve written several times here about the Make Data Count project and its major output to date, the Data Citation Corpus, currently at version 4 (see “The fourth release of the Data Citation Corpus incorporates data citations from Europe PMC and additions to affiliation metadata”).

In June Make Data Count launched a Kaggle Competition with the goal of developing a tool that will process articles (in either PDF or XML format), extract data citations (e.g., DOIs for datasets in repositories such as Dryad, or accession numbers such as 6TAP in the Protein Data Bank), and classify these citations as either “primary” (data published in that paper) or “secondary” (reuse of existing data).
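To make the task concrete, here is a minimal sketch of what the extraction and classification step might look like. It is not the competition’s pipeline: the regular expressions, the cue phrases, and the primary/secondary heuristic are all my own illustrative assumptions, and both patterns over-match badly.

```python
import re

# Hypothetical patterns for two common kinds of data citation: dataset DOIs
# and short PDB-style accession codes. Both over-match; a real pipeline
# would validate candidates against known repository identifier formats.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ACCESSION_PATTERN = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

# Phrases that (very roughly) suggest the authors produced the data themselves.
PRIMARY_CUES = ("we deposited", "have been deposited", "data generated in this study")

def extract_candidates(text: str) -> list[dict]:
    """Return candidate data citations with a naive primary/secondary guess."""
    candidates = []
    for match in list(DOI_PATTERN.finditer(text)) + list(ACCESSION_PATTERN.finditer(text)):
        identifier = match.group(0).rstrip(".,;")
        # Look at a window of surrounding text to guess the citation type.
        context = text[max(0, match.start() - 200):match.end() + 200].lower()
        kind = "primary" if any(cue in context for cue in PRIMARY_CUES) else "secondary"
        candidates.append({"dataset_id": identifier, "type": kind})
    return candidates

if __name__ == "__main__":
    sample = ("The structure 6TAP was retrieved from the Protein Data Bank; "
              "sequence data we deposited are available at https://doi.org/10.5061/dryad.example.")
    print(extract_candidates(sample))
```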

I think the competition is an excellent idea, and the US$100,000 prize pool is a great motivator to get people trying to solve this problem. I’m taking part in the competition, which has meant learning Python very fast. I’ve dabbled a bit before, but this was a whole new thing. ChatGPT has been indispensable, especially in explaining why something I was doing wasn’t going to work, and what an error message really meant. The whole process became horribly addictive. You can submit a solution up to five times a day, and the counter resets at midnight GMT, so there were nights I was up well after midnight coding and using up the following day’s submission quota! Another interesting feature is the lively discussion among people who are rivals for substantial prize money. Participants are sharing code and ideas (often not their best-scoring ideas, after all, everyone wants to win), but still giving hints and support, and sharing findings.

The competition provides a small set of training data (about 500 PDFs and a similar number of XML files). The idea is that you write code to analyse those files and output a list of data citations. You then submit your entry to Kaggle, which runs your code against a “hidden” set of PDFs and XML files and tells you your score. The best score wins prizes. My place in this competition pretty accurately reflects my skills and ability :)
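The overall loop looks something like the sketch below: read each article, predict its data citations, and write them out for scoring. The directory name, output file name, and column headings here are my own placeholders; the competition defines its own file layout and scoring metric.

```python
from pathlib import Path
import csv

# Assumed layout: a directory of article files to score and a CSV of
# predictions. The real competition defines its own file names, column
# headings and scoring metric, so treat this purely as the shape of the loop.
ARTICLE_DIR = Path("test_articles")   # hypothetical input directory
OUTPUT_CSV = Path("submission.csv")   # hypothetical output file

def predict_citations(path: Path) -> list[tuple[str, str]]:
    """Stand-in for the real extraction step: return (dataset_id, type) pairs."""
    # XML can be read as text directly; PDFs first need a text-extraction step.
    return []  # replace with an actual extractor, e.g. the sketch above

def main() -> None:
    with OUTPUT_CSV.open("w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["article_id", "dataset_id", "type"])
        for article in sorted(ARTICLE_DIR.glob("*")):
            for dataset_id, citation_type in predict_citations(article):
                writer.writerow([article.stem, dataset_id, citation_type])

if __name__ == "__main__":
    main()
```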

Issues with the competition

Unfortunately the competition itself has been, how shall I put this, poorly run. There has been virtually no engagement from DataCite in their own competition, despite repeated queries from entrants asking them to explain the often inexplicable reasoning behind the scoring in the training data, or why some of the PDFs are wrong or incomplete. Some PDFs are preprints rather than the published papers (and may differ in whether they cite data or not). The XML comes in a variety of formats, which we weren’t told about: some is “gold standard” JATS XML as used by PubMed Central, some is publisher-specific, and some is the output of PDF parsers or annotation tools.
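Because the files don’t announce which dialect they are in, one way to cope is to sniff the root element before deciding how to parse a file. A minimal sketch, with tag names that are illustrative rather than an exhaustive list of the formats in the competition data:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def xml_flavour(path: Path) -> str:
    """Rough guess at which kind of XML a file is, based on its root element.

    The tag names below are illustrative: JATS files typically have an
    <article> root, while some PDF parsers emit TEI.
    """
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return "unparseable"
    tag = root.tag.rsplit("}", 1)[-1].lower()  # strip any namespace prefix
    if tag == "article":
        return "JATS-like"
    if tag == "tei":
        return "TEI (PDF parser output)"
    return f"other ({tag})"
```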

I ended up making my own training data (https://doi.org/10.34740/kaggle/dsv/12667298), listing what I think are the actual data citations (about twice as many as are in the “official” training data).

There are some high-scoring entries (see the leaderboard), so it looks like Make Data Count will get some useful tools from this competition. My only concern is that these tools may be optimised to replicate the somewhat erratic and poorly described annotation process that DataCite used to create the training and “hidden” test data, rather than to accurately retrieve the actual data citations. Perhaps my concerns will prove unfounded, or maybe the tools can easily be retrained with better data.

But I am somewhat baffled that such an important project, one for which Make Data Count has secured funding for serious prize money, has been essentially left unattended by the organisers.

The competition runs until 3 September.

Written with StackEdit.