Friday, April 19, 2024

Notes on transforming BHL images

How to cite: Page, R. (2024). Notes on transforming BHL images https://doi.org/10.59350/2gpbb-98a53

I’ve been down this road before, e.g. BHL, DjVu, and reading the f*cking manual and Demo of full-text indexing of BHL using CouchDB hosted by Cloudant, but I’m revisiting converting BHL page scans to black and white images, partly to clean them up, to make them closer to what a modern reader might expect, and partly to reduce the size of the image. The latter means faster loading times and smaller PDFs for articles.

The links above explored using foreground image layers from DjVu (less useful now that DjVu is almost dead as a format), and using CSS in web browsers to convert a colour image to gray scale. I’ve also experimented with the approach taken by Google Books (see https://github.com/rdmpage/google-book-images), which uses jbig2enc to compress images and reduce the number of colours.

In my latest experiments, I use jbig2enc to transform BHL page images into black and white images where each pixel is either black or white (i.e., image depth = 1), then use ImageMagick to resize the image to the Google Books width of 685 pixels and a depth of 2. Typically this gives an image around 25 KB to 30 KB in size. It looks clean and readable.

This approach breaks down for photographs and especially colour plates. For example, this image looks horrible:

When compressing images that have photos or illustrations jbig2enc can extract the part of the image that includes the illustration, for example:

This isn’t perfect, but it raises the possibility that we can convert text and line drawings to black and white, and then add back photographs and plates (whether black and white, or colour). After some experimentation using tools such as ImageMagick composite I have a simple workflow:

  • compress page image using jbig2enc
  • take the extracted illustration and set all white pixels to be transparent
  • convert the black and white image output by jbig2enc to colour (required for the next step)
  • create a composite image by overlaying the extracted illustration (now on a transparent background) on top of the black-and-white page image
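
The overlay logic in these steps can be sketched in a few lines of Python. This is only an illustration on tiny in-memory pixel grids, not the jbig2enc and ImageMagick commands themselves; the function names and the whiteness threshold are invented for the example.

```python
# Sketch of the overlay logic on tiny in-memory images (lists of rows of pixels).
# The real workflow uses jbig2enc and ImageMagick; names and threshold are illustrative.

WHITE_THRESHOLD = 250  # assumption: channels at or above this count as "white"


def white_to_transparent(rgb_rows, threshold=WHITE_THRESHOLD):
    """Turn an RGB image into RGBA, making (near-)white pixels fully transparent."""
    return [
        [(r, g, b, 0 if min(r, g, b) >= threshold else 255) for (r, g, b) in row]
        for row in rgb_rows
    ]


def bilevel_to_rgb(bit_rows):
    """Convert a 1-bit image (1 = black ink, 0 = white paper) to RGB pixels."""
    return [[(0, 0, 0) if bit else (255, 255, 255) for bit in row] for row in bit_rows]


def overlay(page_rgb, illustration_rgba):
    """Paste opaque illustration pixels over the page; transparent ones show the page."""
    return [
        [(r, g, b) if a == 255 else page_px
         for page_px, (r, g, b, a) in zip(page_row, ill_row)]
        for page_row, ill_row in zip(page_rgb, illustration_rgba)
    ]


# 2x2 example: the page has ink top-left; the illustration has one coloured pixel.
page_bits = [[1, 0], [0, 0]]
illustration = [[(255, 255, 255), (200, 120, 40)], [(255, 255, 255), (255, 255, 255)]]

composite = overlay(bilevel_to_rgb(page_bits), white_to_transparent(illustration))
```

In the composite the ink pixel survives where the illustration was white, and the coloured pixel replaces the page where the illustration had content. A hard threshold like this is also why leftover sepia background gets carried across: anything darker than the threshold counts as illustration.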

The result looks passable:

In this case we still have a lot of the sepia-toned background and the illustration hasn’t been cleanly separated, but we do at least get some colour.

Still work to do, but it looks promising and suggests a way to make dramatically smaller PDFs of BHL content. There is some crude code, along with example files, on GitHub.

Update

Some Googling turned up Removing orange tint-mask from color-negatives, which gives us the following command:

convert 16281585.jpg -negate -channel all -normalize -negate -channel all 16281585-rgb.jpg

Applying this to our image results in:

This looks a lot better. Results will vary depending on the evenness of the page scan (i.e., is there a shadow on the image), but I think this gives us a way to display the plates with a higher degree of contrast.

Reading

Adam Langley, Dan S. Bloomberg, “Google Books: making the public domain universally accessible”, Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000H (2007/01/29); doi:10.1117/12.710609

Written with StackEdit.

Wednesday, March 27, 2024

Hugging Face Autotrain

How to cite: Page, R. (2024). Hugging Face Autotrain https://doi.org/10.59350/7p1n4-wdv84

These are notes to myself on using Hugging Face AutoTrain. The first version of this had a very nice interface where you could simply upload a folder of images and train a model. It was limited in the range of tasks and models, but made up for that in ease of use. Now AutoTrain has been replaced by AutoTrain Advanced, which not everyone is happy about.

Training a model

After a bit of fussing about (and paying attention to the log messages) I’ve managed to train a model to classify images in much the same way as before. The steps are as follows:

Go to AutoTrain Advanced. You should see a screen like this:

By default Docker and AutoTrain are selected. It will also show the free hardware spec (CPU basic • 2 vCPU • 16GB). I found that for image classification this hardware choice would cause AutoTrain to fail, so I selected Nvidia T4 small • 4 vCPU • 15GB.

Give your space a name and click on Create Space to create the space. You will now see something like this:

It took 3-4 minutes to build the space. Once the space is built you will then be asked to log in to Hugging Face (seems odd, but that’s what it asks you to do). You are then asked to give your space permissions to connect to your account.

Now you will see a slightly scary looking interface (this is one reason why people miss the old “easy” AutoTrain).

For Task I selected Image Classification and the default base model (google/vit-base-patch16-224). I ignored every other setting, and simply uploaded the training data. This was a zip file containing separate folders for each category of image, so that images, say of cats, would be in a folder called cats, pictures of dogs would be in dogs, etc.

I then clicked Start and, after a warning that this would cost money (I subscribe to Hugging Face), saw this:

You can track progress in the logs, which you can see using the middle of the buttons below.

Once completed, the space pauses, which is a little alarming but simply means that it has finished training. Yay, you now have a trained model!

When I first tried this, I got errors because I didn’t upload the data in the proper format: my zip file had a folder that contained the training data folders, whereas the folders need to be in the root of the zip archive. It also failed to train on the base (free) hardware, which I only discovered by looking at the logs and seeing error messages regarding the lack of a GPU.
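
A quick check of the zip layout before uploading would have saved me a failed run. The following is a minimal sketch using Python's standard zipfile module; the helper names and the example folder names (cats, dogs, training) are made up, and it only tests the one mistake described above.

```python
import io
import zipfile


def top_level_entries(zip_bytes):
    """Return the set of names that sit at the root of the zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name.split("/")[0] for name in archive.namelist() if name.strip("/")}


def looks_wrapped(zip_bytes):
    """Guess whether the class folders are nested inside an extra wrapper folder."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        files = [n for n in archive.namelist() if not n.endswith("/")]
    # Expected layout is <class>/<image>, i.e. exactly one "/" in each file path.
    return bool(files) and all(n.count("/") >= 2 for n in files)


def make_zip(paths):
    """Build a small in-memory zip with empty files, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path in paths:
            archive.writestr(path, b"")
    return buffer.getvalue()


good = make_zip(["cats/1.jpg", "dogs/1.jpg"])
bad = make_zip(["training/cats/1.jpg", "training/dogs/1.jpg"])
```

For a real archive, read the file's bytes and pass them in. If looks_wrapped returns True, or top_level_entries shows a single wrapper folder rather than your categories, re-zip from inside the folder that holds the category folders.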

What now?

The other thing about the original AutoTrain was that it gave you an app to explore how your model worked on other data. The new AutoTrain simply pauses after training and you are left with “um, what do I do now?”

After some fussing I discovered that in my profile I now had a brand new Model appearing in my list of models.

If I click on the model I go to the model page, where there is a Deploy button; this is how you get an app. First, though, make sure your model is publicly visible (by default it is private). Click on Settings and go to Change model visibility to make it public. If you now click on the Deploy button you will see a list of options:

I picked Spaces. This enables you to create a simple online app. I accepted all the defaults (including the base, free hardware with no GPU) and in a couple of minutes you get an app that looks like this:

Upload an image, press Submit and you will get a classification of that image:

Apps tend to sleep, so it may be that you come back to an app, load an image, and get an error message that the model is still loading. Wait a moment, try again, and it should work.

API

Using the app is fun, but if you want to use the model to classify lots of images then you want to use the API. The Deploy button lists Inference API (serverless) as an option. Clicking on that gives you the URL you can POST images to; it will return the results in JSON. As with the app, if the model is sleeping then your first call may throw an error. Typically you wait a moment and try again, and then you can classify images in bulk.
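
The "wait a moment and try again" pattern is easy to wrap in a small helper. This is a rough sketch, not the official client: the URL pattern and token are placeholders (copy the real URL from the Deploy menu), and treating HTTP 503 as "model still loading" is my assumption about how the sleeping state is reported.

```python
import json
import time
import urllib.error
import urllib.request

# Assumptions: placeholder URL and token; 503 is assumed to mean "still loading".
API_URL = "https://api-inference.huggingface.co/models/YOUR_USERNAME/YOUR_MODEL"
API_TOKEN = "hf_xxx"


def post_image(image_bytes, url=API_URL, token=API_TOKEN):
    """POST raw image bytes; return (HTTP status, decoded JSON body or None)."""
    request = urllib.request.Request(
        url, data=image_bytes, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as error:
        return error.code, None


def classify_with_retry(image_bytes, send=post_image, attempts=5,
                        wait_seconds=10, sleep=time.sleep):
    """Call the API, pausing and retrying while the model reports it is loading."""
    for attempt in range(attempts):
        status, body = send(image_bytes)
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError(f"Unexpected HTTP status {status}")
        if attempt < attempts - 1:
            sleep(wait_seconds)
    raise TimeoutError("Model did not wake up in time")
```

To classify a folder of images you would read each file as bytes and pass it to classify_with_retry. The send and sleep parameters are there so the retry logic can be tried out without touching the network.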

Summary

Hugging Face is quite an extraordinary tool, and it is a way to try and make sense of the explosion of AI techniques available. But it is clearly written by developers for developers, and that can make it intimidating, even for someone like me who writes code, uses GitHub, etc. The original AutoTrain was a joy to use in comparison, and this feels like a missed opportunity where Hugging Face could have kept the old "easy" version alongside the new, more powerful, but rather clunkier "advanced" version. Still, this is easier than dealing directly with the hellscape that is Python.

Tuesday, February 20, 2024

Problems with the DataCite Data Citation Corpus

How to cite: Page, R. (2024). Problems with the DataCite Data Citation Corpus https://doi.org/10.59350/t80g1-xys37

DataCite have released the Data Citation Corpus, together with a dashboard that summarises the corpus. This is billed as:

A trusted central aggregate of all data citations to further our understanding of data usage and advance meaningful data metrics

The goal is to build a citation database between scholarly articles and data, such as datasets in repositories, sequences in GenBank, protein structures in PDB, etc. Access to the corpus can be obtained by submitting a form, then having a (very pleasant) conversation with DataCite about the nature of the corpus. This process feels clunky because it introduces friction. If you want people to explore this, why not make it a simple download?

I downloaded the corpus, which is nearly 7 GB of JSON, formatted as an array(!), thankfully with one citation per line so it is reasonably easy to parse (JSON Lines would be more convenient).
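
Because each citation sits on its own line, the array can be streamed line by line rather than loaded whole. Here is a minimal sketch of that idea in Python with the standard json and sqlite3 modules. The table layout and the "id" field in the sample are made up for illustration; only sourceId is a field name taken from the corpus.

```python
import json
import sqlite3


def iter_citations(lines):
    """Yield one dict per citation from a JSON array laid out one element per line."""
    for line in lines:
        text = line.strip()
        if text.startswith("["):      # opening bracket of the array
            text = text[1:].lstrip()
        if text.endswith("]"):        # closing bracket of the array
            text = text[:-1].rstrip()
        text = text.rstrip(",")       # separator between array elements
        if text:
            yield json.loads(text)


def load_into_sqlite(lines, connection):
    """Store each citation's sourceId plus the raw JSON for later querying."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS citation (source_id TEXT, raw_json TEXT)"
    )
    rows = (
        (record.get("sourceId"), json.dumps(record))
        for record in iter_citations(lines)
    )
    connection.executemany("INSERT INTO citation VALUES (?, ?)", rows)
    connection.commit()


# Tiny stand-in for the real file (the "id" values are invented).
sample = [
    "[",
    '{"id": "example-1", "sourceId": "3644e65a-1696-4cdf-9868-64e7539598d2"},',
    '{"id": "example-2", "sourceId": "c66aafc0-cfd6-4bce-9235-661a4a7c6126"}',
    "]",
]
db = sqlite3.connect(":memory:")
load_into_sqlite(sample, db)
counts = db.execute(
    "SELECT source_id, COUNT(*) FROM citation GROUP BY source_id ORDER BY source_id"
).fetchall()
```

For the real corpus you would pass an open file handle instead of the sample list, and connect to a database file rather than :memory:. The bracket handling assumes every element is a JSON object, so a line can only start with "[" or end with "]" because of the enclosing array.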

I loaded this into a SQLite database to make it easier to query, and I have some thoughts. Before outlining why I think the corpus has serious problems, I should emphasise that I’m a big fan of what DataCite are trying to do. Being able to track data usage to give credit to researchers and repositories (citations to data as well as papers), to track provenance of data (e.g., when a GenBank sequence turns out to be wrong being able to find all the studies that used it), and to find additional links between papers beyond bibliographic links (e.g., when data is cited but not the original publication) are all good things. Obviously, lots of people have talked about this, but this is my blog so I’ll cite myself as an example 😉.

Page, R. Visualising a scientific article. Nat Prec (2008). https://doi.org/10.1038/npre.2008.2579.1

My main interest in the corpus is tracking citations of DNA sequences, which are often not linked to even the original publication in GenBank. I was hopeful the corpus could help in this work.

Ok, let’s now look at the actual corpus.

Data structure

Each citation comprises a JSON object, with a mix of external identifiers such as DOIs, and internal identifiers as UUIDs. The latter are numerous, and make the data file much bigger than it needs to be. For example, there are two sources of citation data, DataCite, and the Chan Zuckerberg Initiative. These have sourceId values of 3644e65a-1696-4cdf-9868-64e7539598d2 and c66aafc0-cfd6-4bce-9235-661a4a7c6126, respectively. There are a little over 10 million citations in the corpus, so that’s a lot of bytes that could simply have been 1 or 2.
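
To put a rough number on that, here is a back-of-the-envelope calculation for the sourceId field alone. It assumes exactly 10 million citations (the corpus has slightly more) and ignores the surrounding quotes and key name.

```python
SOURCE_UUID = "3644e65a-1696-4cdf-9868-64e7539598d2"
CITATIONS = 10_000_000  # "a little over 10 million", rounded down

uuid_bytes = len(SOURCE_UUID.encode("utf-8"))   # bytes per UUID value
short_bytes = len("1")                          # bytes if the source were just 1 or 2
saved_megabytes = (uuid_bytes - short_bytes) * CITATIONS / 1_000_000
```

That works out to 350 MB for a single field, and the same overhead applies to every other UUID-valued field in each record.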

More frustrating than the wasted space is the lack of any list of what each UUID means. I figured out that 3644e65a-1696-4cdf-9868-64e7539598d2 is DataCite only by looking at the data, knowing that CZI had contributed more records than DataCite. For other entities such as repositories and publishers, one has to go spelunking in the data to make reasonable guesses as to what the repositories are. Given that most citations seem to be to biomedical entities, why not use something such as the compact identifiers from Identifiers.org for each repository?

Dashboard

DataCite provides a dashboard to summarise key features of the corpus. There are a couple of aspects of the dashboard that I find frustrating.

Firstly, the “citation counts by subject” is misleading. A quick glance suggests that law and sociology are the subjects that most actively cite data. This would be surprising, especially given that much of the data generated by CZI comes from PubMed Central. Only 50,000 citations out of 10 million comprise articles with subject tags, so this chart is showing results for approximately 0.5% of the corpus. The chart includes the caveat “The visualization includes the top 20 subjects where metadata is available.” but omits to tell us that as a result the chart is irrelevant for >99% of the data.

The dashboard is interesting in what it says about the stakeholders of this project. We see counts of citations broken down by source (CZI or DataCite), and publisher, but not by repository. This suggests that repositories are second-class citizens. Surely they deserve a panel on the dashboard? I suspect researchers are going to be more interested in what kinds of data are being cited than what academic publishers are in the corpus. For instance, 3.75 million (37.5%) citations are to sequences in GenBank, 1.7 million (17.5%) are to the Protein Data Bank (PDB), and 0.89 million (8.9%) are to SNPs.

Chan Zuckerberg Initiative and AI

The corpus is a collaboration between DataCite and the Chan Zuckerberg Initiative (CZI) and CZI are responsible for the bulk of the data. Unfortunately there is no description of how those citations were extracted from the source papers. Perhaps CZI used something like SciBERT which they employed in earlier work to extract citations to scientific software https://arxiv.org/abs/2209.00693? We don’t know. One reason this matters is that there are lots of cases where the citations are incorrect, and if we are going to figure out why, we need to know how they were obtained. At present it is simply a black box.

These are just a few examples of incorrect citations:

These are just a few examples I came across while pottering around with the corpus. I’ve not done any large-scale analysis, but one ZooKeys article I came across https://doi.org/10.3897/zookeys.739.21580 cites 32 entities, only four of which are correct.

I get that text mining is hard, but I would expect AI would do better than what we could achieve by simply matching dumb regular expressions. For example, surely a tool that claims any measure of intelligence would be able to recognise that this sentence lists grant numbers, not a GenBank accession number?

Funding This study was supported by Longhua Hospital Shanghai University of Traditional Chinese Medicine (grant number: Y21026), and Longhua Hospital Shanghai University of Traditional Chinese Medicine (YW.006.035)

As a fallback, we could also check that a given identifier is valid. For example, there is no sequence with the accession number Y21026. The set of possible identifiers is finite (if large), why didn’t the corpus check whether each identifier extracted actually existed?

Update: major errors found

I've created a GitHub repo to keep track of the errors I'm finding.

Protein Data Bank

The Protein Data Bank (PDB) is the second largest repository in the corpus with 1,729,783 citations. There are 177,220 distinct PDB identifiers cited. These identifiers should match the pattern /^[0-9][A-Za-z0-9]{3}$/, that is, a number 0-9 followed by three alphanumeric characters. However 31,612 (18%) do not. Examples include "//osf.io/6bvcq" and "//evs.nci.nih.gov/ftp1/CTCAE/CTCAE_4.03/Archive/CTCAE_4.0_2009-05-29_QuickReference_8.5x11.pdf". So the tools for finding PDB citations do not understand what a PDB identifier should look like.
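
The syntactic check is a one-liner. A minimal sketch in Python, using the pattern above: "1abc" is a made-up identifier that merely has the right shape, the OSF string is one of the bad values quoted above, and "116d" comes from the article discussed further down.

```python
import re

# Pattern from the text: one digit followed by three alphanumeric characters.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def is_wellformed_pdb_id(identifier):
    """True if the string has the shape of a PDB identifier (not that it exists)."""
    return PDB_ID.fullmatch(identifier) is not None


candidates = ["1abc", "//osf.io/6bvcq", "116d", "abcd"]
wellformed = [c for c in candidates if is_wellformed_pdb_id(c)]
```

Note that "116d" passes. A pattern check only weeds out strings that could never be PDB identifiers; it says nothing about whether the entry exists or whether the paper really cited it.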

Out of curiosity I downloaded all the existing PDB identifiers from https://files.wwpdb.org/pub/pdb/holdings/current_file_holdings.json.gz, which gave me 216,225 distinct PDB identifiers. Comparing actual PDB identifiers with ones included in the corpus I got 1,233,993 hits, which is 71% of the total in the corpus. Hence almost half a million (a little under a third of the PDB citations) appear to be made up.
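
The comparison itself is a set lookup. Below is a rough sketch; the helper name and the toy identifiers are invented, loading the holdings file is left out, and matching case-insensitively is my assumption. The arithmetic at the end uses the counts quoted in this post.

```python
def match_against_holdings(cited_ids, known_ids):
    """Split cited identifiers into those present in the holdings and those absent."""
    known = {identifier.lower() for identifier in known_ids}
    found = [c for c in cited_ids if c.lower() in known]
    missing = [c for c in cited_ids if c.lower() not in known]
    return found, missing


# Toy example with invented holdings, just to show the shape of the result.
found, missing = match_against_holdings(["1ABC", "116d", "9zzz"], {"1abc", "116D"})

# Counts quoted in the text.
total_citations = 1_729_783
hits = 1_233_993
unmatched = total_citations - hits                  # 495,790
hit_percent = round(100 * hits / total_citations)   # 71
```

With a set of a couple of hundred thousand identifiers this kind of lookup is cheap, which is what makes it plausible as a routine quality-control step.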

Individual articles

Taxonomic revision of Stigmatomma Roger (Hymenoptera: Formicidae) in the Malagasy region

The paper https://doi.org/10.3897/BDJ.4.e8032 is credited with citing 126 entities, including 108 sequences and 14 PDB records. None of this is true. The supposed PDB records are figure numbers, e.g. “Fig. 116d” becomes PDB 116d, and the sequence accession numbers are specimen codes or field numbers.

Nucleotide sequences

Sequence data is the single largest data type cited in the corpus, with 3.8 million citations. I ran a sample of the first 1000 sequence accession numbers in the corpus against GenBank and in 486 cases GenBank didn't recognise the accession number as valid. So potentially half the sequence citations are wrong.
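
For a sense of how much weight a sample of 1000 can bear, here is a quick sketch of the usual normal-approximation margin of error. It assumes the sample behaves like a random one, which the first 1000 records in a file almost certainly do not, so treat the interval as indicative only.

```python
import math


def proportion_with_margin(failures, sample_size, z=1.96):
    """Return the observed proportion and its approximate 95% margin of error."""
    p = failures / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


p, margin = proportion_with_margin(486, 1000)
```

Here p is 0.486 and the margin is about 0.03, so under that assumption the failure rate would sit somewhere around 46% to 52%. A properly randomised sample would be needed to say anything firmer.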

Summary

I think the Data Citation Corpus is potentially a great resource, but if it is going to be “[a] trusted central aggregate of all data citations” then I think there are a few things it needs to do:

  • Make the data more easily accessible so that people can scrutinise it without having to jump through hoops
  • Tell us how the Chan Zuckerberg Initiative did the entity matching
  • Improve the entity matching
  • Add a quality control step that validates extracted identifiers
  • Expand the dashboard to give users a better sense of what data is being cited
