Monday, April 03, 2023

ChatGPT, of course

I haven’t blogged for a while, work and other reasons have meant I’ve not had much time to think, and mostly I blog to help me think.

ChatGPT is obviously a big thing at the moment, and once we get past the moral panic (“students can pass exams using AI!”) there are a lot of interesting possibilities to explore. Inspired by essays such as How Q&A systems based on large language models (eg GPT4) will change things if they become the dominant search paradigm — 9 implications for libraries and Cheating is All You Need, as well as [Paul Graham GPT]( I thought I’d try a few things and see where this goes.

ChatGPT can do some surprising things.

Parse bibliographic data

I spend a LOT of time working with bibliographic data, trying to parse it into structured data. ChatGPT can do this:

Note that it does more than simply parse the strings, it expands journal abbreviations such as “J. Malay Brch. R. Asiat. Soc.” to the full name “Journal of the Malayan Branch of the Royal Asiatic Society”. So we can get clean, parsed data in a range of formats.

Parse specimens

Based on the success with parsing bibliographic strings I wondered how well it could handle citation software specimens (“material examined”). Elsewhere I’ve been critical of Plazi’s ability to do this, see Problems with Plazi parsing: how reliable are automated methods for extracting specimens from the literature?.

For example, given this specimen record on p. 130 of doi:10.5852/ejt.2021.775.1553

LAOS • Kammoune Province, Bunghona Market, 7 km Nof Xe Bangfai River;
17.13674° N, 104.98591° E; E. Jeratthitikul, K. Wisittikoson, A. Fanka, N. Wutthituntisil and P. Prasankok leg.; sold by local people;

ChatGPT extracted a plausible Darwin Core record:

I’ve been experimenting with parsing specimen records using the same machine learning approach for bibliographic data (e.g., Citation parsing tool released), perhaps it’s time to rethink that idea.

Correct OCR errors

OCR errors are a well known limitation of the Biodiversity Heritage Library (BHL), see for example OCR Improvements: An Early Analysis. Given that ChatGPT is built on top of a large language model it seems obvious to see how it handles OCR errors. I tried the following snippet of text from BHL page 4312227:

i BREVIORA No. 185

Jiiaii Jr., 16, who always join him and Hedriek in th(> search for
frogs. Iloraeio Mayorga, research assistant to tlie author, has also
been (wtreniely useful, not only in field work, but also in the
tedious curatorial and laboratory work that usually fol- lows. Dr.
John Randall kindly took tlie photouraplis of E. hedricki, and the
Galiiiaues family of San Juan made their El Verde house available as
a (MMiter of activities for the colleetiuff party. To all these
jx’opic, the autlioi- feels (iee])ly indebted.

Elet’tiierodactylus iiKnRrcKi s]). n.

Type. Museum of Coniparative Zoology Xo. ;}()!)()8, c^ from El Verde,
west flank of El Yuiuinc Puerto Hico. I.IOO ft. Coll. J. A. Kivero,
11 Aug-. I!)(i2.

ChatGPT returns this:

i BREVIORA No. 185

Juan Jr., 16, who always joins him and Hedrick in the search for
frogs. Horacio Mayorga, research assistant to the author, has also
been extremely useful, not only in field work, but also in the tedious
curatorial and laboratory work that usually follows. Dr. John Randall
kindly took the photographs of E. hedricki, and the Galinanes family
of San Juan made their El Verde house available as a center of
activities for the collecting party. To all these people, the author
feels deeply indebted.

Eleutherodactylus hedricki sp. n.

Type. Museum of Comparative Zoology No. 30908, c? from El Verde, west
flank of El Yunque, Puerto Rico. 1100 ft. Coll. J. A. Rivero, 11 Aug.

Comparing this to the scanned image ChatGPT it does pretty well, for example the gobbledegook “Elet’tiierodactylus iiKnRrcKi” is correctly translated as “Eleutherodactylus hedricki”. Running all of BHL through ChatGPT probably isn’t feasible, but one could imagine targeted cleaning of key papers.


These small experiments are fairly trivial, but they are the sort of tedious tasks that would otherwise require significant programming (or other resources) to solve. But ChatGPT can do rather more, as I hope to discuss in the next post.

Written with StackEdit.