Friday, May 28, 2021

Finding citations of specimens

Note to self.

The challenge of finding specimen citations in papers keeps coming around. It seems that this is basically the same problem as finding citations to papers, and can be approached in much the same way.

If you want to build a database of references from scratch, one way is to scrape citations from papers (e.g., from the "literature cited" section), convert those strings into structured data, and add those to your database. In the early days of bibliographic searching this was a common strategy (and I still use it to help populate Wikidata).
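As a minimal sketch of the "convert those strings into structured data" step, here is a regular expression that handles exactly one citation style. The pattern, the field names, and the example citations are all made up for illustration; they are not taken from any particular tool.

```python
import re

# Hypothetical pattern for ONE citation style only:
#   Authors (Year) Title. Journal Volume: StartPage-EndPage.
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?)\s+(?P<volume>\d+):\s*"
    r"(?P<spage>\d+)\s*[-\u2013]\s*(?P<epage>\d+)\.?$"
)


def parse_citation(text):
    """Return a dict of citation parts, or None if the style is not recognised."""
    match = CITATION_RE.match(text.strip())
    return match.groupdict() if match else None


# Made-up citation in the style the pattern expects.
parsed = parse_citation(
    "Smith J, Jones A (1999) A new species of frog. Journal of Examples 12: 34-56."
)

# The same made-up reference in a slightly different style is not recognised.
unparsed = parse_citation(
    "Smith, J. & Jones, A. 1999. A new species of frog. J. Ex. 12, 34-56."
)
```

Here `parsed["journal"]` is "Journal of Examples" and `parsed["spage"]` is "34", while `unparsed` is `None`, because the second string has no year in parentheses. Every new style needs either another pattern or a more permissive one.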

Regular expressions are the usual first tool for this conversion. They are powerful but also brittle, because you need to keep tweaking them to accommodate all the minor ways citation styles can differ. This leads to more sophisticated (and hopefully more robust) approaches, such as machine learning. Conditional random fields (CRF) are a popular technique, pioneered by tools like ParsCit and most recently used in the very elegant anystyle.io. You paste in a citation string and you get back that citation with all the component parts (authors, title, journal, pagination, etc.) separated out. Approaches like this require training data to teach the parser how to recognise the parts of a citation string. One obvious way to generate training data is to take a large bibliographic database and a set of "style sheets" describing all the ways different journals represent citations (e.g., citationstyles.org), then use the style sheets to render each record as a citation string whose parts are already known.
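To make the training data idea concrete, here is a rough sketch that renders one structured record in two citation styles and keeps a label for every token. The `STYLES` table is a hypothetical, much simplified stand-in for real style sheets, and the record is made up. Punctuation added by the style is not labelled here, which a real CRF training set would have to deal with.

```python
# Made-up structured record, as it might come from a bibliographic database.
RECORD = {
    "authors": "Smith J, Jones A",
    "year": "1999",
    "title": "A new species of frog",
    "journal": "Journal of Examples",
    "volume": "12",
    "pages": "34-56",
}

# Hypothetical "style sheets": ordered (field, text that follows the field) pairs.
STYLES = {
    "year_in_parens": [
        ("authors", " ("), ("year", ") "), ("title", ". "),
        ("journal", " "), ("volume", ": "), ("pages", "."),
    ],
    "year_after_authors": [
        ("authors", ". "), ("year", ". "), ("title", ". "),
        ("journal", ", "), ("volume", ", "), ("pages", "."),
    ],
}


def render_labelled(record, style):
    """Return (citation_string, [(token, label), ...]) for one record and style."""
    text = ""
    labelled = []
    for field, separator in style:
        value = record[field]
        text += value + separator
        # Every whitespace-separated token inherits the label of its field.
        labelled.extend((token, field) for token in value.split())
    return text, labelled


# One record yields one labelled training example per style.
training_examples = [render_labelled(RECORD, style) for style in STYLES.values()]
```

Run over a large database and many styles, the same loop would give as many labelled citation strings as you care to generate.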

Over time the need for citation parsing has declined somewhat, being replaced by simple full text search (exemplified by this Tweet).

Again, in the early days a common method of bibliographic search was to search by keys such as journal name (or ISSN), volume number, and starting page. So you had to atomise the reference into its parts, then search for something that matched those parts. This is tedious (OpenURL anyone?), but helps reduce false matches. If you only have a small bibliographic database, searching for a reference by string matching can be frustrating because you are likely to get lots of matches, but none of them to the reference you are actually looking for. Given how search works you'll pretty much always get some sort of match. What really helps is if the database has the answer to your search (this is one reason Google is so great: if you have indexed the whole web, chances are you have the answer somewhere already). Now that CrossRef's database has grown much larger you can search for a reference using a simple string search and be reasonably confident of getting a genuine hit. The need to atomise a reference for searching is disappearing.
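The difference between the two kinds of search can be sketched with a toy in-memory database. All records, ISSNs, and function names below are made up, and `difflib` stands in for a real search engine. The point is only that key lookup either hits or misses, whereas string matching always returns its best guess.

```python
from difflib import SequenceMatcher

# Hypothetical two-record database; the ISSNs and citations are made up.
DATABASE = [
    {"id": "ref1", "issn": "1234-5678", "volume": "12", "spage": "34",
     "citation": "Smith J, Jones A (1999) A new species of frog. "
                 "Journal of Examples 12: 34-56."},
    {"id": "ref2", "issn": "8765-4321", "volume": "3", "spage": "101",
     "citation": "Brown K (2005) Lizards of an imaginary island. "
                 "Another Journal 3: 101-110."},
]

# Atomised search: index every record by (ISSN, volume, starting page).
KEY_INDEX = {(r["issn"], r["volume"], r["spage"]): r for r in DATABASE}


def find_by_key(issn, volume, spage):
    """Exact lookup: returns the record, or None if the key is absent."""
    return KEY_INDEX.get((issn, volume, spage))


def find_by_string(query):
    """Similarity search: always returns the best-scoring record and its score."""
    scored = [
        (SequenceMatcher(None, query.lower(), r["citation"].lower()).ratio(), r)
        for r in DATABASE
    ]
    score, best = max(scored, key=lambda pair: pair[0])
    return best, score


hit = find_by_key("1234-5678", "12", "34")    # the ref1 record
miss = find_by_key("1234-5678", "12", "99")   # None: no such starting page

# A reference that is NOT in the database still gets a "best" match.
guess, guess_score = find_by_string(
    "Green P (2010) Beetles of somewhere else. Third Journal 7: 1-9."
)
```

With a small database the string search is the frustrating case described above: `guess` is never empty, so you have to decide from the score whether it is genuine. The bigger the database, the more often the top hit really is the reference you wanted.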

So, armed with a good database and good search tools we can avoid parsing references. Search also opens up other possibilities, such as finding citations using full text search. Given a reference, how do you find where it's been cited? One approach is to parse the text of a paper (A), extract the papers in the "literature cited" section (B, C, D, etc.), match those to a database, and add the "A cites B", "A cites C", etc. links to the database. This will answer "what papers does A cite?" but not "what other papers cite C?". One approach to that question would be to simply take the reference C, convert it to a citation string, then blast through all the full text you could find looking for matches to that citation string. Any matches are likely to be papers that cite reference C. In other words, you are finding that string in the "literature cited" section of the citing papers.
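A rough sketch of that reverse lookup, assuming the full text is already available as plain strings. The paper identifiers and texts are made up, and the normalisation is deliberately crude (it throws away punctuation and any non-ASCII letters) so that line breaks and dash variants don't block a match.

```python
import re


def normalise(text):
    """Lower-case the text and collapse runs of anything but a-z and 0-9 to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_citing_papers(citation_string, fulltexts):
    """Return ids of papers whose full text contains the citation string."""
    needle = normalise(citation_string)
    return [
        paper_id
        for paper_id, text in fulltexts.items()
        if needle in normalise(text)
    ]


# Made-up citation string for reference C.
CITATION_C = "Smith J, Jones A (1999) A new species of frog. Journal of Examples 12: 34-56."

# Made-up full texts; paperA cites C across a line break and with an en dash.
FULLTEXTS = {
    "paperA": "Introduction ...\nLiterature cited\nSmith J, Jones A (1999) A new species of\n"
              "frog. Journal of Examples 12: 34\u201356.\n",
    "paperB": "Introduction ...\nLiterature cited\nBrown K (2005) Lizards of an imaginary island.",
}

citing = find_citing_papers(CITATION_C, FULLTEXTS)
```

Here `citing` is `["paperA"]`. In practice you would presumably render C in several citation styles and search for each, since you don't know in advance which style the citing paper uses.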

So, to summarise:

  1. To recognise and extract citations as structured data from text we can use regular expressions and/or machine learning.
  2. Training data for machine learning can be generated from existing bibliographic data coupled with rules for generating citation strings.
  3. As bibliographic databases grow in size the need for extracting and parsing citations diminishes. Our databases will have most of the citations already, so that using search is enough to find what we want.
  4. To build a citation database we can parse the literature cited section and extract all references cited by a paper ("X cites").
  5. Another approach to building a citation database is to tackle the reverse question, namely "X is cited by". This can be done by a full text search for citation strings corresponding to X.

How does this relate to citing specimens, you ask? Well, I think the parallels are very close:

  • We could use CRF approaches to have something like anystyle.io for specimens. Paste in a specimen from the "Materials examined" section of a paper and have it resolved into its component parts (e.g., collector, locality, date).
  • We have a LOT of training data in the form of GBIF. Just download data in Darwin Core format, apply various rules for how specimens are cited in the literature, and we have our training data.
  • Using our specimen parser we could process the "Materials examined" section of a paper to find the specimens. (Plazi extracts specimens from papers, although it's not clear to me how automated this is.)
  • We could also do the reverse: take a Darwin Core Archive for, say, a single institution, generate all the specimen citation strings you'd expect to see people use in their papers, then go search through the full text of papers (e.g., in PubMed Central and BHL) looking for those strings - those are citations of your specimens.
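The last idea in the list above might look something like this. The keys `occurrenceID`, `institutionCode`, and `catalogNumber` are Darwin Core terms, but the records, the institution code "XYZ", the separator variants, and the paper texts are all made up. Real citation styles for specimens would need to be worked out from the literature.

```python
import re

# Hypothetical separators people might put between institution code and number.
SEPARATORS = (" ", "", "-", ":", " No. ")


def specimen_citation_strings(record):
    """Generate the ways a specimen might be written in a paper."""
    code = record["institutionCode"]
    number = record["catalogNumber"]
    return [f"{code}{separator}{number}" for separator in SEPARATORS]


def find_specimen_citations(records, fulltexts):
    """Return (occurrenceID, paper_id, matched_string) for every match found."""
    hits = []
    for record in records:
        for variant in specimen_citation_strings(record):
            # Don't match inside a longer code (e.g. "WXYZ") or number ("123456").
            pattern = re.compile(
                r"(?<![A-Za-z0-9])" + re.escape(variant) + r"(?![0-9])"
            )
            for paper_id, text in fulltexts.items():
                if pattern.search(text):
                    hits.append((record["occurrenceID"], paper_id, variant))
    return hits


# Made-up records, as if read from a Darwin Core Archive.
RECORDS = [
    {"occurrenceID": "urn:example:1", "institutionCode": "XYZ", "catalogNumber": "12345"},
]

# Made-up full texts.
PAPERS = {
    "paper1": "Material examined: holotype XYZ 12345, collected by A. Collector.",
    "paper2": "Paratype XYZ 123456 from the same locality.",
    "paper3": "See also XYZ-12345.",
}

specimen_hits = find_specimen_citations(RECORDS, PAPERS)
```

Only `paper1` and `paper3` are reported; `paper2` mentions a different, longer catalogue number. Matching on code and number alone will be ambiguous for short codes, so collector, locality, and date from the same record could be used to confirm a hit.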

There seems to be a lot of scope for learning from the experience of people working with bibliographic citations, especially how to build parsers, and the role that "stylesheets" could play in helping to understand how people cite specimens. Obviously, a lot of this would be unnecessary if there were a culture of using and citing persistent identifiers for specimens, but we seem to be a long way from that just yet.