iPhylo: material examined

Roderic D. M. Page

Showing posts with label material examined. Show all posts

Friday, May 27, 2022

Round trip from identifiers to citations and back again

Note to self (basically rewriting last year's Finding citations of specimens).

Bibliographic data supports going from identifier to citation string and back again, so we can do a "round trip."

1.

Given a DOI we can get structured data with a simple HTTP fetch, then use a tool such as citation.js to convert that data into a human-readable string in a variety of formats.

Identifier	⟶	Structured data	⟶	Human readable string
10.7717/peerj-cs.214	HTTP with content-negotiation	CSL-JSON	CSL templates	Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214

2.

Going in the reverse direction (string to identifier) is a little more challenging. In the "old days" a typical strategy was to attempt to parse the citation string into structured data (see AnyStyle for a nice example of this), then we could extract a truple of (journal, volume, starting page) and use that to query CrossRef to find if there was an article with that tuple, which gave us the DOI.

Identifier	⟵	Structured data	⟵	Human readable string
10.7717/peerj-cs.214	OpenURL query	journal, volume, start page	Citation parser	Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214

3.

Another strategy is to take all the citations strings for each DOI, index those in a search engine, then just use a simple search to find the best match to your citation string, and hence the DOI. This is what https://search.crossref.org does.

Identifier	⟵	Human readable string
10.7717/peerj-cs.214	search	Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214

At the moment my work on material citations (i.e., lists of specimens in taxonomic papers) is focussing on 1 (generating citations from specimen data in GBIF) and 2 (parsing citations into structured data).

Tuesday, April 21, 2015

Looking up specimen codes in GBIF using Google Spreadsheet

Playing with the my "material examined" tool I've been working on, I wondered whether I could make use of it in, say, a spreadsheet. Imagine that I have a spreadsheet of museum codes and want to look those up in GBIF. I could create a service for Open Refine but Open Refine is a bit big and clunky, you have to fire up a Java application and point your browser at it, and Open Refine isn't as intuitive or as flexible as a spreadsheet.

It turns that Google Spreadsheets supports custom functions, including importing JSDON from a remote data source. Following How to import JSON data into Google Spreadsheets in less than 5 minutes here's what to do:

Create a new Google Spreadsheet.
Click on Tools -> Script Editor.
Click Create script for Spreadsheet.
Delete the placeholder content and paste the code from this script.
Rename the script to ImportJSON.gs and click the save button.
Back in the spreadsheet, in a cell, you can type “=ImportJSON()” and begin filling out it’s parameters.

Lets imagine we have a spreadsheet with a specimen code in cell A1, e.g. "FMNH 187122".

To call the material examined service, we need a function like this:

=ImportJSON(CONCATENATE("http://bionames.org/~rpage/material-examined/service/api.php?code=",A1,"&match&extend=10"), "/hits/key,/hits/scientificName", "noHeaders")

Paste this into cell B1 (i.e., just to the right of the specimen code) and after a short delay you should see something like this:

The three parameters supplied to ImportJSON are are the query URL, written as a spreadsheet function that grabs the specimen code from cell A1, a list of the bits of data we want to extract from the result (expressed as JSON paths), and some options (in this case, don't show the headers). ImportJSON will grab the specimen code in cell A1, add it to the query URL, then output the results. You should see something like this:

The first column is the GBIF occurrence ID, the second is the scientific name (you can add more JSON paths to get more fields).

Note that we have multiple rows as there is more than one specimen with the code "FMNH 187122" in GBIF. Now, we can ask the material examined service to return only certain taxa (such as mammals) by adding the "scientificName" parameter:

=ImportJSON(CONCATENATE("http://bionames.org/~rpage/material-examined/service/api.php?code=",A10,"&scientificName=",B10,"&match&extend=10"), "/hits/key,/hits/scientificName", "noHeaders")

If you put the specimen code in cell A10, and the higher taxon "Mammalia" in cell B10, and paste the function above into cell C10, then you should see something like this:

Note that now we have a single row with the mammal specimen.

It's a little bit fussy (you need to get the ImportJSON script, and mess a bit with the parameters but it's quick and flexible, and you get all the power of a spreadsheet to help clean the data before trying to match it to GBIF. Plus you can do it all in your browser.