Wednesday, October 24, 2018

Specimens, collections, researchers, and publications: towards social and citation graphs for natural history collections

Being in Ottawa last week for a hackathon meant I could catch up with David Shorthouse (@dpsSpiders). David has been doing some neat work on linking specimens to identifiers for researchers, such as ORCIDs, and tracking citations of specimens in the literature.

David's Bloodhound tool processes lots of GBIF data for occurrences with names of those who collected or identified specimens. If you have an ORCID (and if you are a researcher you really should) then you can "claim" your specimens simply by logging in with your ORCID. My modest profile lists New Zealand crabs I collected while an undergraduate at Auckland University.

Unlike many biodiversity projects, Bloodhound is aimed squarely at individual researchers: it provides a means for you to show your contribution to collecting and identifying the world's biodiversity. This raises the possibility of one day being able to add this information to your ORCID profile (in the way that ORCID can currently record your publications, data sets, and other work attached to a DOI). As David explains:

A significant contributing factor for this apparent neglect is the lack of a professional reward system; one that articulates and quantifies the breadth and depth of activities and expertise required to collect and identify specimens, maintain them, digitize their labels, mobilize the data, and enhance these data as errors and omissions are identified by stakeholders. If people throughout the full value-chain in natural history collections received professional credit for their efforts, ideally recognized by their administrators and funding bodies, they would prioritize traditionally unrewarded tasks and could convincingly self-advocate. Proper methods of attribution at both the individual and institutional level are essential.

Attribution at institutional level is an ongoing theme for natural history collections: how do they successfully demonstrate the value of their collections?

Mark Carnall's (@mark_carnall) tweet illustrates the mismatch between a modern world of interconnected data and the reality of museums trying to track usage of their collections by requesting reprints. The idea of tracking citations of specimens and/or collections has been around for a while. For example, I did some work text mining BioStor for museum specimen codes, Ross Mounce and Aime Rankin have worked on tracking citations of Natural History Museum specimens, and there is the clever use of Google Scholar by Winker and Withrow (see "The impact of museum collections: one collection ≈ one Nobel Prize").
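To give a flavour of what text mining for specimen codes involves, here is a minimal Python sketch. It is not the code used for BioStor or by any of the projects mentioned above; the regular expression, the function names, and the example acronyms are all assumptions made for illustration. The idea is simply to look for an institutional acronym followed by a catalogue number, and keep only matches whose acronym is on a list of known collections.

```python
import re
from collections import Counter

# Hypothetical pattern: an institution acronym (2-6 capital letters), an
# optional separator, then a catalogue number that may contain dots,
# e.g. "CMN 12345" or "NHMUK 2010.11.2". Real specimen codes are far messier.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})[ :-]?(\d[\d.]*\d|\d)\b")


def find_specimen_codes(text, acronyms):
    """Return (acronym, catalogue number) pairs, keeping known acronyms only."""
    return [
        (match.group(1), match.group(2))
        for match in SPECIMEN_CODE.finditer(text)
        if match.group(1) in acronyms
    ]


def count_by_institution(text, acronyms):
    """Count candidate specimen citations per institution acronym."""
    return Counter(acronym for acronym, _ in find_specimen_codes(text, acronyms))


if __name__ == "__main__":
    sample = (
        "Material examined: CMN 12345, CMN 12346 and NHMUK 2010.11.2; "
        "see also DNA 2018."
    )
    known = {"CMN", "NHMUK"}  # illustrative acronyms, not a real registry
    print(find_specimen_codes(sample, known))
    print(count_by_institution(sample, known))
```

Filtering against a list of known acronyms is what keeps things like "DNA 2018" out of the results, which is one reason a shared registry of collection identifiers matters so much for this kind of work.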

David has developed a nice tool that shows citations of specimens and/or collections from the Canadian Museum of Nature.

I'm sure many natural history collections would love a tool like this!

Note the "doughnuts" showing the attention each publication is receiving. These doughnuts are possible only because the publishing industry got together and adopted the same identifier system (DOIs). The existence of persistent identifiers enables a whole ecosystem to emerge based around those identifiers (and services to support those identifiers).

The biodiversity community has failed to achieve something similar, despite several attempts. Part of the problem is the cargo-cult obsession with "identifiers" rather than focussing on the bigger picture. So we have various attempts to create identifiers for specimens (see "Use of globally unique identifiers (GUIDs) to link herbarium specimen records to physical specimens" for a review), but little thought given to how to build an ecosystem around those identifiers. We seem doomed to recreate all the painful steps publishers went through as they created a menagerie of identifiers (e.g., SICIs, PII) and alternative linking strategies ("just in time" versus "just in case") until they settled on managed identifiers (DOIs) with centralised discovery tools (provided by CrossRef).

Specimen-level identifiers are potentially very useful, especially for cross-linking records in GBIF, GenBank, and BOLD, as well as tracking citations, but not every taxonomic community has a history of citing specimens individually. Hence we may also want to count citations at collection and institutional level. Once again we run into the issue that we lack persistent, widely used identifiers. The GRBio project to assign such identifiers has died, despite appeals to the community for support (see "GRBio: A Call for Community Curation - what community?"). Given Wikidata's growing role as an identity broker, a sensible strategy might be to focus on having every collection and institution in Wikidata (many are already) and add the relevant identifiers there. For example, Index Herbariorum codes are now a recognised property in Wikidata, as seen in the entry for Cambridge University Herbarium (CGE).
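As a rough sketch of what using Wikidata as an identity broker could look like in practice, the Python below builds a SPARQL query that asks the public Wikidata query service for the item carrying a given herbarium code. The property ID used here (P5858) is my assumption for the Index Herbariorum code property and should be checked against Wikidata before use; the function names are hypothetical. The sketch only constructs the query and request URL, leaving the actual HTTP request to whatever client you prefer.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

# Assumption: P5858 is the Wikidata property for "Index Herbariorum code".
# Verify the property ID on Wikidata before relying on it.
IH_CODE_PROPERTY = "P5858"


def build_query(code, prop=IH_CODE_PROPERTY):
    """Build a SPARQL query for items whose herbarium code equals `code`."""
    # Accept only letters, digits, and hyphens so the code cannot break
    # out of the quoted literal in the query.
    if not code.replace("-", "").isalnum():
        raise ValueError(f"unexpected characters in code: {code!r}")
    return (
        "SELECT ?item ?itemLabel WHERE { "
        f'?item wdt:{prop} "{code}" . '
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } '
        "}"
    )


def build_url(code):
    """Return a GET URL that requests JSON results for the query."""
    params = {"query": build_query(code), "format": "json"}
    return WIKIDATA_SPARQL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query("CGE"))
    print(build_url("CGE"))
```

The same pattern would work for any other collection-level identifier that has a Wikidata property, which is the appeal of parking these identifiers in one place rather than in a project-specific registry.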

But we will need more than technical solutions; we will also need compelling drivers to track specimen and collection use. The success of CrossRef has been due in part to the network effects inherent in the citation graph. Each publisher has a vested interest in using DOIs because other CrossRef members will include those DOIs in the list of literature cited, which means that each publisher potentially gets traffic from other members. Companies like Altmetric (of doughnut fame) make money by selling data on the attention papers receive to publishers and academic institutions, based on tracking mentions of identifiers. Perhaps natural history collections should follow their lead and ask how they can get an equivalent system. In other words, how do we scale tools such as the Canadian Museum of Nature citation tracker across the whole network? And in particular, what services do you want and how much would those services be worth to you?