Friday, July 23, 2021

Species Cite: linking scientific names to publications and taxonomists

I've made Species Cite live. This is a web site I've been working on with the GBIF Challenge as a notional deadline so I'll actually get something out the door.

"Species Cite" takes as its inspiration the suggestion that citing original taxonomic descriptions (and subsequent revisions) would increase citation metrics for taxonomists, and give them the credit they deserve. Regardless of the merits of this idea, it is difficult to implement because we don’t have an easy way of discovering which paper we should cite. Species Cite tackles this by combining millions of taxonomic name records linked to LSIDs with bibliographic data from Wikidata to make it easier to cite the sources of taxonomic names. Where possible it provides access to PDFs for articles using Internet Archive or Unpaywall. These can be displayed in an embedded PDF viewer. Given the original motivation of surfacing the work of taxonomists, Species Cite also attempts to display information about the authors of a taxonomic paper, such as ORCID and/or Wikidata identifiers, and an avatar for the author via either Wikidata or ResearchGate. This enables us to get closer to the kind of social interface found in citizen science projects like iNaturalist, where participants are people with personalities, not simply strings of text. Furthermore, by identifying people and associating them with taxa, it could help us discover who the experts on particular taxonomic groups are, and also enable those people to easily establish that they are, in fact, experts.

How it works

Under the hood there are a lot of moving pieces. The taxonomic names come from a decade or more of scraping LSIDs from various taxonomic databases, primarily ION, IPNI, Index Fungorum, and Nomenclator Zoologicus. Given that these LSIDs are often offline, I built two caches to make them accessible (see It's been a while...).

The bibliographic data is stored in Wikidata, and I've built an app to explore that data (see Wikidata and the bibliography of life in the time of coronavirus) and also a simple search engine to find things quickly (see Towards a WikiCite search engine). I've also spent way more time than I'd care to admit adding taxonomic literature to Wikidata and Internet Archive.

The mapping between names and literature is based on work I've done with BioNames and various unpublished projects.

To make things a bit more visually interesting I've used images of taxa from Phylopic, and also harvested images from ResearchGate to supplement the rather limited number of images of taxonomists in Wikidata.

One of the things I've tried to do is avoid making new databases, as those often die from neglect. Hence the use of Wikidata for bibliographic data. The taxonomic data is held in static files in the LSID caches. The mapping between names and publications is a single (large) tab-delimited file that is searched on disk using a crude binary search on a sorted list of taxonomic names. This means you can download the GitHub repository and be up and running without installing a database. Likewise the LSID caches use static (albeit compressed) files. The only database directly involved is the WikiCite search engine.
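To give a flavour of what a binary search over a sorted file on disk can look like, here is a minimal Python sketch. It is not the code Species Cite actually runs. It assumes the taxonomic name is the first tab-separated column and that rows are sorted by the UTF-8 bytes of that name. The function names, the file name and the "pub-00x" values are made up for illustration.

```python
import os
import tempfile


def write_sorted_tsv(path, rows):
    """Write rows (tuples of strings) sorted by the UTF-8 bytes of column one."""
    ordered = sorted(rows, key=lambda r: r[0].encode("utf-8"))
    with open(path, "wb") as f:
        for row in ordered:
            f.write("\t".join(row).encode("utf-8") + b"\n")


def _seek_line_start(f, pos):
    """Move f to the start of the first line beginning at or after byte pos."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # skip to the end of the line containing byte pos - 1


def _key(line):
    """First tab-separated column of a raw line, without the line ending."""
    return line.split(b"\t", 1)[0].rstrip(b"\r\n")


def lookup(path, name):
    """Return every row (as a list of fields) whose first column equals name."""
    target = name.encode("utf-8")
    with open(path, "rb") as f:
        # Binary search over byte offsets for the first line with key >= target.
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            _seek_line_start(f, mid)
            line = f.readline()
            if not line or _key(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        # Read forward from that line, collecting rows while the name matches.
        _seek_line_start(f, lo)
        rows = []
        for line in iter(f.readline, b""):
            if _key(line) != target:
                break
            rows.append(line.rstrip(b"\r\n").decode("utf-8").split("\t"))
        return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "names.tsv")  # hypothetical file name
        write_sorted_tsv(path, [
            ("Zus yus", "pub-004"),  # placeholder values, not real records
            ("Garcinia nuntasaenii", "pub-003"),
            ("Aus bus", "pub-001"),
        ])
        print(lookup(path, "Garcinia nuntasaenii"))
```

Each lookup only touches a handful of lines no matter how big the file is, which is why a flat file can stand in for a database here. The catch is that the file has to be sorted in exactly the order the comparison uses (raw bytes in this sketch).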

Once all the moving bits come together, you can start to display things like a plant species together with its original description and the taxonomists who described that species (Garcinia nuntasaenii):

What's next

There is still so much to do. I need to add lots of taxonomic literature from my own BioStor project and other sources, and the bibliographic data in Wikidata needs constant tending and improving (which is happening, see Preprint on Wikidata and the bibliography of life). And at some point I need to think about how to get the links between names and literature into Wikidata.

Anyway, the web site is live at https://species-cite.herokuapp.com.

Update

I've created a 5 min screencast walking you through the site.