Friday, July 23, 2021

Species Cite: linking scientific names to publications and taxonomists

I've made Species Cite live. This is a web site I've been working on with the GBIF Challenge as a notional deadline so I'll actually get something out the door.

"Species Cite" takes as its inspiration the suggestion that citing original taxonomic descriptions (and subsequent revisions) would increase citation metrics for taxonomists, and give them the credit they deserve. Regardless of the merits of this idea, it is difficult to implement because we don’t have an easy way of discovering which paper we should cite. Species Cite tackles this by combining millions of taxonomic name records linked to LSIDs with bibliographic data from Wikidata to make it easier to cite the sources of taxonomic names. Where possible it provides access to PDFs for articles via the Internet Archive or Unpaywall. These can be displayed in an embedded PDF viewer. Given the original motivation of surfacing the work of taxonomists, Species Cite also attempts to display information about the authors of a taxonomic paper, such as ORCID and/or Wikidata identifiers, and an avatar for the author via either Wikidata or ResearchGate. This enables us to get closer to the kind of social interface found in citizen science projects like iNaturalist, where participants are people with personalities, not simply strings of text. Furthermore, by identifying people and associating them with taxa, it could help us discover who the experts on particular taxonomic groups are, and also enable those people to easily establish that they are, in fact, experts.

How it works

Under the hood there are a lot of moving pieces. The taxonomic names come from a decade or more of scraping LSIDs from various taxonomic databases, primarily ION, IPNI, Index Fungorum, and Nomenclator Zoologicus. Given that these LSIDs are often offline I built caches one and two to make them accessible (see It's been a while...).

The bibliographic data is stored in Wikidata, and I've built an app to explore that data (see Wikidata and the bibliography of life in the time of coronavirus) and also a simple search engine to find things quickly (see Towards a WikiCite search engine). I've also spent way more than I'd care to admit adding taxonomic literature to Wikidata and Internet Archive.

The map between names and literature is based on work I've done with BioNames and various unpublished projects.

To make things a bit more visually interesting I've used images of taxa from Phylopic, and also harvested images from ResearchGate to supplement the rather limited number of images of taxonomists in Wikidata.

One of the things I've tried to do is avoid making new databases, as those often die from neglect. Hence the use of Wikidata for bibliographic data. The taxonomic data is held in static files in the LSID caches. The mapping between names and publications is a single (large) tab-delimited file that is searched on disk using a crude binary search on a sorted list of taxonomic names. This means you can download the GitHub repository and be up and running without installing a database. Likewise the LSID caches use static (albeit compressed) files. The only database directly involved is the WikiCite search engine.
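To give a feel for the "binary search on disk" idea, here is a minimal Python sketch. It is not the code in the repository: the file layout (name in the first column, a publication identifier in the second), the function names, and the requirement that the file be sorted by the raw bytes of the name column are all assumptions made for illustration.

```python
import os
import tempfile


def _name(line):
    """First tab-separated column of a raw line (assumed to be the name)."""
    return line.split(b"\t", 1)[0].rstrip(b"\r\n")


def _next_line_start(f, pos):
    """Offset of the first line that starts at or after byte `pos`."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # consume the rest of the line containing byte pos - 1
    return f.tell()


def lookup(path, name):
    """Return every row whose first column equals `name`.

    Assumes the file is sorted by the bytes of the first column.
    """
    target = name.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line has a name >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(_next_line_start(f, mid))
            line = f.readline()
            if not line or _name(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        # Read forward, collecting the run of rows that match exactly.
        f.seek(_next_line_start(f, lo))
        rows = []
        for line in f:
            if _name(line) != target:
                break
            rows.append(line.decode("utf-8").rstrip("\r\n").split("\t"))
        return rows


if __name__ == "__main__":
    # Hypothetical rows; the identifiers are placeholders, not real Wikidata items.
    data = [
        ("Garcinia nuntasaenii", "PLACEHOLDER-1"),
        ("Aa", "PLACEHOLDER-2"),
        ("Zea mays", "PLACEHOLDER-3"),
    ]
    data.sort(key=lambda row: row[0].encode("utf-8"))
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                     encoding="utf-8", newline="\n") as tmp:
        for row in data:
            tmp.write("\t".join(row) + "\n")
    print(lookup(tmp.name, "Garcinia nuntasaenii"))
    os.remove(tmp.name)
```

The appeal of this design is that a lookup only touches a handful of lines, so even a very large file can be queried without loading it into memory or running a database server.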

Once all the moving bits come together, you can start to display things like a plant species together with its original description and the taxonomists who described that species (Garcinia nuntasaenii):

What's next

There is still so much to do. I need to add lots of taxonomic literature from my own BioStor project and other sources, and the bibliographic data in Wikidata needs constant tending and improving (which is happening, see Preprint on Wikidata and the bibliography of life). And at some point I need to think about how to get the links between names and literature into Wikidata.

Anyway, the web site is live at https://species-cite.herokuapp.com.

Update

I've created a 5 min screencast walking you through the site.

Thursday, July 22, 2021

Towards a WikiCite search engine

I've released a simple search engine for publications in Wikidata. Wikicite Search takes its name from the WikiCite project, which was an initiative to create a bibliographic database in Wikidata. Since bibliographic data is a core component of taxonomic research (arguably taxonomy is mostly tracing the fate of the "tags" we call taxonomic names) I've spent some time getting taxonomic literature into Wikidata. Since there are bots already adding articles by harvesting sources such as CrossRef and PubMed, I've focussed on literature that is harder to add, such as articles with non-CrossRef DOIs, or those without DOIs at all.

Once you have a big database, you are then faced with the challenge of finding things in that database. Wikidata supports generic search, but I wanted something more closely geared to bibliographic data. Hence Wikicite Search. Over the last few years I've made several attempts at a bibliographic search engine. For this project I've finally settled on some basic ideas:

  1. The core data structure is CSL-JSON, a simple but rich JSON format for expressing bibliographic data.
  2. The search engine is Elasticsearch. The documents I upload include the CSL-JSON for an article, but also a simple text representation of the article. This text representation may include multiple languages if, for example, the article has a title in more than one language. This means that if an article has both English and Chinese titles you can find it searching in either language.
  3. The web interface is very simple: search for a term, get results. If the search term is a Wikidata identifier you get just the corresponding article, e.g. Q98715368.
  4. There is a reconciliation API to help match articles to the database. Paste in one citation per line and you get back matches (if found) for each citation.
  5. Where possible I display a link to a PDF of the article, which is typically stored in the Internet Archive or accessible via the Wayback Machine.
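As a rough illustration of the first two points, the Python sketch below builds the kind of document that could be uploaded to Elasticsearch: the CSL-JSON record itself plus a flat text field that includes titles in more than one language. The document layout (`id`, `csl`, `text`), the `extra_titles` parameter, and the example record are assumptions for illustration, not the actual schema used by Wikicite Search. The CSL-JSON field names (`author`, `issued`, `title`, `container-title`, `volume`, `page`) are standard.

```python
import json


def flatten_csl(csl, extra_titles=()):
    """Build a plain-text representation of a CSL-JSON record for searching."""
    parts = []
    for author in csl.get("author", []):
        name = " ".join(
            x for x in (author.get("given"), author.get("family"),
                        author.get("literal")) if x
        )
        parts.append(name)
    date_parts = csl.get("issued", {}).get("date-parts") or [[]]
    if date_parts[0]:
        parts.append(str(date_parts[0][0]))  # year only
    parts.append(csl.get("title", ""))
    parts.extend(extra_titles)  # titles in other languages (assumed input)
    for field in ("container-title", "volume", "page"):
        parts.append(str(csl.get(field, "")))
    return " ".join(p for p in parts if p)


def build_document(csl, extra_titles=()):
    """Wrap the CSL-JSON and its text form in one document (assumed layout)."""
    return {
        "id": csl["id"],
        "csl": csl,
        "text": flatten_csl(csl, extra_titles),
    }


if __name__ == "__main__":
    # A made-up record; the id is a placeholder, not a real Wikidata item.
    record = {
        "id": "Q-EXAMPLE",
        "type": "article-journal",
        "author": [{"family": "Example", "given": "A"}],
        "issued": {"date-parts": [[2021]]},
        "title": "An example title",
        "container-title": "Journal of Examples",
        "volume": "1",
        "page": "1-10",
    }
    doc = build_document(record, extra_titles=["示例标题"])
    print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Because both titles end up in the same `text` field, a query in either language can match the same document, while the untouched `csl` field is what gets returned to the user.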

There are millions of publications in Wikidata, but currently fewer than half a million are in my search engine. My focus is narrowly on eukaryote taxonomy and related topics. I will be adding more articles as time permits. I also periodically reload existing articles to capture updates to the metadata made by the Wikidata community - being a wiki, the data in Wikidata is constantly evolving.

My goal is to have a simple search tool that focusses on matching citation strings. In other words, it is designed to find a reference you are looking for, rather than be a tool to search the taxonomic literature. If that sounds contradictory, consider that my tool will only find a paper about a taxon if it is explicitly named in the title. A more sophisticated search engine would support things like synonym resolution, etc.

The other reason I built this is to provide an API for accessing Wikidata items and displaying them in other formats. For example, an article in the WikiCite search engine can be retrieved in CSL-JSON format, or in RDF as JSON-LD.

As always, it's very early days. But I don't think it's unreasonable to imagine that, as Wikidata grows, we could have a search engine that includes the bulk of the taxonomic literature.

Citation parsing tool released

Quick note on a tool I've been working on to parse citations, that is to take a series of strings such as:

  • Möllendorff O (1894) On a collection of land-shells from the Samui Islands, Gulf of Siam. Proceedings of the Zoological Society of London, 1894: 146–156.
  • de Morgan J (1885) Mollusques terrestres & fluviatiles du royaume de Pérak et des pays voisins (Presqúile Malaise). Bulletin de la Société Zoologique de France, 10: 353–249.
  • Morlet L (1889) Catalogue des coquilles recueillies, par M. Pavie dans le Cambodge et le Royaume de Siam, et description ďespèces nouvelles (1). Journal de Conchyliologie, 37: 121–199.
  • Naggs F (1997) William Benson and the early study of land snails in British India and Ceylon. Archives of Natural History, 24:37–88.

and return structured data. This is an old problem, and pretty much a "solved" problem. See for example AnyStyle. I've played with AnyStyle and it's great, but I had to install it on my computer rather than simply use it as a web service. I also wanted to explore the approach a bit more as a possible model for finding citations of specimens.

After trying to install the underlying conditional random fields (CRF) engine used by AnyStyle and running into a bunch of errors, I switched to a tool I could get working, namely CRF++. After figuring out how to compile a C++ application to run on Heroku I started to wonder how to use this as the basis of a citation parser. Fortunately, I had used the Perl-based ParsCit years ago, and managed to convert the relevant bits to PHP and build a simple web service around it.

Although I've abandoned the Ruby-based AnyStyle I do use AnyStyle's XML format for the training data. I also built a crude editor to create small training data sets that uses a technique published by the author of the blogging tool I'm using to write this post (see MarsEdit Live Source Preview). Typically I use this to correctly annotate examples where the parser failed. Over time I add these to the training data and the performance gets better.

This is pretty much a side project of a side project, but ultimately the goal is to employ it to help extract citation data from publications, both to generate data to populate BioStor, and also to start to flesh out the citation graph for publications in Wikidata.

If you want to play with the tool it is at https://citation-parser.herokuapp.com. At the moment it takes some citation strings and returns the result in CSL-JSON, which is becoming the default way to represent structured bibliographic data. Code is on GitHub.
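To show what "citation string in, CSL-JSON out" looks like, here is a toy Python sketch. It does not use a CRF model and is not how the tool works: it swaps in a single regular expression that only handles the simple "Author (Year) Title. Journal, volume: pages." pattern of the examples above. The function name and the exact fields returned are assumptions, and the real service's output may differ.

```python
import json
import re

# One rigid pattern: "Authors (Year) Title. Journal, volume: pages."
CITATION = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\s+"
    r"(?P<title>.+)\.\s+"
    r"(?P<journal>[^.]+?),\s*"
    r"(?P<volume>[^:]+):\s*"
    r"(?P<pages>[\d–-]+)\.?$"
)


def parse_citation(text):
    """Return a CSL-JSON style dict, or None if the pattern does not match."""
    match = CITATION.match(text.strip())
    if match is None:
        return None
    authors = []
    for name in match.group("authors").split(","):
        name = name.strip()
        if " " in name:
            # Assume "Family Initials", e.g. "Naggs F" or "de Morgan J".
            family, given = name.rsplit(" ", 1)
            authors.append({"family": family, "given": given})
        elif name:
            authors.append({"literal": name})
    return {
        "type": "article-journal",
        "author": authors,
        "issued": {"date-parts": [[int(match.group("year"))]]},
        "title": match.group("title").strip(),
        "container-title": match.group("journal").strip(),
        "volume": match.group("volume").strip(),
        # Normalise the en dash in page ranges to a plain hyphen.
        "page": match.group("pages").replace("–", "-"),
    }


if __name__ == "__main__":
    citation = (
        "Naggs F (1997) William Benson and the early study of land snails "
        "in British India and Ceylon. Archives of Natural History, 24:37–88."
    )
    print(json.dumps(parse_citation(citation), ensure_ascii=False, indent=2))
```

A fixed pattern like this breaks as soon as the citation style changes, which is exactly why a trained sequence model such as a CRF is the usual choice for this problem.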