Friday, March 18, 2016

The Plant List, GBIF, and the primary literature

TL;DR: The Plant List is now in GBIF

Readers of this blog may recall that I've had a somewhat jaundiced view of The Plant List. The first version was released under a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license, which allowed copying so long as you didn't create a derived work (The Plant List: nice data, shame it's not open). This is frankly about the silliest possible license for a data set as, from my perspective, the whole reason for releasing data is so that it can be combined and enhanced with other data.

The second release (version 1.1) dropped an explicit CC license in favour of almost the reverse position (!). You can't copy the list "as is" without permission, but you can make derivative works "without prior written permission from us" (see Terms of Use for The Plant List). Progress, of a sort.

So, for the last week I've been working on getting a version of The Plant List into GBIF, and I've finally managed to achieve this. There isn't a single place you can grab the whole Plant List, so you have to scrape the web site for CSV files, then glue them together. I could argue that converting the data into the Darwin Core Archive format is a derived work, but in case this seems not derivative enough (of course, nobody seems ready to define just what "derived" actually means) I started to augment the list of names by adding bibliographic identifiers. I've long argued (see e.g. Surfacing the deep data of taxonomy) that a fundamental limitation of existing taxonomic databases is that they don't explicitly link to the primary literature. This is why I built BioNames, and why I've been working to link the "micro citations" in IPNI to identifiers such as DOIs, JSTOR links, BioStor URLs and BHL page links (see project on github). So, I've added about 120,000 DOIs and JSTOR links to names in the Plant List. This is a subset of the links I've found for IPNI, but for this first release I've tried to keep things simple. I've also made the link between Plant List name and DOI/JSTOR via the IPNI identifier for a name, and the Plant List has omitted quite a few IPNI ids for reasons which aren't clear.
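The gluing and linking steps described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the column names (name, ipni_id), the sample rows, and the identifier values are all invented placeholders, and the real scraped CSV files will have a different layout.

```python
import csv
import io

# Hypothetical stand-ins for two scraped per-family CSV files.
# Column names and all values are invented for illustration.
FAMILY_A = "name,ipni_id\nExamplea alba,ipni-0001\nExamplea rubra,\n"
FAMILY_B = "name,ipni_id\nSamplea minor,ipni-0002\n"

# Hypothetical lookup from IPNI identifier to a literature link.
IPNI_TO_LINK = {
    "ipni-0001": "https://doi.org/10.9999/example.1",
}


def read_rows(csv_text):
    """Parse the text of one CSV file into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def glue(csv_texts):
    """Concatenate the rows of several CSV files that share one header."""
    rows = []
    for text in csv_texts:
        rows.extend(read_rows(text))
    return rows


def add_links(rows, ipni_to_link):
    """Attach a link to every row whose IPNI id is present and known."""
    out = []
    for row in rows:
        ipni_id = (row.get("ipni_id") or "").strip()
        link = ipni_to_link.get(ipni_id, "") if ipni_id else ""
        out.append({**row, "link": link})
    return out


names = add_links(glue([FAMILY_A, FAMILY_B]), IPNI_TO_LINK)
linked = [row for row in names if row["link"]]
```

Names with no IPNI id (the second row here) simply end up without a link, which is why the omitted IPNI ids matter: the join has nothing to hang a DOI or JSTOR link on.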

The Plant List version I've created is available in GBIF. Having another list of plant names will be a useful addition to the checklists that GBIF already has, even if the Plant List is already somewhat out of date.


One feature of the enhanced Plant List in GBIF is that for a subset of names (currently about 10%) there are direct links to the original publication of that name. For example, the record for Haniffia albiflora in the Plant List has a fairly cryptic bibliographic citation, Nordic J. Bot. 20: 287 2000, and no link to that publication. In the version I've uploaded to GBIF the name Haniffia albiflora looks like this:

[Image: Haniffia]

Note the full citation. But more importantly, the Publisher record link is the DOI, so clicking on it takes you to the original description of this species:

[Image: Doi4]

There is a lot of plant taxonomic literature available in JSTOR, sadly most of it (along with specimen images) behind a paywall (see Why are botanists locking away their data in JSTOR Plant Science?). Some of the links from GBIF take you to JSTOR:

[Image: Doi3]

The DOI landscape is evolving, and there are now multiple DOI registration agencies minting DOIs for scientific papers. CrossRef provides easily the best services for discovery and metadata harvesting; other agencies often have no equivalent, which makes it hard to discover DOIs for those papers. I've spent some time getting this information for Chinese and Taiwanese articles. Here are two examples of articles that are now linked to from the corresponding species page in GBIF:

[Image: Doi1]

[Image: Doi2]

It's all about the links

To reiterate, I believe that one of the key challenges facing biodiversity informatics is cross linking between disparate types of data and sources of information. At the moment most of our data resides in disconnected silos. The links I'm adding to plant names are a small step, but they can lead to all sorts of possibilities. For example, users of GBIF can click on a link and see the original paper. If, for example, GBIF doesn't have a map for the species discussed in that paper, it's likely that the paper will have some information (e.g., the type locality). If users click on the links, then that is going to drive more traffic to the original literature, thus increasing its visibility.

Furthermore, now that we have a taxon identifier (from GBIF) linked to a bibliographic identifier, we can go in the opposite direction. Earlier I proposed a Javascript bookmarklet as a way to augment the information on a web page (see Rethinking annotating biodiversity data). We could have a popup on an article web page that can tell the user about the taxa mentioned in that paper. If GBIF has a map for those taxa, we can immediately place that paper in a geospatial context (e.g., Africa). This is barely scratching the surface of what is possible once we start breaking out of silos and sharing deeply linked data.
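Going in the opposite direction amounts to inverting the taxon-to-paper links into a paper-to-taxa index, which is what a popup on an article page would need to query. Here is a rough sketch of that idea; the taxon identifiers and DOIs are invented placeholders, and this is not how GBIF or the bookmarklet is actually implemented.

```python
from collections import defaultdict

# Hypothetical (taxon identifier, DOI) pairs; all values are invented.
TAXON_TO_DOI = [
    ("taxon-1", "10.9999/EXAMPLE.1"),
    ("taxon-2", "10.9999/example.1"),
    ("taxon-3", "10.9999/example.2"),
]


def invert(pairs):
    """Build a DOI -> list of taxa index from (taxon, DOI) pairs.

    DOIs are case-insensitive, so they are lower-cased before being
    used as keys.
    """
    index = defaultdict(list)
    for taxon, doi in pairs:
        index[doi.lower()].append(taxon)
    return dict(index)


DOI_TO_TAXA = invert(TAXON_TO_DOI)


def taxa_for_article(doi):
    """Return the taxa linked to an article, or an empty list."""
    return DOI_TO_TAXA.get(doi.lower(), [])
```

Given the DOI of the page the user is reading, taxa_for_article returns the taxon identifiers to look up in GBIF, for example to fetch a map for each one.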