Tuesday, August 30, 2016

GRBio: A Call for Community Curation - what community?

David Schindel and colleagues recently published a paper in the Biodiversity Data Journal:

Schindel, D., Miller, S., Trizna, M., Graham, E., & Crane, A. (2016, August 26). The Global Registry of Biodiversity Repositories: A Call for Community Curation. BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.4.e10293

The paper is a call for the community to help grow a database (GRBio) on biodiversity repositories, a database that "will require community input and curation".

Reading this, I'm struck by the lack of a clear sense of what that community might be. In particular: who is this database for, and who is most likely to build it? I suspect that these are two different sets of people.

Who is it for?

It strikes me that the primary target for GRBio is people who care about cleaning up and linking data. This is a very small set of people. While cleaned data is nice, and cleaned and linked data is great, by itself it's not much use until it finds its way into useful tools. Why would taxonomists, curators, and other people working with biodiversity data care about GRBio? What can it give them? Ultimately, we'd like things such as the ability to find any specimen in the world online using simply its museum collection code. We'd like to track usage of that code in other databases, such as GenBank and BOLD, and in the primary literature. These are all nice things, but they won't happen simply because we have a curated list of natural history collections.

Who will curate it?

Arguably there are only two active communities who care enough about the contents of GRBio to contribute at any real scale. One is GBIF, which is building its own registry of collections as more and more natural history collections move their data online. GBIF's registry is primarily for digital access points to collection data, which don't necessarily map neatly onto the physical collections listed by GRBio. If GRBio is to be relevant, it needs to have mappings between its data and GBIF's.

But the community that I suspect will really care about this, to the point that they'd actively engage in editing the data, is not the biodiversity community. Rather, it's people who edit Wikipedia and Wikidata.

I was at one of the GRBio workshops and gave a short presentation, which included this slide:

If you search for a major museum on Google, on the right you will often see a rich "knowledge panel" giving much of the information that GRBio wants to capture (museum name, location, etc.), often with a link to the Wikipedia page for that institution (see http://g.co/kg/m/01t372 for a detailed view of the knowledge panel for the NHM). GRBio can't compete with Wikipedia for richness of content (just think of all the Wikipedia pages in languages other than English). Google's database is mostly hidden, but we can get some of the same data from Wikidata, e.g. the Natural History Museum is entity Q309388 on Wikidata.
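To make "we can get some of the same data from Wikidata" a bit more concrete, here is a minimal Python sketch of looking up an institution by its Wikidata identifier. It assumes Wikidata's public `Special:EntityData` JSON endpoint and its usual `entities` / `labels` layout. The function names and the User-Agent string are my own inventions for illustration, not anything GRBio or Wikidata prescribes.

```python
import json
from urllib.request import Request, urlopen

# Assumption: Wikidata's public per-entity JSON endpoint.
WIKIDATA_ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def entity_url(qid):
    """Build the JSON URL for a Wikidata entity such as 'Q309388'."""
    return WIKIDATA_ENTITY_URL.format(qid=qid)


def english_label(payload, qid):
    """Return the English label from an EntityData payload, or None if absent."""
    entity = payload.get("entities", {}).get(qid, {})
    return entity.get("labels", {}).get("en", {}).get("value")


def fetch_entity(qid):
    """Download and decode the JSON record for one entity (needs network access)."""
    # Hypothetical User-Agent string; Wikimedia asks clients to identify themselves.
    request = Request(entity_url(qid), headers={"User-Agent": "grbio-sketch/0.1"})
    with urlopen(request) as response:
        return json.load(response)
```

Calling `english_label(fetch_entity("Q309388"), "Q309388")` should give the English name Wikidata holds for the NHM. The same record also carries labels in other languages and links to the matching Wikipedia pages, which is exactly the richness a stand-alone registry would struggle to match.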

From my perspective, the smart move is not to appeal to an overstretched community of biodiversity researchers, many of whom are suffering from "project fatigue" as yet more acronyms compete for their attention. Instead, position the project as adding to the already existing database of natural history institutions that is growing in Wikidata. GRBio could either link to Wikidata and Wikipedia pages for institutions, or simply move its data editing efforts to Wikidata and have GRBio be (at most) a search interface to that data. The notion of having a separate database for the world's collections might not be the best way to achieve GRBio's goals. A lot of people involved in both Wikipedia and Wikidata care about cultural institutions (of which natural history museums and herbaria are examples), and those are the people GRBio should be engaging with.