Wednesday, April 10, 2019

Ozymandias: A biodiversity knowledge graph published in PeerJ

My paper "Ozymandias: A biodiversity knowledge graph" has been published in PeerJ https://doi.org/10.7717/peerj.6739
The paper describes my entry in GBIF's 2018 Ebbe Nielsen Challenge, which you can explore here. I tweeted about its publication yesterday, and got some interesting responses (and lots of retweets, thanks to everyone for those).

Carl Boettiger (@cboettig) asked where the triples were, as did Kingsley Uyi Idehen (@kidehen). Doh! This is one thing I should have done as part of the paper. I've uploaded the triples to Zenodo; you can find them here: https://doi.org/10.5281/zenodo.2634326.

Donat Agosti (@myrmoteras) complained that my knowledge graph ignored a lot of available information, which is true in the sense that I restricted it to a core of people, publications, taxa, and taxonomic names. The Plazi project that Donat champions extracts, where possible, lots of detail from individual publications, including figures, text blocks corresponding to taxonomic treatments, and in some cases geographic and specimen information. I have included some of this information in Ozymandias, specifically figures for papers where they are available. For example, Figure 10 from the paper "Australian Assassins, Part I: A review of the Assassin Spiders (Araneae, Archaeidae) of mid-eastern Australia":



This figure illustrates Austrarchaea nodosa (Forster, 1956), and Plazi has a treatment of that taxon: http://treatment.plazi.org/id/1072F469192A5BA015A1AA70A36E2C92. This treatment comprises a series of text blocks extracted from the paper, so there is not a great deal I can do with this unless I want to parse the text (e.g., for geographical coordinates and specimen codes). So yes, there is RDF (see http://treatment.plazi.org/GgServer/rdf/1072F469192A5BA015A1AA70A36E2C92) but it adds little to the existing knowledge graph.
To be fair, some treatments in Plazi are a lot richer, for example http://tb.plazi.org/id/A94487F7E15AFFA5FF682EE9FEB45F2C, which has references, geographical coordinates, and more. What would be useful is an easy way to explore Plazi, for example, if the RDF were dumped into a triple store where we could explore it in more detail. I hope to look into this in the coming weeks.
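As a first step towards loading Plazi RDF into a triple store, here is a minimal Python sketch that turns a treatment URL into its RDF URL. The URL pattern is inferred from the single pair of links quoted above (the /id/ page and the /GgServer/rdf/ document for the same identifier); it is an assumption, not a documented Plazi API, and I have not checked that it holds for every treatment. The function name is my own.

```python
import re

# Base of the RDF URL, copied from the example quoted in this post.
# Assumption: every treatment identifier can be appended to this base.
RDF_BASE = "http://treatment.plazi.org/GgServer/rdf/"

# A treatment identifier is taken to be the 32 hex characters after "/id/".
TREATMENT_ID = re.compile(r"/id/([0-9A-Fa-f]{32})/?$")


def treatment_rdf_url(treatment_url: str) -> str:
    """Return the (assumed) RDF URL for a Plazi treatment page URL."""
    match = TREATMENT_ID.search(treatment_url)
    if match is None:
        raise ValueError(f"No treatment identifier found in {treatment_url!r}")
    return RDF_BASE + match.group(1).upper()


if __name__ == "__main__":
    print(treatment_rdf_url(
        "http://treatment.plazi.org/id/1072F469192A5BA015A1AA70A36E2C92"
    ))
```

A list of URLs produced this way could then be fetched and fed to whatever triple store is at hand.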

Sunday, March 24, 2019

Where is the damned collection? Wikidata, GrBio, and a global list of all natural history collections

One of the things the biodiversity informatics community has struggled to do is come up with a list of all natural history collections (Taylor, 2016). Most recently GrBio attempted to do this, and appealed for community help to curate the list (Schindel et al., 2016), but this help did not emerge, and at the time of writing GrBio is moribund. GBIF has obtained GrBio's data and is now hosting it (GBIF provides new home for the Global Registry of Scientific Collections) but the problem of curation remains. Furthermore, GrBio is not the only contender for a global list of collections: the NCBI has its own list (Sharma et al., 2018).
When Schindel et al. came out I suggested that a better way forward was to use Wikidata as the data store for basic information on collections (see GRBio: A Call for Community Curation - what community?). David Shorthouse's work on linking individual researchers to the specimens they have collected (Bloodhound) has motivated me to revisit this. One of the things David wants to do is link the work of individuals to the institutions that host the specimens they work on. For individuals the identifier of choice is ORCID, and many researchers' ORCID profiles have identifiers for the institution they work at. For example, my ORCID profile https://orcid.org/0000-0002-7101-9767 states that I work at Glasgow University, which has the Ringgold number 3526. What is missing here is a way to go from the institutional identifiers we use for specimens (e.g., abbreviations like "MCZ" for the Museum of Comparative Zoology) to identifiers such as Ringgold that organisations such as ORCID use.
It turns out that many institutions with Ringgold numbers (and other identifiers, such as Global Research Identifier Database or GRID) are in Wikidata. So, if we could map museum codes (institutionCode in Darwin Core terms) to Wikidata, then we can close the loop and have common institutional identifiers for both where individuals are employed and the institutions that house the collections that they work on.
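To make "closing the loop" concrete, here is a rough Python sketch that builds a SPARQL query going from a collection code to the institutional identifiers held in Wikidata. It uses the Index Herbariorum code property (P5858) as the lookup key by default. The property numbers for Ringgold (P3500) and GRID (P2427) are given from memory and should be checked against Wikidata before use, and the function name and the code "ABC" are made up for illustration.

```python
def institution_ids_query(code: str, code_property: str = "P5858") -> str:
    """Build a SPARQL query mapping a collection code to Ringgold and GRID ids.

    code_property defaults to P5858 (Index Herbariorum code).
    P3500 (Ringgold) and P2427 (GRID) are assumptions to verify on Wikidata.
    """
    # Escape backslashes and double quotes so the code is a safe string literal.
    safe = code.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?item ?itemLabel ?ringgold ?grid WHERE {{
  ?item wdt:{code_property} "{safe}" .
  OPTIONAL {{ ?item wdt:P3500 ?ringgold . }}
  OPTIONAL {{ ?item wdt:P2427 ?grid . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


if __name__ == "__main__":
    # "ABC" is a placeholder code, not a real institution.
    print(institution_ids_query("ABC"))
```

The identifiers are wrapped in OPTIONAL so that an institution still appears in the results when it has a collection code but no Ringgold or GRID identifier yet, which is exactly the gap that would need filling.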
Hence, it seems to me that using Wikidata as the basis for a global catalogue of institutions housing natural history collections makes a lot of sense. Many of these institutions are already in Wikidata, and the community of Wikidata editors dwarfs the number of people likely to edit a domain-specific database (as evidenced by the failure of GrBio's call for community engagement with its database). Furthermore, Wikidata has a sophisticated editing interface, with support for multiple languages and for adding the provenance of individual data entries.
To get a sense of what is already in Wikidata I've built a small tool called Where is the damned collection?. It's a simple wrapper around a SPARQL query to Wikidata, and the display is modelled on the "knowledge panel" that you often see to the right of Google's search results. If you type in the acronym for an institution (i.e., the institutionCode) the tool attempts to find it.
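For anyone curious what "a simple wrapper around a SPARQL query" amounts to, here is a minimal Python sketch of the two halves: building the request URL for the Wikidata Query Service, and flattening the standard SPARQL JSON results into plain rows. This is not the code behind the tool, just an illustration; the function names are mine, and the sample response in the usage example is a made-up placeholder (Q0 is not a real institution).

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint to run `query` and reply in JSON."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def rows(response_text: str) -> List[Dict[str, str]]:
    """Flatten a SPARQL JSON response into a list of {variable: value} dicts."""
    data = json.loads(response_text)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


if __name__ == "__main__":
    print(request_url("SELECT * WHERE {}"))
    # Placeholder response in the SPARQL JSON results layout.
    sample = (
        '{"head": {"vars": ["item"]}, "results": {"bindings": '
        '[{"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"}}]}}'
    )
    print(rows(sample))
```

Fetching the URL (with a descriptive User-Agent header) and passing the response body to rows() gives the data needed to draw something like a knowledge panel.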




Here are some more examples:

There are some challenges to using Wikidata for this purpose. To date there has been little in the way of a coordinated effort to add natural history collections. There are 121 institutions that have an Index Herbariorum code (Property P5858) associated with their Wikidata records; you can see a list here. There is also a property for Biodiversity Repository ID which supports the syntax GrBio used to create unique institutionCodes even when multiple institutions used the same code. This has had limited uptake so far, being used on only five Wikidata items.
However, there are more museums and herbaria in Wikidata. For example, if we search for herbaria, natural history museums, and zoological museums we find 387 institutions. This query is made harder than it should be because there are multiple types that can be used to describe a natural history collection and the query only uses three of them.
Another source of entries in Wikidata is Wikispecies. There are two pages (Repositories (A–M) and Repositories (N–Z)) that list pages corresponding to different institutionCodes. I have harvested these and found 1,298 of them in Wikidata. This indicates that a good fraction of the 7,097 institutions listed by GrBio already have a presence in Wikidata. At the same time, it rather complicates the task of adding institutions to Wikidata as we need to figure out how many of these stub-like entries based on institutionCodes represent institutions already in Wikidata. There are also lists of herbaria (https://en.wikipedia.org/wiki/List_of_herbaria) and natural history museums on Wikipedia that can be harvested and cross-referenced with Wikidata.
So, there is a formidable data cleaning task ahead, but I think it's worth contemplating. One thing I find particularly interesting is the links to social media profiles, such as Twitter, Facebook, and Instagram. This gives another perspective on these institutions - in a sense this is digitisation of experiences that one can have at those institutions. These profiles are also often a good source of data (such as geographic location and address). And they give a foretaste of what I think we can do. Imagine the entire digital footprint of a museum or herbarium being linked together in one place: the social media profiles, the digitised collections, the publications for which it is a publisher, its membership in BHL, JSTOR, GBIF, and other initiatives, and so on. We could start to get a better sense of the impact of digitisation - broadly defined - on each institution.
In summary, I think the role of Wikidata in cataloguing collections is worth exploring, and there's a discussion of this idea going on at the GBIF Community Forum. It will be interesting to see where this discussion goes. Meantime, I'm messing about developing some scripts to see how much of the data mapping and cleaning process can be automated, so that tools like Where is the damned collection? become more useful.
References
  • Schindel, D., Miller, S., Trizna, M., Graham, E., & Crane, A. (2016). The Global Registry of Biodiversity Repositories: A Call for Community Curation. Biodiversity Data Journal, 4, e10293. doi:10.3897/bdj.4.e10293
  • Sharma, S., Ciufo, S., Starchenko, E., Darji, D., Chlumsky, L., Karsch-Mizrachi, I., & Schoch, C. L. (2018). The NCBI BioCollections Database. Database, 2018. doi:10.1093/database/bay006
  • Taylor, M. A. (2016). “Where is the damned collection?” Charles Davies Sherborn’s listing of named natural science collections and its successors. ZooKeys, 550, 83–106. doi:10.3897/zookeys.550.10073