Monday, August 20, 2018

GBIF Challenge Entry: Ozymandias

I've submitted an entry for the 2018 GBIF Ebbe Nielsen Challenge. It's a couple of weeks before the deadline, but I will be away then, so I have decided to submit early.

My entry is Ozymandias - a biodiversity knowledge graph. The name is a play on "Oz" being a nickname for Australia (much of the data for the entry comes from Australia) and on "Ozymandias", a poem about hubris; attempting to link biodiversity data requires a certain degree of hubris.

The submission process for the challenge is unfortunately rather opaque compared to previous years when entries were visible to all, so participants could see what other people were submitting, and also knew the identity of the judges, etc. In the spirit of openness here is my video summarising my entry:

Ozymandias - GBIF Challenge Entry from Roderic Page on Vimeo.

There is also a background document here:

I suspect this entry is not at all what the challenge is looking for, but I've used the challenge as a deadline so that I get something out the door rather than endlessly tweaking a project that only I can see. There will, of course, be endless tweaking as I explore further ways to link data, but at least this way there is something people can look at. Now I need to spend some time writing up the project, which will require yet more self-discipline to avoid the endless tweaking.

Friday, August 17, 2018

Ozymandias demo

I've made a video walkthrough of Ozymandias, which I described in this post. It's a bit, um, long, so I'll need to come up with a shorter version.

Ozymandias - a biodiversity knowledge graph from Roderic Page on Vimeo.

Friday, August 10, 2018

Ozymandias: a biodiversity knowledge graph of Australian taxa and taxonomic publications

In the spirit of release early and release often, here is the first workable version of a biodiversity knowledge graph that I've been working on for Australian animals (for some background on knowledge graphs see Towards a biodiversity knowledge graph now in RIO). The core of this knowledge graph is a classification of animals from the Atlas of Living Australia (ALA) combined with data on taxonomic names and publications from the Australian Faunal Directory (AFD). This has been enhanced by adding lots of digital identifiers (such as DOIs) to the publications and, where possible, full text either as PDFs or as page scans from the Biodiversity Heritage Library (BHL) (provided via BioStor). Identifiers enable us to further grow the knowledge graph, for example by adding "cites" and "cited by" links between publications (data from CrossRef), and displaying figures from the Biodiversity Literature Repository (BLR).
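As a rough illustration of how "cites" links could be derived from CrossRef, here is a minimal sketch in Python. It assumes the public CrossRef REST API, where a work record carries a "reference" list whose entries may include a "DOI" field. The function names and the triple layout are hypothetical, and this is not necessarily how Ozymandias itself harvests the data.

```python
import json
import urllib.parse
import urllib.request

# Public CrossRef REST API endpoint for a single work.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_work(doi):
    """Fetch the CrossRef record for a DOI (makes a network call)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["message"]


def cited_dois(work):
    """Return the DOIs found in a work record's reference list.

    References without a DOI (e.g. unstructured strings) are skipped.
    """
    return [ref["DOI"].lower() for ref in work.get("reference", []) if "DOI" in ref]


def cites_links(doi, work):
    """Build hypothetical (citing, "cites", cited) tuples for the graph."""
    return [(doi.lower(), "cites", cited) for cited in cited_dois(work)]


# Offline example using a hand-made record (the DOIs are made up).
sample_work = {
    "reference": [
        {"key": "ref1", "DOI": "10.1234/ABC.1"},
        {"key": "ref2", "unstructured": "A reference without a DOI"},
        {"key": "ref3", "DOI": "10.1234/abc.2"},
    ]
}
links = cites_links("10.1234/Example", sample_work)
```

Only references that CrossRef has matched to a DOI produce a link, so coverage depends on what publishers have deposited. "Cited by" links could then be obtained by reversing these tuples across the set of harvested publications.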

The demo is here: If you’re looking for starting points, you could try:

Assassin spiders (images from Plazi and citation data from CrossRef)

Memoirs of Museum Victoria (dynamic query finds record in Wikidata and adds map)

G. R. Allen (we can see from the taxonomic tree of his top 20 taxa that he studies fish - who knew?)

Paper on mosquito taxonomy with lots of citations, including material in BHL/BioStor

Paper on Australian flies with full text in BioStor

The focus for now is on taxa, publications, journals, and people. Occurrences and sequences are on the “to do” list. As always there’s lots of data cleaning and cross linking to do, but an obvious next step is to link people’s names to identifiers such as ORCID and Wikidata ids, so that we can trace the activities of taxonomists as they discover and describe Australian biodiversity (the choice of Australia is simply to keep things manageable, and because the amount of data and digitisation they’ve done is pretty extraordinary). I’m also working to a deadline as I'm trying to get this demo wrapped up in the next couple of weeks.

Technical details

TL;DR: the knowledge graph is implemented as a triple store where the data has been represented using a small number of vocabularies (mostly with some terms borrowed from TAXREF-LD and the TDWG LSID vocabularies). Everything displayed in the first two panels is the result of SPARQL queries; the content in the rightmost panel comes from calls to external APIs. Search is implemented using Elasticsearch. If you are feeling brave you can query the knowledge graph directly in SPARQL. I’m constantly tweaking things and adding data and identifiers, so things are likely to break. More details and documentation will be going up on the GitHub repository.
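For anyone feeling brave, querying a triple store over the standard SPARQL protocol might look something like the sketch below. The endpoint URL is a placeholder, and the schema.org class and property in the query are assumptions made for illustration. Check the documentation for the actual endpoint and data model.

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint: substitute the real SPARQL endpoint URL.
ENDPOINT = "https://example.org/sparql"


def build_query(limit=10):
    """Build a SELECT query listing works and their names.

    The schema.org terms used here are an assumption about the data model.
    """
    return (
        "PREFIX schema: <http://schema.org/>\n"
        "SELECT ?work ?name WHERE {\n"
        "  ?work a schema:ScholarlyArticle ;\n"
        "        schema:name ?name .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request(query, endpoint=ENDPOINT):
    """Prepare a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return flattened rows (makes a network call)."""
    with urllib.request.urlopen(build_request(query, endpoint)) as resp:
        return bindings_to_rows(json.load(resp))


# Offline example: flatten a hand-made result document (values are made up).
sample_results = {
    "head": {"vars": ["work", "name"]},
    "results": {
        "bindings": [
            {
                "work": {"type": "uri", "value": "https://example.org/work/1"},
                "name": {"type": "literal", "value": "An example paper"},
            }
        ]
    },
}
rows = bindings_to_rows(sample_results)
```

Calling run_query(build_query(5)) would send the request. The offline example only shows how the JSON results format maps to plain dictionaries.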