iPhylo: April 2020

Roderic D. M. Page

Monday, April 20, 2020

Making sense of how Wikidata models taxonomy

Given my renewed enthusiasm for Wikidata, I'm trying to get my head around the way that Wikidata models biological taxonomy. As a first pass, here's a diagram of the properties linked to a taxonomic name. The model is fairly comprehensive: it includes relationships between names (e.g., basionym, protonym, replacement), between taxa (e.g., parent taxon), and links to the literature. It's also a complex model to query, given that a lot of information is expressed using qualifiers. Hence there's a bit of head scratching while I figure out the relationship between properties, statements, etc.
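To make the point about qualifiers concrete, here is a minimal sketch of a query that reads a taxon name statement together with two of its qualifiers. It assumes the properties involved are P225 (taxon name), P405 (taxon author) and P574 (year of publication of the name), and that the query would be sent to the public endpoint at https://query.wikidata.org/sparql. The function name `taxon_name_query` is made up for illustration, and `Q12345` is just a placeholder item id, not a particular taxon.

```python
import re


def taxon_name_query(qid: str) -> str:
    """Build a SPARQL query for an item's taxon name statements and qualifiers."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata item id: {qid!r}")
    # Qualifiers hang off the statement node (p:/ps:/pq:), so the simple
    # "truthy" wdt: shortcut is not enough to reach them.
    return f"""
SELECT ?name ?author ?year WHERE {{
  wd:{qid} p:P225 ?statement .                 # statement node for "taxon name"
  ?statement ps:P225 ?name .                   # the name string itself
  OPTIONAL {{ ?statement pq:P405 ?author . }}  # qualifier: taxon author
  OPTIONAL {{ ?statement pq:P574 ?year . }}    # qualifier: year the name was published
}}
""".strip()


if __name__ == "__main__":
    print(taxon_name_query("Q12345"))  # placeholder id
```

The extra hop through `?statement` is what makes these queries feel more awkward than a plain property lookup.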

Links to the literature are one of my interests, and in cases where Wikidata has this information we can start to enhance the way we display publications, e.g.:

Wow, that’s great! Hope it wasn’t too tedious a slog. Oh, and I saw this list of cicadas linked to a Fauna of NZ publication https://t.co/goIG9Lr7Gs - I’m assuming you made those links? Nice example of the potential to enhance publications on @Wikidata pic.twitter.com/iDsv4YgnkF
— Roderic Page (@rdmpage) April 19, 2020

The Wikidata model is very like that used in Darwin Core, where everything is a taxon and every taxon has a name, which means that relationships that are notionally between names and not taxa (e.g., basionym) are all treated as relationships between taxa.

One big challenge is how to interpret Wikidata as a classification, given that we expect classifications to be trees. The taxonomic classification in Wikidata is clearly not a tree, for example:

Hmmm, so @wikidata has a rather *complicated* biological taxonomy that is certainly not a tree. Here is the parent - child structure for the frog family Leptodactylidae. Instead of a single path from tip to root, we have all sorts of detours #crowdsourced pic.twitter.com/zu03KvLgnG
— Roderic Page (@rdmpage) April 4, 2020

What I think is happening here is that different people are adding different parent taxa, depending on which classification they follow. Some classifications (e.g., that used by GBIF) are "shallow" with only a few levels (e.g., kingdom, phylum, class, order, family, genus), other classifications are deep (e.g., NCBI). So the idea of simply being able to do a SPARQL query and get a tree (e.g. Displaying taxonomic classifications from Wikidata using d3js and SPARQL) runs into problems. But this could also be a strength, particularly if we had a reference or source for each parent-child pair. That way we could (a) store multiple classifications in Wikidata, and (b) have queries that retrieve classifications according to a particular source (e.g., GBIF).
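As a rough sketch of what (b) might look like, the code below builds a query for the parent taxon (P171) statements below a given root, optionally keeping only those whose reference is "stated in" (P248) a particular source. It also includes a small check that lists taxa with more than one parent, which is exactly what stops the result being a tree. The function names are invented for this example, the item ids are placeholders supplied by the caller, and it assumes that parent statements actually carry references, which is the "if" in the paragraph above.

```python
from collections import defaultdict


def parent_taxon_query(root_qid, source_qid=None):
    """Build a SPARQL query for child/parent pairs among descendants of root_qid."""
    if source_qid is None:
        # No source given: return every parent, plus its source when there is one.
        select = "?child ?parent ?source"
        reference = ("OPTIONAL { ?statement prov:wasDerivedFrom ?ref . "
                     "?ref pr:P248 ?source . }")
    else:
        # Source given: keep only parent statements referenced to that source.
        select = "?child ?parent"
        reference = (f"?statement prov:wasDerivedFrom ?ref . "
                     f"?ref pr:P248 wd:{source_qid} .")
    return "\n".join([
        f"SELECT {select} WHERE {{",
        f"  ?child wdt:P171+ wd:{root_qid} .",   # all descendants of the root
        "  ?child p:P171 ?statement .",          # each parent taxon statement
        "  ?statement ps:P171 ?parent .",
        f"  {reference}",
        "}",
    ])


def taxa_with_multiple_parents(pairs):
    """Return {child: sorted parents} for every child that has more than one parent."""
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


if __name__ == "__main__":
    # Toy data: a genus placed directly in a family AND in a subfamily.
    pairs = [("genus", "family"), ("genus", "subfamily"),
             ("subfamily", "family"), ("family", "order")]
    print(taxa_with_multiple_parents(pairs))
```

If the second function returns anything for a given source, then even that one source's classification is not a tree in Wikidata, and a viewer would have to pick a path.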

So, lots of potential, but lots I've still to learn.

Friday, April 17, 2020

A planetary computer for Earth

Came across Microsoft's announcement of "A planetary computer for a sustainable future through the power of AI", complete with a glossy video featuring Lucas Joppa @lucasjoppa (see also @Microsoft_Green and #AIforEarth).

On the one hand it's great to see super smart people with lots of resources tackling important questions, but it's hard to escape the feeling that this is the classic technology company approach of framing difficult problems in ways that match the solutions they have to offer. Is the reason that biodiversity is declining simply that we have lacked computational resources, that our AI isn't good enough? And while forests that have been stripped of both their megafauna and previous human inhabitants make for photogenic backdrops, biodiversity can be a lot messier (and more dangerous). Still, it will be interesting to see how this plays out, and what sort of problems the planetary computer is used to tackle.

Monday, April 13, 2020

Wikidata and the bibliography of life in the time of coronavirus

I haven't posted on iPhylo for a while, and since my last post back in January things have obviously changed quite a bit. In late January and early February I was teaching a course on biodiversity informatics, and students discovered the Johns Hopkins coronavirus dashboard, which seemed like a cool way to display information on a situation that was happening on the other side of the world. All fairly abstract.

Today the dashboard looks rather different, and things are no longer quite so abstract (and, of course, never were for the population of Wuhan).

At the same time as the pandemic is affecting so many lives (and taking those of people who had a big impact on my childhood), there is the extraordinary case of open access champion Jon Tennant (@protohedgehog). On April 8th I received an item from his email newsletter entitled Converting adversity into productivity, detailing how he'd managed to get through a traumatic period prior to coronavirus, and how productive he had managed to be (his email lists a whole slew of articles he'd written). The next day, this:

@Protohedgehog
I am deeply sad to announce that at 1am today Jon was reported to have been in a motorbike accident in Bali and has tragically died. We are so sad to loose someone so special to us. Thank you to everyone who has been a good friend to him, we will miss him terribly
— Rebecca tennant (@Rebeccatennan10) April 9, 2020

The day before, this happened:

Oh, this is so terribly sad. Norm was my hero. https://t.co/Zcz5CpZ7S2
— David Shorthouse (@dpsSpiders) April 8, 2020

Times like this tend to focus the mind, and for anyone with research skills the question arises "what should I be doing?". Some people are addressing issues directly or indirectly related to the pandemic. It feels like every second post on Medium features someone playing data scientist with coronavirus data. Others are taking existing tools and projects and looking for ways to make them relevant to the problem, such as Plazi and Pensoft seeking to improve access to the biology of coronavirus hosts, as part of their broader mission to make biodiversity information more accessible.

Plazi and @Pensoft join forces to let #biodiversity knowledge of #coronaviruses hosts out https://t.co/gVUIOS6GEV https://t.co/XUAdTRuSp1
— plazi (@plazi_ch) April 10, 2020

Another approach, in some ways what Jon Tennant did, is to use the time to focus on what you think matters and work on that. Of course, this assumes that you are fortunate enough to have the time and resources to do that. I have tenure and my children are grown up; life would be very different without a salary or with small children or other dependents.

One of the things I am increasingly focussing on is the idea of Wikidata as the "bibliography of life". Specifically, I want to get as much taxonomic and related literature as possible into Wikidata, and to link as much of it as possible to freely-available versions of that literature (e.g., on Internet Archive). I want that literature embedded in the citation graph, linked to authors, and linked to the taxa treated in those papers. A lot of literature is already going into Wikidata via bots that consume the stream of papers with CrossRef DOIs and upload their details to Wikidata, but there is a huge corpus of literature that this approach overlooks. Not only do we have digital libraries like the Biodiversity Heritage Library and JSTOR, but there is a long tail of small publishers making taxonomic literature available online, and I want this to all be equally discoverable.
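Before adding a reference, it helps to check whether one of those bots has already created an item for it. The sketch below is one way to phrase that check, assuming the DOI is stored in property P356 and, as far as I can tell from the property's constraints, in upper case. The function names are my own, and the DOI in the example is invented.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi: str) -> str:
    """Strip a resolver or 'doi:' prefix and upper-case the rest."""
    doi = doi.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.upper()


def work_by_doi_query(doi: str) -> str:
    """Build a SPARQL query that finds items whose DOI (P356) matches."""
    value = normalise_doi(doi)
    # Escape backslashes and quotes so the DOI is a valid SPARQL string literal.
    value = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?work WHERE {{ ?work wdt:P356 "{value}" . }}'


if __name__ == "__main__":
    print(work_by_doi_query("https://doi.org/10.1234/abc.def"))  # made-up DOI
```

For the long tail of literature that has no DOI at all, a check like this obviously doesn't help, and matching has to fall back on things like journal, volume, and page.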

One aspect of this project is to populate Wikidata with this missing literature. Over the years as part of projects such as BioNames and BioStor I have accumulated hundreds of thousands of bibliographic references. These aren't much use sitting on my hard drive. Adding them to Wikidata makes them more accessible, and also enables others to make them much richer. For example, the irrepressible @SiobhanLeachman regularly converts author strings to author things:

Cool! Bring it. 😁
— Siobhan (@SiobhanLeachman) April 1, 2020

Adding things to Wikidata is fun, but it can be a struggle to get a sense of what is in Wikidata and how it is interconnected. So I've started to build a simple app that helps show me people, publications, journals, and taxa in a fairly conventional way, all powered by Wikidata. The app is live at https://alec-demo.herokuapp.com. It is not going to win any prizes for performance or design, but I find it useful.

Partly I'm trying to make the original articles more accessible, e.g.:

Here's another example, this time a @Naturalis_Sci journal with a first sentence of the abstract and full text displayed (with links back to source) https://t.co/wk5QmLrLvJ pic.twitter.com/SKqGSDWupF
— Roderic Page (@rdmpage) April 6, 2020

I'm keen to link taxonomists to their publications and ultimately the taxa they work on:

I've doubled the number of Norman Platnick's publications in @Wikidata, mostly by adding articles from @BritishSpiders and @AAS_arachnology journals. Profile here https://t.co/7cYe7wcLGh, his @WDScholia page also looking better https://t.co/TxXEUn7FHC pic.twitter.com/1ajZegn5uJ
— Roderic Page (@rdmpage) April 10, 2020

And we can link taxa and publications visually:

I'm slowly getting my head around the way @wikidata models taxonomic names, so that I can link publications to taxa and vice versa, e.g. Sulawesimetopus henryi https://t.co/048knNANjs and https://t.co/jjSX98aTW7 pic.twitter.com/XyrRXQw6sN
— Roderic Page (@rdmpage) April 8, 2020

The community-based, somewhat chaotic consensus-driven approach of Wikidata can be frustrating ("well, if you'd asked ME, I wouldn't have done it that way"), but I think it's time to accept that this is simply the nature of the beast, and marvel at the notion that we have a globally accessible and editable knowledge graph. We can stay in our domain-specific silos, where we can control things but remain bereft of both users and contributors. However, if we are willing to let go of that control, and accept that things won't always be done the way we think would be optimal, there is a lot of freedom to be gained by deferring to Wikidata's community decisions and simply getting on with building the bibliography of life. Maybe that is something worthwhile to do in this time of coronavirus.