Tuesday, July 18, 2023

What, if anything, is the Biodiversity Knowledge Hub?

How to cite: Page, R. (2023). What, if anything, is the Biodiversity Knowledge Hub? https://doi.org/10.59350/axeqb-q2w27

To much fanfare BiCIKL launched the “Biodiversity Knowledge Hub” (see Biodiversity Knowledge Hub is online!!!). This is advertised as a “game-changer in scientific research”. The snappy video in the launch tweet claims that the hub will

  • it will help your research thanks to interlinked data…
  • …and responds to complex queries with the services provided…

Interlinked data, complex queries, this all sounds very impressive. The video invites us to “Vist the Biodiversity Knowledge Hub and give it a shot”. So I did.

The first thing that strikes me is the following:

Disclaimer: The partner Organisations and Research Infrastructures are fully responsible for the provision and maintenance of services they present through BKH. All enquiries about a particular service should be sent directly to its provider.

If the organisation trumpeting a new tool takes no responsibility for that tool, then that is a red flag. To me it implies that they are not taking this seriously, they have no skin in the game. If this work mattered you’d have a vested interest in seeing that it actually worked.

I then tried to make sense of what the Hub is an what it offers.

Is it maybe an aggregation? Imagine diversity biodiversity datasets linked together in a single place so that we could seamlessly query across that data, bouncing from taxa to sequences to ecology and more. GBIF and GenBank are examples of aggregations here data is brought together, cleaned, reconciled, and services built on top of that. You can go to GBIF and get distribution data for a species, you can go to GenBank and compare your sequence with millions of others. Is the Hub an aggregation?.. no, it is not.

Is it afederation? Maybe instead of merging data from multiple sources, that data lives on the original sites, but we can query across it a bit like a travel search engine queries across multiple airlines to find us the best flight. The data still needs to be reconciled, or at least share identifiers and vocabularies. Is the Hub a federation?.. no, it is not.

OK, so maybe we still have data in separate silos, but maybe the Hub is a data catalogue where we can search for data using text terms (a bit like Google’s Dataset Search)? Or even better, maybe it describes the data in machine readable terms so that we could find out what data are relevant to our interests (e.g., what data sets deal with taxa and ecological associations base don sequence data?). Is it a data catalogue? … no, it is not.

OK, then what actually is it?

It is a list. They built a list. If you go to FAIR DATA PLACE you see an invitation to EXPLORE LINKED DATA. Sounds inviting (“linked data, oohhh”) but it’s a list of a few projects: ChecklistBank, e-Biodiv, LifeBlock, OpenBiodiv, PlutoF, Biodiversity PMC, Biotic Interactions Browser, SIBiLS SPARQL Endpoint, Synospecies, and TreatmentBank.

These are not in any way connected, they all have distinct APIs, different query endpoints, speak different languages (e.g., REST, SPARQL), and there’s no indication that they share identifiers even if they overlap in content. How can I query across these? How can I determine whether any of these are relevant to my interests? What is the point in providing SPARQL endpoints (e.g., OpenBiodiv, SIBiLS, Synospecies) without giving the user any clue as to what they contain, what vocabularies they use, what identifiers, etc.?

The overall impression is of a bunch of tools with varying levels of sophistication stuck together on a web page. This is in no way a “game-changer”, nor is it “interlinked data”, nor is there any indication of how it supports “complex queries”.

It feels very much like the sort of thing one cobbles together as a demo when applying for funding. “Look at all these disconnected resources we have, give us money and we can join them together”. Instead it is being promoted as an actual product.

Instead of the hyperbole, why not tackle the real challenges here? At a minimum we need to know how each service describes data, those services should use the same vocabularies and identifiers for the same things, be able to tell us what entities and relationships they cover, and we should be able to query across them. This all involves hard work, obviously, so let’s stop pretending that it doesn’t and do that work, rather than claim that a list of web sites is a “game-changer”.

Written with StackEdit.