iPhylo: GBIF Challenge 2017: Liberating species records from open data repositories for scientific discovery and reuse

Roderic D. M. Page

Friday, June 16, 2017


GBIF is running its Ebbe Nielsen Challenge for the third successive year. This year the title is "Liberating species records from open data repositories for scientific discovery and reuse". To quote from the Challenge background on Devpost:

This year's Challenge will seek to leverage the growth of open data policies among scientific journals and research funders, which require researchers to make the data underlying their findings publicly available. Adoption of these policies represents an important first step toward increasing openness, transparency and reproducibility across all scientific domains, including biodiversity-related research.

To abide by these requirements, researchers often deposit datasets in public open-access repositories. Potential users are then able to find and access the data through repositories as well as data aggregators like OpenAIRE and DataONE. Many of these datasets are already structured in tables that contain the basic elements of biodiversity information needed to build species occurrence records: scientific names, dates, and geographic locations, among others.

However, the practices adopted by most repositories, funders and journals do not yet encourage the use of standardized formats. This approach significantly limits the interoperability and reuse of these datasets. As a result, the wider reuse of data implied if not stated by many open data policies falls short, even in cases where open licensing designations (like those provided through Creative Commons) seem to encourage it.

In essence, the 2017 Challenge is to develop tools to discover these biodiversity-relevant datasets, and make them available to GBIF. In other words, we want tools to enable us to do this:

As an example of the impact that external data can have on GBIF, last year I wrote a blog post ("The Zika virus, GBIF, and the missing mosquitoes") describing how I took published data (doi:10.1038/sdata.2015.35) from the Dryad repository and added it to GBIF. The effect was dramatic:

Before

After

This is just one example. I suspect that there is a lot of biodiversity data gathering digital dust in repositories that could be more widely reused if we just had the tools to discover it and convert it into a form that GBIF can use. Prove me right, and win cash prizes! Details at https://gbif2017.devpost.com.
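To give a flavour of what the "convert" step might involve, here is a minimal sketch in Python. It assumes a repository dataset arrives as a CSV table, guesses which columns hold the basic occurrence elements mentioned in the Challenge background (scientific names, dates, geographic locations), and renames them to the corresponding Darwin Core terms. The Darwin Core term names are real, but the alias list and the function names (guess_mapping, looks_like_occurrences, to_darwin_core) are hypothetical and purely illustrative. Real datasets will be far messier than this.

```python
import csv
import io

# Hypothetical lookup from common column headings to Darwin Core terms.
# The term names are real Darwin Core terms; the aliases are illustrative only.
HEADER_ALIASES = {
    "scientificName": {"species", "scientific name", "scientific_name", "taxon"},
    "eventDate": {"date", "collection date", "date collected"},
    "decimalLatitude": {"lat", "latitude"},
    "decimalLongitude": {"lon", "long", "lng", "longitude"},
}


def guess_mapping(headers):
    """Return {original header: Darwin Core term} for headers we recognise."""
    mapping = {}
    for header in headers:
        key = header.strip().lower()
        for term, aliases in HEADER_ALIASES.items():
            if key == term.lower() or key in aliases:
                mapping[header] = term
                break
    return mapping


def looks_like_occurrences(mapping):
    """Crude discovery test: do we have a name and a pair of coordinates?"""
    required = {"scientificName", "decimalLatitude", "decimalLongitude"}
    return required <= set(mapping.values())


def to_darwin_core(csv_text):
    """Convert CSV text to a list of records keyed by Darwin Core terms.

    Returns an empty list if the table does not look like occurrence data.
    Columns that are not recognised are dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = guess_mapping(reader.fieldnames or [])
    if not looks_like_occurrences(mapping):
        return []
    return [
        {mapping[column]: value for column, value in row.items() if column in mapping}
        for row in reader
    ]


if __name__ == "__main__":
    # A made-up row, not taken from any real dataset.
    sample = "Species,Date,Lat,Long,Notes\nAedes aegypti,2014-05-01,-3.1,-60.0,example\n"
    for record in to_darwin_core(sample):
        print(record)
```

Matching on column headings alone is the weakest possible approach. A serious entry would presumably also look at the values themselves (do they parse as coordinates, dates, or taxonomic names?) and at the metadata the repository provides.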