Thursday, August 18, 2016

GBIF Challenge: €31,000 in prizes for analysing and addressing gaps and biases in primary biodiversity data

In a classic paper, Boggs (1949) appealed for an “atlas of ignorance”, an honest assessment of what we know we don’t know:

Boggs, S. W. (1949). An Atlas of Ignorance: A Needed Stimulus to Honest Thinking and Hard Work. Proceedings of the American Philosophical Society, 93(3), 253–258. Retrieved from http://www.jstor.org/stable/3143475

This is the theme of this year's GBIF Challenge: Analysing and addressing gaps and biases in primary biodiversity data. "Gaps" can be gaps in geographic coverage, taxonomic group, or type of data. GBIF is looking for ways to assess the nature of the gaps in the data it is aggregating from its network of contributors.

How to enter

Details on how to enter are on the Challenge website; the deadline is September 30th.

Ideas

One approach to gap analysis is to compare what we expect to see with what we actually have. For example, we might take a “well-known” group of organisms and use that to benchmark GBIF’s data coverage. A drawback is that the “well-known” organisms tend to be the usual suspects (birds, mammals, fish, etc.), and there is the issue of whether the chosen group is a useful proxy for other taxa.

Another approach is to base the estimate of ignorance on the data itself. For example, OBIS has computed Hurlbert's index of biodiversity across its database (see http://data.unep-wcmc.org/datasets/16). Can we scale these methods to the 600+ million records in GBIF? There are some clever ways of applying resampling methods (such as the bootstrap) to large data sets that might be relevant, see http://www.unofficialgoogledatascience.com/2015/08/an-introduction-to-poisson-bootstrap_26.html.
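To make this concrete, here is a minimal Python sketch (not OBIS's actual pipeline) of Hurlbert's expected species index ES(n), together with a Poisson bootstrap to attach an uncertainty estimate. The species counts are invented, and the real work would be streaming something like this over GBIF's occurrence table.

```python
import math
import numpy as np

def hurlbert_es(counts, n=50):
    """Hurlbert's ES(n): the expected number of species in a random draw of
    n records, given the number of records per species.
    ES(n) = sum_i [1 - C(N - N_i, n) / C(N, n)], N = total records."""
    counts = [int(c) for c in counts if c > 0]
    N = sum(counts)
    if N < n:
        return float("nan")
    denom = math.comb(N, n)
    return sum(1.0 - math.comb(N - c, n) / denom for c in counts)

def poisson_bootstrap_es(counts, n=50, replicates=200, seed=1):
    """Poisson bootstrap of ES(n): each record gets an independent Poisson(1)
    weight, so a species with c records is resampled as Poisson(c). This
    avoids multinomial resampling and can be computed in a single streaming
    pass per replicate, which is what makes it attractive at GBIF scale."""
    np.random.seed(seed)
    counts = np.asarray(list(counts))
    return [hurlbert_es(np.random.poisson(counts), n) for _ in range(replicates)]

# Toy example: per-species record counts for one grid cell (invented numbers)
cell_counts = {"Aedes aegypti": 120, "Aedes albopictus": 35,
               "Culex quinquefasciatus": 8, "Anopheles gambiae": 2}
es50 = hurlbert_es(cell_counts.values())
reps = poisson_bootstrap_es(cell_counts.values())
print("ES(50) = %.1f, bootstrap 95%% interval (%.1f, %.1f)" %
      (es50, np.nanpercentile(reps, 2.5), np.nanpercentile(reps, 97.5)))
```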

Another approach might be to compare different datasets for the same taxa, particularly if one dataset is not in GBIF. Or perhaps we can compare datasets for the same taxa collected by different methods.
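As a simple illustration of this kind of dataset-to-dataset comparison, one could compare the species lists of a GBIF download and an external checklist; the species names below are purely illustrative.

```python
def compare_checklists(gbif_species, other_species):
    """Compare two sets of species names: overlap, what each side is missing,
    and the Jaccard similarity of the two lists."""
    gbif, other = set(gbif_species), set(other_species)
    union = gbif | other
    return {
        "shared": len(gbif & other),
        "only_in_gbif": sorted(gbif - other),
        "missing_from_gbif": sorted(other - gbif),
        "jaccard": len(gbif & other) / len(union) if union else 0.0,
    }

report = compare_checklists(
    {"Aedes aegypti", "Aedes albopictus", "Culex pipiens"},
    {"Aedes aegypti", "Anopheles gambiae", "Culex pipiens"},
)
print(report["jaccard"], report["missing_from_gbif"])
```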

Or we could look at taxonomic gaps. In an earlier post, The Zika virus, GBIF, and the missing mosquitoes, I noted that GBIF's coverage of the vectors of the Zika virus was very poor. How well does GBIF cover vectors and other organisms relevant to human health? Maybe we could generalise this to explore other taxa. It might, for example, be interesting to compare the degree of coverage for a species with some measure of the "importance" of that species. Measures of importance could be based on, say, the number of hits in Google Scholar for that species, the size of its Wikipedia page (see Wikipedia mammals and the power law), etc.
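Here is a rough sketch of how such a comparison might work, using the GBIF occurrence search API for coverage and the MediaWiki API for Wikipedia page length as a crude proxy for importance. The endpoints and parameters are as I understand them; check the current API documentation before relying on this.

```python
import requests

def gbif_occurrence_count(name):
    """Number of occurrence records GBIF holds for a scientific name."""
    r = requests.get("https://api.gbif.org/v1/occurrence/search",
                     params={"scientificName": name, "limit": 0})
    return r.json().get("count", 0)

def wikipedia_page_length(title):
    """Length (in bytes) of the English Wikipedia page for a title, 0 if absent."""
    r = requests.get("https://en.wikipedia.org/w/api.php",
                     params={"action": "query", "titles": title,
                             "prop": "info", "format": "json"})
    pages = r.json()["query"]["pages"]
    return next(iter(pages.values())).get("length", 0)

# Two Zika vectors from the earlier post, plus a familiar bird for contrast
for species in ["Aedes aegypti", "Aedes albopictus", "Turdus merula"]:
    print(species, gbif_occurrence_count(species), wikipedia_page_length(species))
```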

Gaps might also be gaps in data completeness, quality, or type.
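One simple way to make "completeness" concrete is to ask, for a given taxon, what fraction of its GBIF records are georeferenced. A minimal sketch against the occurrence search API (assuming the hasCoordinate filter behaves as I expect) might look like this:

```python
import requests

GBIF = "https://api.gbif.org/v1/occurrence/search"

def georeferenced_fraction(scientific_name):
    """Fraction of a taxon's GBIF occurrence records that carry coordinates."""
    total = requests.get(GBIF, params={"scientificName": scientific_name,
                                       "limit": 0}).json()["count"]
    georef = requests.get(GBIF, params={"scientificName": scientific_name,
                                        "hasCoordinate": "true",
                                        "limit": 0}).json()["count"]
    return georef / total if total else float("nan")

print(georeferenced_fraction("Aedes aegypti"))
```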

Summary

This post has barely scratched the surface of what is possible. But I think one important thing to bear in mind is that the best analyses of gaps are those that lead to "actionable insights". In other words, if you are going to enter the challenge (and please do, it's free to enter and there's money to be won), how does your entry help GBIF and the wider biodiversity community decide what to do about gaps?