Friday, April 15, 2016

The Zika virus, GBIF, and the missing mosquitoes

One of GBIF's goals is to provide up-to-date, comprehensive data on the distribution of species. Although GBIF's taxonomic and geographic scope is global, not all species are equal, in the sense that the need for information on some species is potentially much more pressing. An example is mosquitoes of the genus Aedes, such as A. aegypti and A. albopictus, which spread the Zika virus.

Over the last few days I discovered how poor GBIF's coverage of these two vectors is, and a way to fix that gap quickly. Like many things I work on, I stumbled across the problem by accident. GBIF has released a report on whether GBIF data are fit for modeling species distributions. The publicity material included a psychedelic image showing a map for Aedes aegypti from a recent eLife paper by Kraemer et al. (The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus).

[Image: detail of the global Aedes aegypti distribution map from Kraemer et al. 2015]

Curious, I read the paper and found that "GBIF" occurs only once in the text:

we selected 10,000 occurrence records of Aedes species from the Global Biodiversity Information Facility, omitting all records of Ae. aegypti and Ae. albopictus. This dataset is intended to reflect biases in mosquito reporting in areas which are suitable for Aedes mosquitoes.

So, GBIF data on these two mosquitoes weren't used. A quick look at what GBIF had for Aedes albopictus shows why GBIF data played such a small role:


Compare this with the data shown in the Scientific Data paper describing the data that underpins the eLife paper.

[Image: figure 3 from the Scientific Data paper (Kraemer et al. 2015)]

Note the striking lack of any GBIF records from Brazil. Fortunately the data collected by Kraemer et al. are freely available in Dryad, so I grabbed the files, fussed about with them a bit to get them into the format required by GBIF, and uploaded them. Below is the data for Aedes albopictus in GBIF:

[Image: updated GBIF map for Aedes albopictus]

This is looking more like it! If you are more interested in Aedes aegypti then that data is also available.
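
For anyone wanting to do something similar, the "fussing about" step is mostly a matter of renaming columns to Darwin Core terms and dropping rows that can't be mapped. Below is a minimal sketch of that idea, not the script I actually used. The input column names (VECTOR, OCCURRENCE_ID, X, Y, YEAR, COUNTRY), the sample rows, the "kraemer2015" identifier prefix, and the choice of basisOfRecord value are all assumptions made for illustration; check them against the real Dryad files and GBIF's requirements before relying on them.

```python
import csv
import io

# Invented sample rows with hypothetical column names; the real Dryad
# files may use different headers and contain more fields.
SAMPLE = """VECTOR,OCCURRENCE_ID,X,Y,YEAR,COUNTRY
Aedes albopictus,1,-46.63,-23.55,2010,Brazil
Aedes aegypti,2,100.5,13.75,2005,Thailand
Aedes albopictus,3,,,2011,Italy
"""


def to_darwin_core(row, dataset_prefix="kraemer2015"):
    """Map one source row to a dict keyed by Darwin Core term names.

    Returns None if the coordinates are missing or out of range.
    The dataset_prefix is a hypothetical way to make identifiers unique.
    """
    try:
        longitude = float(row["X"])
        latitude = float(row["Y"])
    except (KeyError, ValueError):
        return None
    if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
        return None
    return {
        "occurrenceID": f"{dataset_prefix}:{row['OCCURRENCE_ID']}",
        "scientificName": row["VECTOR"].strip(),
        "decimalLatitude": latitude,
        "decimalLongitude": longitude,
        "year": row["YEAR"].strip(),
        "country": row["COUNTRY"].strip(),
        # Assumed value; pick whichever basisOfRecord suits the data.
        "basisOfRecord": "HumanObservation",
    }


def convert(text):
    """Convert CSV text to a list of Darwin Core style records."""
    reader = csv.DictReader(io.StringIO(text))
    mapped = (to_darwin_core(row) for row in reader)
    return [record for record in mapped if record is not None]


def to_csv(records):
    """Serialise the converted records back to CSV text."""
    if not records:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(convert(SAMPLE)))
```

The third sample row has no coordinates, so it is silently dropped; in practice you would probably want to log such rows rather than lose them.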


This example raises a number of questions:

  1. How come GBIF had such poor data to start with? If GBIF is going to be relevant to people who need biodiversity data, in some cases urgently, then there's an argument to be made that GBIF should be targeting species such as disease vectors that are likely to be in demand in the future.
  2. Why wasn't the latest data in GBIF? One reason GBIF's data was poor is that the relevant data was widely scattered in the literature (Kraemer et al. list over 1000 papers that they looked at, not including the unpublished sources). This clearly requires a lot of effort to assemble. But once assembled, why wasn't it deposited in GBIF? Is it a case of researchers not thinking this would be a useful thing to do, or not knowing how to do it?
  3. What about all the other data out there? This particular example was prompted by me wondering what that hideous image on the GBIF post was, reading the eLife article, wondering where the data was, and having sufficient access to GBIF to simply upload the data. This is clearly not a scalable approach. How can we improve this process? Can we automate harvesting relevant data from repositories such as Dryad so that this data gets fed into GBIF automatically? Should GBIF become a data repository itself so authors can store their data there? And how do we retrospectively harvest all the rest of the data languishing in the scientific literature?

Side note

One aspect of the Kraemer et al. data I've not focussed on is that it is derived from the literature, most of it unpublished, but some of it in the primary literature (the list of papers is missing from the Dryad repository, but I obtained a copy from Moritz Kraemer (@MOUGK) and it's now on GitHub). This means we can link individual occurrence records back to the evidence for that occurrence (i.e., the paper that made the assertion that this species of mosquito is found at this locality). In turn, we can (a) provide provenance for the data, and (b) give credit to the authors of that observation. I hope to explore this topic in a subsequent blog post.
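
To make the linking idea concrete, here is a rough sketch of how occurrence records could be joined to a list of source papers. It assumes each occurrence carries a hypothetical source_id field and that the reference list can be loaded as a mapping from that id to a citation string; neither is necessarily how the real files are organised. The citation is stored under the Darwin Core term associatedReferences, and the placeholder citations are invented.

```python
from collections import Counter


def attach_references(occurrences, references):
    """Return copies of the occurrences with a citation attached.

    occurrences: list of dicts, each with a hypothetical 'source_id' key.
    references: dict mapping source_id to a citation string.
    Records whose source is not in the list are returned unchanged.
    """
    linked = []
    for record in occurrences:
        enriched = dict(record)  # copy so the input is left untouched
        citation = references.get(record.get("source_id"))
        if citation is not None:
            enriched["associatedReferences"] = citation
        linked.append(enriched)
    return linked


def credit_by_reference(linked):
    """Count how many occurrence records each cited source supports."""
    return Counter(
        record["associatedReferences"]
        for record in linked
        if "associatedReferences" in record
    )


if __name__ == "__main__":
    # Invented placeholder data for illustration only.
    references = {"s1": "Placeholder citation A", "s2": "Placeholder citation B"}
    occurrences = [
        {"occurrenceID": "demo:1", "source_id": "s1"},
        {"occurrenceID": "demo:2", "source_id": "s1"},
        {"occurrenceID": "demo:3", "source_id": "s2"},
        {"occurrenceID": "demo:4", "source_id": "unknown"},
    ]
    linked = attach_references(occurrences, references)
    for citation, count in credit_by_reference(linked).items():
        print(citation, count)
```

The per-citation counts are one simple way to express credit: they say how many occurrence records rest on each source.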


Kraemer, M. U. G., Sinka, M. E., Duda, K. A., Mylne, A., Shearer, F. M., Brady, O. J., … Hay, S. I. (2015, July 7). The global compendium of Aedes aegypti and Ae. albopictus occurrence. Scientific Data. Nature Publishing Group.

Kraemer, M. U. G., Sinka, M. E., Duda, K. A., Mylne, A., Shearer, F. M., Brady, O. J., … Hay, S. I. (2015). Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence. Dryad Digital Repository.

Kraemer, M. U., Sinka, M. E., Duda, K. A., Mylne, A. Q., Shearer, F. M., Barker, C. M., … Hay, S. I. (2015, June 30). The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus. eLife. eLife Sciences Organisation, Ltd.