Thursday, June 28, 2012

Where is the "crowd" in crowdsourcing? Mapping EOL Flickr photos

In any discussion of data gathering or data cleaning, the term "crowdsourcing" inevitably comes up. An example where this approach has been successful is the Encyclopedia of Life's Flickr pool, where Flickr users upload images that are harvested by EOL.

Given that many Flickr photos are taken with cameras that have built-in GPS (such as the iPhone, the most common camera on Flickr), we could potentially use the Flickr photos not only as a source of images of living things, but also to supplement existing distributional data. For example, Flickr has enough data to fairly accurately construct outlines of countries, cities, and neighbourhoods (see The Shape of Alpha), so what about organismal distribution?

This question is part of a Master's project by Jonathan McLatchie here at Glasgow, comparing distributions of taxa in GBIF with those based on Flickr photos. As part of that project the question arose "where are the Flickr photos being taken?" If most of the photos are being taken in the developed world, then there are at least two problems. The first is the obvious bias against organisms that live elsewhere (i.e., relatively few photos will be taken in the regions where you'd actually like to get more data). The second is that the presence of zoos, wildlife parks, and botanical gardens means you are likely to get images of organisms well outside their natural range.

Jonathan suggested a "heatmap" of the Flickr photos would help, so to create this I wrote a script to grab metadata for the photos from the Encyclopedia of Life's Flickr pool, extract latitude and longitude, and draw the resulting locations on a map. I aggregated the points into 1°×1° squares, and generated a GBIF-style map of the photos:


Lots of photos from North America, Europe, and Australasia, as one might expect. Coverage of the rest of the globe is somewhat patchy. I guess the key question to ask is the extent to which the "crowd" (Flickr users in this case) is essentially replicating the sampling biases already present in projects like GBIF that aggregate data from museum collections (most of which are in the developed world).

The PHP code to fetch the photo data and create the map is available on GitHub. You'll need a Flickr API key to run the script. The GitHub repository has an SVG version of the map (with a bitmap background). A bitmap copy of the map is available on FigShare.
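
For anyone who just wants the gist of the aggregation step, here is a minimal sketch of binning photo locations into 1°×1° squares. It is written in Python rather than PHP and is not the script in the repository; the function name `bin_to_degree_cells` and the sample coordinates are made up for illustration, and I'm assuming the latitude/longitude pairs have already been extracted from the photo metadata as decimal degrees.

```python
import math
from collections import Counter


def bin_to_degree_cells(points):
    """Count points per 1x1 degree cell.

    Each cell is keyed by the integer (lat, lon) of its south-west corner.
    `points` is any iterable of (latitude, longitude) pairs in decimal degrees.
    """
    counts = Counter()
    for lat, lon in points:
        # Skip anything that is not a valid coordinate.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue
        # floor() rather than int() so that negative coordinates land in the
        # right cell (-4.29 belongs to the cell starting at -5, not -4).
        # Clamp the extreme edges so lat 90 / lon 180 fall in the last cell.
        cell_lat = min(math.floor(lat), 89)
        cell_lon = min(math.floor(lon), 179)
        counts[(cell_lat, cell_lon)] += 1
    return counts


# Hypothetical sample data: two points in the same cell, one elsewhere,
# and one invalid latitude that should be ignored.
sample_points = [(55.87, -4.29), (55.12, -4.95), (-33.86, 151.21), (91.0, 0.0)]
cell_counts = bin_to_degree_cells(sample_points)
```

The resulting counts per cell are what you would then map to colours when drawing the squares. Keying each cell by its south-west corner is just one convenient choice; any consistent scheme would do.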