iPhylo: crowdsourcing

Roderic D. M. Page


Friday, September 28, 2012

Reading the Biodiversity Heritage Library using Readmill

tl;dr Readmill might be a great platform for shared annotation and correction of Biodiversity Heritage Library content.

Thinking about accessing the taxonomic literature, I started revisiting previous ideas. One is DeepDyve (see DeepDyve - renting scientific articles). Imagine not having to pay large sums for an article, but being able to rent it. Yes, open access would be great, but ultimately it's all a question of money (who pays and when); the challenge is to find the mix of models that encourages people to digitise the relevant literature. Instead of publishers insisting we pay US$30 for an article, how about renting it for the short time we actually need to read it?

Another model is unglue.it, a Kickstarter-like company that seeks to raise funds to digitise e-books and make them freely available. unglue.it runs campaigns where people pledge donations, and if sufficient pledges are made the book's rights-holder has the book digitised and released DRM-free.

Looking at unglue.it I stumbled across Readmill, "a curious community of readers, highlighting and sharing the books they love." Readmill has an iPad app where you can highlight passages of text and add your own annotation. These annotations can be shared, and multiple people can read and comment on the same book. Imagine doing this on BHL content. You could highlight parts of the text where the OCR has failed, and provide a correction. You could highlight taxonomic names that automatic parsers have missed, geographic localities, cited literature, etc. All within a nice, social app.

Even better, Readmill has an API. You can retrieve highlights and the comments on those highlights. So, if someone flags a sentence as mangled OCR and provides a correction, that correction could be harvested and fed back to, say, BHL. These corrections could be used to improve searches, as well as the text delivered when generating searchable PDFs, etc.
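As a rough illustration of the harvesting step, here is a minimal Python sketch of how harvested corrections might be applied to a page of OCR text. It makes no calls to the Readmill API. The record shape (a dict with `text` and `comment` keys), the `OCR:` comment prefix, and the function name `apply_ocr_corrections` are all assumptions invented for this example, not anything Readmill or BHL defines.

```python
def apply_ocr_corrections(page_text, highlights, prefix="OCR:"):
    """Apply reader-supplied corrections to a page of OCR text.

    `highlights` is a list of dicts with hypothetical keys:
      "text"    - the highlighted (possibly mangled) passage
      "comment" - the reader's comment; a comment starting with `prefix`
                  is treated as a correction (an assumed convention)
    Returns the corrected text plus the corrections that were skipped
    because the highlighted passage was missing or ambiguous.
    """
    corrected = page_text
    skipped = []
    for highlight in highlights:
        original = highlight.get("text", "")
        comment = highlight.get("comment", "").strip()
        if not comment.startswith(prefix):
            continue  # an ordinary comment, not a correction
        replacement = comment[len(prefix):].strip()
        # Only replace when the passage occurs exactly once; otherwise we
        # cannot tell which occurrence the reader meant.
        if not original or corrected.count(original) != 1:
            skipped.append(highlight)
            continue
        corrected = corrected.replace(original, replacement)
    return corrected, skipped


page = "The genus Rana was descrihed by Linnaeus. Rana temporaria is common."
highlights = [
    {"text": "descrihed", "comment": "OCR: described"},
    {"text": "Rana", "comment": "OCR: Rana"},       # ambiguous: occurs twice
    {"text": "Linnaeus", "comment": "Nice to see"},  # not a correction
]
fixed, skipped = apply_ocr_corrections(page, highlights)
print(fixed)
print(len(skipped), "correction(s) need manual review")
```

A real harvester would more likely use character offsets for each highlight, if the platform exposes them, rather than matching on the highlighted string.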

You can even add highlights via the API, so we could upload an ePub book, then add all the taxonomic names found by uBio or NetiNeti, enabling users to see which bits of text are probably names and to correct any mistakes along the way. Instead of being given a blank canvas, readers would already have annotations to start with.
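To make the pre-seeding idea concrete, the sketch below turns a list of names (as a name finder such as uBio or NetiNeti might return) into character spans that could then be submitted as highlights. The span dictionary, the `note` field, and the function name `seed_highlights` are hypothetical. The upload to Readmill is deliberately left out, because the details of its API are not covered here.

```python
import re


def seed_highlights(text, names):
    """Locate each candidate name in `text` and return non-overlapping spans.

    Where one name contains another (e.g. "Rana" inside "Rana temporaria"),
    the longer match starting at the same position is kept.
    """
    candidates = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            candidates.append((match.start(), match.end()))

    # Sort by start position, longest match first, then drop overlaps.
    candidates.sort(key=lambda span: (span[0], -(span[1] - span[0])))
    spans = []
    last_end = -1
    for start, end in candidates:
        if start < last_end:
            continue  # overlaps a span we have already kept
        spans.append({
            "start": start,
            "end": end,
            "text": text[start:end],
            "note": "candidate taxonomic name",  # hypothetical annotation label
        })
        last_end = end
    return spans


text = "Rana temporaria and Bufo bufo occur here; Rana is widespread."
for span in seed_highlights(text, ["Rana", "Rana temporaria", "Bufo bufo"]):
    print(span["start"], span["end"], span["text"])
```

Exact string matching is the simplest possible choice. OCR errors in the names themselves would need fuzzy matching, which is the kind of mistake readers could then correct.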

Building an app from scratch to read and annotate BHL content would be a major undertaking. From my cursory initial look I wonder if Readmill might just provide the platform we need to clean up and annotate key parts of the BHL corpus?

Thursday, June 28, 2012

Where is the "crowd" in crowdsourcing? Mapping EOL Flickr photos

In any discussion of data gathering or data cleaning the term "crowdsourcing" inevitably comes up. An example where this approach has been successful is the Encyclopedia of Life's Flickr pool, where Flickr users upload images that are harvested by EOL.

Given that many Flickr photos are taken with cameras that have built-in GPS (such as the iPhone, the most common camera on Flickr) we could potentially use the Flickr photos not only as a source of images of living things, but to supplement existing distributional data. For example, Flickr has enough data to fairly accurately construct outlines of countries, cities, and neighbourhoods, see The Shape of Alpha, so what about organismal distribution?

This question is part of a Masters project by Jonathan McLatchie here at Glasgow, comparing distributions of taxa in GBIF with those based on Flickr photos. As part of that project the question arose "where are the Flickr photos being taken?" If most of the photos are being taken in the developed world, then there are at least two problems. The first is the obvious bias against organisms that live elsewhere (i.e., typically many photos won't be taken in those regions where you'd actually like to get more data). Secondly, the presence of zoos, wildlife parks, and botanical gardens means you are likely to get images of organisms well outside their natural range.

Jonathan suggested a "heatmap" of the Flickr photos would help, so to create this I wrote a script to grab metadata for the photos from the Encyclopedia of Life's Flickr pool, extract latitude and longitude, and draw the resulting locations on a map. I aggregated the points into 1°×1° squares, and generated a GBIF-style map of the photos:

[Screenshot: map of EOL Flickr pool photos aggregated into 1°×1° squares]
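The aggregation step can be sketched in a few lines. The original script is written in PHP and is not reproduced here. What follows is an independent Python illustration, assuming the photo metadata has already been reduced to (latitude, longitude) pairs in decimal degrees. The function name `bin_to_degree_squares` and the sample coordinates are made up for the example.

```python
import math
from collections import Counter


def bin_to_degree_squares(points, cell_size=1.0):
    """Count points per grid cell.

    Each cell is keyed by the (latitude, longitude) of its south-west corner.
    Points outside the valid coordinate range are ignored.
    """
    max_row = math.ceil(90 / cell_size) - 1
    max_col = math.ceil(180 / cell_size) - 1
    counts = Counter()
    for lat, lon in points:
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue
        # floor() rounds towards minus infinity, so -4.29 falls in the cell
        # whose western edge is -5, not -4.
        row = min(math.floor(lat / cell_size), max_row)  # keep lat == 90 on the grid
        col = min(math.floor(lon / cell_size), max_col)  # keep lon == 180 on the grid
        counts[(row * cell_size, col * cell_size)] += 1
    return counts


photos = [
    (55.87, -4.29),    # two photos in the same 1-degree square
    (55.10, -4.90),
    (-33.86, 151.21),
    (95.0, 0.0),       # invalid latitude, ignored
]
for cell, n in sorted(bin_to_degree_squares(photos).items()):
    print(cell, n)
```

The per-cell counts can then be mapped to a colour scale to draw the squares. Squares of equal angular size cover less ground area towards the poles, which is worth remembering when comparing densities across latitudes.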

Lots of photos from North America, Europe, and Australasia, as one might expect. Coverage of the rest of the globe is somewhat patchy. I guess the key question to ask is the extent to which the "crowd" (Flickr users in this case) is essentially replicating the sampling biases already in projects like GBIF that are aggregating data from museum collections (most of which are in the developed world).

The PHP code to fetch the photo data and create the map is available on GitHub. You'll need a Flickr API key to run the script. The GitHub repository has an SVG version of the map (with a bitmap background). A bitmap copy of the map is available on FigShare: http://dx.doi.org/10.6084/m9.figshare.92668.