Wednesday, June 11, 2008

More GBIF errors, courtesy of FishBase

Resurrecting iSpecies after moving it to a new folder on one of my servers, and browsing popular searches, I keep coming across clearly erroneous distributions. FishBase seems a major culprit. For example, the common pandora Pagellus erythrinus is a marine fish, yet GBIF displays numerous occurrences in mainland Africa (dots with black centre on map below).

What gives? Well, after struggling with the somewhat non-intuitive GBIF web site I found that the erroneous records are from FishBase. As with the frog example I blogged about earlier, the locality information on the actual records indicates that most of them come from the Mediterranean, but the latitudes and longitudes are reversed. Swapping these, the records show a more believable distribution (white dots on SVG map below). If you don't see the map, use a decent web browser such as Safari 3 or Firefox 2. If you must use Internet Explorer, grab the RENESIS player.
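The swap test described above is simple enough to sketch in a few lines of Python. This is only an illustration, not anything GBIF or FishBase actually runs: the `BoundingBox` type, the `suggest_swap` function and the rough Mediterranean box are all my own assumptions. The idea is that if a point falls outside the region the locality text implies, but lands inside it once latitude and longitude are exchanged, the record is a good candidate for a swap.

```python
from typing import NamedTuple, Optional, Tuple


class BoundingBox(NamedTuple):
    """A latitude/longitude box in decimal degrees (hypothetical helper type)."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Very rough, hand-picked box around the Mediterranean (assumption for illustration).
MEDITERRANEAN = BoundingBox(min_lat=30.0, max_lat=46.0, min_lon=-6.0, max_lon=36.5)


def suggest_swap(lat: float, lon: float,
                 expected: BoundingBox) -> Optional[Tuple[float, float]]:
    """Return swapped (lat, lon) if only the swapped point fits the expected region."""
    if expected.contains(lat, lon):
        return None          # record already looks plausible
    if expected.contains(lon, lat):
        return (lon, lat)    # swapping the two values moves it into the region
    return None              # wrong for some other reason; swapping won't help


# A point recorded in mainland Africa that lands near Sicily once swapped.
print(suggest_swap(15.0, 38.0, MEDITERRANEAN))
```

A record that is already inside the box, or that is outside it either way round, returns `None`, so the check only ever proposes a change and never applies one.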


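[SVG map: GBIF occurrences of Pagellus erythrinus; original records shown as dots with black centres, records with latitude and longitude swapped shown as white dots]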




I know I've harped on about this before, but surely the time is ripe for some clever data cleaning? Especially if users start to lose their trust in GBIF.



7 comments:

Meredith said...

Hey, Rod, how's it going?

We used to harp on certain of GBIF's characteristics a lot, and that has kind of dwindled off, lately.

However, it is time to bring them up again:
1) GBIF is a mediator of data that belong to the data providers. It promises the providers that it won't "mess with their data".
2) The data providers own the data; therefore maintaining and cleaning it is their responsibility.
3) GBIF provides cleaning tools, and feedback about particular problems with record sets (if the users take advantage of this) to the data providers.
4) Data maintenance is a big job, but if everybody helps (i.e. users provide feedback to providers), it can become manageable.

Cheers,
Meredith

David Shorthouse said...

I have mixed feelings on this too. As a data provider of providers, I do what I can to help folks immediately view potential problems with their data and do not release any data via DiGIR unless nomenclature and geocoding are cleaned as much as can be done at my end. What would really help is not the passivity of DiGIR/Tapir but an additional, more "active" approach whereby comments, flags and indeed web service results from reverse geocoding and nomenclature checks could be pushed into a temp table at the provider's end. Hook it up to an RSS feed producer and you have an opportunity for real-time feedback.

Tim Robertson said...
This comment has been removed by the author.
Tim Robertson said...

This is interesting David and Rod, and the new GBIF provider wrapper (ETA end 2008) will go a long way to address these.

Firstly the wrapper will allow browsing and searching taxonomically and geospatially (e.g. maps by species) on the provider data directly, so after installation, errors should be pretty obvious to those who know their data. It will also incorporate feedback through pingback or similar - but how far do you think you can go in automating the cleansing? Are you really going to allow people to change your coordinates without some form of security?

I would propose the dream scenario is that through custom 'pingbacks', the wrapper installation records that someone suggests inverting the LAT/LONG for example. The provider is given some notification (email/RSS), and they can log in to the wrapper and accept the proposed change at the click of a button.

Of course, this only works for wrappers that sit on top of a LIVE DB. Often, the wrapper is running on a local cache and then you lose the synchronisation with the LIVE DB they are using.

Finally the new provider tool will offer simple index views to the data, and allow the provider to call really simple web services to georeference, report on quality etc. It will become very simple for people to put up these services due to the simpler transfer format for simplified views.

Can you elaborate Dave on your thoughts please?

Roderic Page said...

I guess what I would like is for GBIF to flag potential errors, so that even if there is a delay in the provider fixing things, users can see which data points are potentially ropey. Given the number of GBIF providers, and the fact that at no time are all of them online (see BigDig) I'm not confident that errors would be fixed promptly (the DiGIR provider may be a cached copy of the original data, not everybody devotes resources to fixing database errors, etc.).

I'm all in favour of a ping-back mechanism, but I also see great value in GBIF flagging potential errors, partly to provide feedback to users. The fact that errors have persisted to this day suggests that the current mechanism of getting feedback and fixing errors isn't working.

Tim Robertson said...

Hi Rod,
I am not sure if you have seen the event viewer.

This is a work in progress, but you can see processing issues on a per-provider, or per-resource basis in the portal at the moment.

Currently only geospatial and taxonomic quality issues are flagged, and you are already aware that GBIF only checks for coordinates inside a loose bounding box for a country (problematic for marine records). If there were known distributions available, we could also flag against those.
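For readers wondering what a loose country bounding-box check amounts to, here is a minimal Python sketch. It is not the portal's actual code: the `flag_issues` function, the flag names and the sample box for Italy are all hypothetical. Note that it only attaches flags and leaves the coordinates themselves untouched.

```python
from typing import Dict, List, Optional, Tuple

# (min_lat, max_lat, min_lon, max_lon) in decimal degrees
Box = Tuple[float, float, float, float]

# Deliberately loose, made-up box for illustration only.
COUNTRY_BOXES: Dict[str, Box] = {
    "IT": (35.0, 47.5, 6.0, 19.0),
}


def flag_issues(record: Dict[str, object],
                boxes: Dict[str, Box]) -> List[str]:
    """Return a list of issue flags for one occurrence record (hypothetical names)."""
    lat: Optional[float] = record.get("lat")  # type: ignore[assignment]
    lon: Optional[float] = record.get("lon")  # type: ignore[assignment]
    if lat is None or lon is None:
        return ["MISSING_COORDINATES"]
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return ["COORDINATES_OUT_OF_RANGE"]

    box = boxes.get(str(record.get("country")))
    if box is None:
        return ["COUNTRY_UNKNOWN"]

    min_lat, max_lat, min_lon, max_lon = box
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        return ["OUTSIDE_COUNTRY_BOX"]
    return []


# Same point, once with latitude and longitude reversed and once the right way round.
print(flag_issues({"country": "IT", "lat": 15.0, "lon": 38.0}, COUNTRY_BOXES))
print(flag_issues({"country": "IT", "lat": 38.0, "lon": 15.0}, COUNTRY_BOXES))
```

The marine problem is visible even in this toy version: a fish caught well offshore can sit outside any sensible country box without being an error, which is why known distributions would make a better reference.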

Meredith said...

Thanks, Tim, for filling in the technical details about how GBIF does check data when it is registered, etc., and the tools we do make available.

And, I want to point out that while GBIF refrains from "messing with the data" in part for political reasons, real reasons include that it is intended to share the load of data cleansing by keeping it spread out among the data providers.

The store of metadata that GBIF keeps about the datasets is growing, though, and as Tim points out, there will be ways for the user to skip a dodgy record.

GBIF does not have sufficient funds (nor anything even approaching enough) to be the curator of everybody's data. So as desirable as Rod's wish for GBIF to provide cleaned data is, the best that it can do is try to mediate enough data that there will be a high signal-to-noise ratio if people use it without performing some data cleansing themselves, OR expect users to cleanse their data and help the providers by giving feedback, AND provide data cleaning tools and feedback to providers. Only the first of these is as yet unrealized.