Wednesday, June 04, 2014

Visual analysis of GBIF data

Tim Robertson and the team at GBIF are working on some nice visualisations of GBIF data, and have made an early release available for viewing: http://analytics.gbif-uat.org. For a given country, say the Solomon Islands, you can see numerous plots, mostly like this:

Gbif

Ever the critic, as much as I like this (and appreciate the size of the task of doing analytics on data at the scale of GBIF), what I would really like to see is something that more closely resembles Google Analytics. I want graphs that I can use to get some insight into the data, and which lead me to ask questions (and make it easy for me to discover the answers).

So, I put together a crude, live demo of the sort of thing I'd like to see. You can see it at http://bionames.org/~rpage/gbif-stats (can't promise that this link will be long-lived), and below is a screen shot:
Stats
What I've done is fetch all the occurrence records for the Solomon Islands from GBIF (using the API), dump them into CouchDB, and generate some simple queries. I display the results using Google Charts. There are some similarities with the tools developed by Javier Otegui, Arturo Ariño, and colleagues.

Otegui, J., Ariño, A. H., Encinas, M. A., & Pando, F. (2013, January 25). Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). (G. P. S. Raghava, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0055144
Otegui, J., & Ariño, A. H. (2012, August 15). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics. Oxford University Press (OUP). doi:10.1093/bioinformatics/bts359
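
The fetch step described above can be sketched in Python. This is a minimal sketch, not the code behind the demo: it assumes the current GBIF occurrence search endpoint (`https://api.gbif.org/v1/occurrence/search`), the ISO country code `SB` for the Solomon Islands, and a paged JSON response with `results` and `endOfRecords` fields. The function names `fetch_page` and `iter_occurrences` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; the URL used for the original demo is not given in the post.
API_URL = "https://api.gbif.org/v1/occurrence/search"


def fetch_page(country, offset, limit):
    """Fetch one page of occurrence records as a decoded JSON dict."""
    query = urlencode({"country": country, "offset": offset, "limit": limit})
    with urlopen(f"{API_URL}?{query}") as response:
        return json.load(response)


def iter_occurrences(country, fetch=fetch_page, limit=300):
    """Yield every occurrence record for a country, one page at a time.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    offset = 0
    while True:
        page = fetch(country, offset, limit)
        results = page.get("results", [])
        yield from results
        # Stop when the API reports the last page or returns nothing.
        if page.get("endOfRecords", True) or not results:
            break
        offset += len(results)
```

Each yielded record is a plain dict, so `for record in iter_occurrences("SB"): ...` could write the records to CouchDB or any other store.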


For fun I've also added a map of the GBIF occurrences (also served from CouchDB).

Here's a quick guide to some of the charts. Below you can see (left) a plot of species accumulation over time, that is, the total number of species that have been collected up to that time. If we had collected all the species we'd expect this curve to asymptote (flatten out). If it keeps going up, then we still need to do some sampling. On the right is the number of occurrences recorded for each year. You can see that collecting is highly episodic.

Stats2
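
A species accumulation curve like the one described above can be computed by finding the first year each species was recorded and then keeping a running total. The sketch below assumes each record is a dict with `year` and `species` keys (as in GBIF's occurrence JSON); records missing either value are skipped. The function name is hypothetical.

```python
from collections import Counter


def species_accumulation(records):
    """Return [(year, cumulative number of species first seen by that year)]."""
    first_seen = {}
    for rec in records:
        year = rec.get("year")
        species = rec.get("species")
        if year is None or species is None:
            continue  # undated or unidentified records cannot be placed on the curve
        if species not in first_seen or year < first_seen[species]:
            first_seen[species] = year

    # Count how many species make their first appearance in each year.
    new_per_year = Counter(first_seen.values())

    curve = []
    total = 0
    for year in sorted(new_per_year):
        total += new_per_year[year]
        curve.append((year, total))
    return curve
```

Only years in which a new species appears are included in the result; a flat stretch between two points means no new species were recorded in between.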

To get a little more information on this, I've generated a crude chart where the rows are institutions (e.g., museums and herbaria) that have specimens, and the shaded boxes represent the number of occurrences collected in each decade. The rightmost box is the current decade, and if you hover over a box you will see a popup with the decade. To the right is the total number of occurrences.
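
The table behind a chart like this can be sketched as a tally of occurrences per institution and decade. This assumes each record carries the `institutionCode` and `year` fields found in GBIF occurrence JSON; the function name is hypothetical, and the demo itself builds its summaries with CouchDB queries rather than Python.

```python
from collections import Counter, defaultdict


def decade_counts_by_institution(records):
    """Map institution code -> Counter of {decade: number of occurrences}."""
    table = defaultdict(Counter)
    for rec in records:
        year = rec.get("year")
        if year is None:
            continue  # undated records cannot be assigned to a decade
        institution = rec.get("institutionCode") or "unknown"
        decade = (year // 10) * 10  # e.g. 1944 -> 1940
        table[institution][decade] += 1
    return table
```

Summing a row's counter (`sum(table["K"].values())`) gives the per-institution total shown to the right of each row.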

Chart
From this we can see that there have been some major collections at various times (e.g., Kew in the 1960s, the Australian Museum from the 1970s to the 1990s). Strangely, the MCZ has lots of specimens from the 1700s, so I suspect we have a data quality issue here. There are certainly some issues with dates in this data set, with about a quarter of occurrences having no date:

Date
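
A quick check for the two date problems mentioned above (missing dates and implausibly old ones) might look like the following sketch. The `year` field name follows GBIF's occurrence JSON; the cutoff `earliest_plausible=1800` is an arbitrary, hypothetical threshold chosen only for illustration.

```python
def date_summary(records, earliest_plausible=1800):
    """Summarise how many records lack a year or have a suspiciously early one."""
    total = len(records)
    missing = sum(1 for rec in records if rec.get("year") is None)
    suspicious = sum(
        1
        for rec in records
        if rec.get("year") is not None and rec["year"] < earliest_plausible
    )
    return {
        "total": total,
        "missing": missing,
        "suspicious": suspicious,
        # Guard against division by zero for an empty data set.
        "missing_fraction": missing / total if total else 0.0,
    }
```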

Note that the data for the Solomon Islands come from all around the world, mostly from the US. There is a big spike in the date of collection curve in 1944, suggesting a lot of material may be the result of collecting by US servicemen in WW2.

Map

I use a treemap to display the taxonomic distribution of the records, and a donut chart to summarise the taxonomic level to which the occurrences are identified:

Taxa
The treemap is dominated by vertebrates, which I suspect is a poor reflection of the actual taxonomic composition of the Solomon Islands biota. Over 3/4 of the occurrences are identified to species level, which is encouraging, but there's clearly a lot of material that needs some taxonomic work.
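
The figures behind the donut chart amount to the share of records at each taxonomic rank. A minimal sketch, assuming each record has a `taxonRank` field with values such as `SPECIES` or `GENUS` (as in GBIF's occurrence JSON); the function name and the `UNKNOWN` label are hypothetical.

```python
from collections import Counter


def rank_shares(records):
    """Return {rank: fraction of records identified to that rank}."""
    counts = Counter(rec.get("taxonRank") or "UNKNOWN" for rec in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {rank: n / total for rank, n in counts.items()}
```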

Where next


This has been made in a rush, and there is a lot more that could be done. For example, some of the charts would be more useful if you could drill down and explore further. This could be done via the GBIF API or portal (for example, by constructing a URL that shows the portal results for the Solomon Islands for a given year of collection).
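
As a rough illustration of the drill-down idea, the sketch below builds a search URL for one country and one year of collection. The base URLs and the `country` and `year` parameter names are assumptions based on the current GBIF API and portal, and may not match what was available when this was written; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URLs for the GBIF API and portal occurrence search.
API_SEARCH = "https://api.gbif.org/v1/occurrence/search"
PORTAL_SEARCH = "https://www.gbif.org/occurrence/search"


def api_search_url(country, year):
    """URL returning occurrence records (JSON) for a country and year."""
    return f"{API_SEARCH}?{urlencode({'country': country, 'year': year})}"


def portal_search_url(country, year):
    """URL showing the same search in the portal's web interface."""
    return f"{PORTAL_SEARCH}?{urlencode({'country': country, 'year': year})}"
```

A chart could then link each year's bar to `portal_search_url("SB", year)`, so that clicking on the 1944 spike opens the matching records.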

There are, of course, issues of scalability. I've made this for the 83,364 occurrences currently in the GBIF portal for the Solomon Islands. There would need to be some thought given to how this could be scaled to larger data sets. But I think this is worth pursuing so that we can get further insights into the remarkable database that GBIF is building.