iPhylo: millipedes

Roderic D. M. Page

Showing posts with label millipedes. Show all posts

Thursday, May 02, 2013

GBIF data quality: visualising Mesibov's millipedes

Bob Mesibov (who has been a guest author on this blog) recently published a paper on data quality in in ZooKeys:

Mesibov, R. (2013). A specialist’s audit of aggregated occurrence records. ZooKeys, 293(0), 1–18. doi:10.3897/zookeys.293.5111

In this paper Bob documents some significant discrepancies between data in his Millipedes of Australia (MoA) database and the equivalent data in the Atlas of Living Australia and GBIF (disclosure, I was a reviewer of the paper, and also sit on GBIF's science committee). This paper spawned a thread on TAXACOM, and also came up at the GBIF meeting I was at earlier this week.

One thing lacking from the discussion is a clear sense of just how big are the discrepancies between GBIF and MoA data, so I grabbed the data provided by Bob (http://dx.doi.org/10.3897/zookeys.293.5111.app and extracted the records where GBIF and MoA disagreed. I converted these to GeoJSON and threw them on Google Maps:

Mesibov2

You can see a live version here http://bl.ocks.org/rdmpage/raw/5501293/ (it can take a little while for the map to appear). I've connected the MoA and GBIF localities for the same occurrence by a straight line, and the the MoA records are encircled by an estimate of their uncertainty (for many records the circle is invisible at this scale).

There are some fairly spectacular discrepancies, and a lot of relatively small scale displacements of records. Does this matter? The answer to this question will depend on what people want to do with the data. You may regard the discrepancies as serious (certainly it's interesting that there are so many differences between the two data sets), or minor given the geographic scale. But visualising them at least makes it possible to form a judgement.

Saturday, August 23, 2008

Reasons text mining will fail. I. UTM Grid References and GenBank accession numbers

OMG. Playing with extracting identifiers from text, I have a regular expression for GenBank accession numbers that looks something like this:
(A[A-Z])[0-9]{6} | (U[0-9]){5} | (D[A-Z])[0-9]{6} | (E[A-Z])[0-9]{6} | (NC_)[0-9]{6}).
OK, it won't get everything, but what is more worrying are the things it will pickup that aren't GenBank accession numbers.

For example, I ran Robert Mesibov's 2005 paper "The millipede genus Lissodesmus Chamberlin, 1920 (Diplopoda: Polydesmida:
Dalodesmidae) from Tasmania and Victoria, with descriptions of a new genus and 24 new species" [PDF here] through a script, and out came loads of GenBank accession numbers ... which is a worry as there aren't any sequences in this paper.

Turns out, Mesibov uses UTM grid references to describe localities, and these look like just GenBank accessions. There is a nice web site here which describes how UTM grid references are determined in Tasmania (from which the image below is taken).

Not all the "accession numbers" in Mesibov(2005) exist in GenBank, but some do, for example grid reference DQ402119 (41°26'31''S 146°17'02''E) is also a sequence DQ402119 and, you guessed it, it's not from a millipede. So, I need to be a little bit careful in extracting identifiers from text.