Inspired by the Benoît Fontaine et al. paper on the lag time between a species being discovered and subsequently described (see Species wait 21 years to be described - show me the data) I thought I would do a quick and dirty plot of the difference between the year a specimen was collected and the year the name of the taxon it belongs to was published (from the authorship string for the scientific name). Plotting the results was *cough* interesting:
In theory, the difference between the two dates should be negative (if you subtract publication year from collection year), and the closer it is to zero, the shorter the wait for description. But I found some large positive numbers, implying that taxa had been described long before the types were discovered! Something is clearly wrong. What seems to be happening here is that GBIF has failed to match the species name for an occurrence, and so goes up the taxonomic hierarchy and just records the genus. For example, http://gbif.org/occurrence/472764211 was collected in 1965 and is the type of Pandanus guadalcanalius St.John. GBIF doesn't recognise this name, and so matches the occurrence to the genus Pandanus Linnaeus, 1782. Hence it looks like we've used a time machine to describe a taxon in 1782 based on a specimen from 1965.
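To make the arithmetic concrete, here is a minimal sketch of the lag calculation in Python. It assumes the publication year is the last four-digit year in the name string; the function names and the regular expression are illustrative, not anything GBIF provides. Names with no year in the authorship (such as Pandanus guadalcanalius St.John) simply yield no lag.

```python
import re
from typing import Optional

# Assumption: the publication year is the last four-digit year
# (1500-2099) that appears in the scientific name string.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def publication_year(scientific_name: str) -> Optional[int]:
    """Return the last four-digit year in a name string, or None."""
    years = YEAR_RE.findall(scientific_name)
    return int(years[-1]) if years else None


def description_lag(collection_year: int, scientific_name: str) -> Optional[int]:
    """Collection year minus publication year (negative means a wait)."""
    published = publication_year(scientific_name)
    return None if published is None else collection_year - published


def is_suspicious(lag: Optional[int]) -> bool:
    """A positive lag means the name predates the specimen."""
    return lag is not None and lag > 0


lag = description_lag(1965, "Pandanus Linnaeus, 1782")
print(lag, is_suspicious(lag))
```

A positive lag does not say what went wrong, only that the record deserves a closer look.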
At the other end of the spectrum, there are a lot of specimens that seem to have waited over 200 years for description! It turns out these are mostly specimens from the MCZ that have their collection date recorded by GBIF as "1700-01-01". This looks like an arbitrary date, and it is an artefact: the MCZ records "unknown" collection dates as the range 1700-01-01 - 2100-01-01 (see http://mczbase.mcz.harvard.edu/guid/MCZ:IZ:DIPL-4985). Unfortunately, when the export for GBIF is generated, these ranges get truncated to 1700-01-01, and GBIF then (not unreasonably) treats that as the actual collection date. Somewhere in the middle of the plot of lag between collection and description is some interesting information, but it's a pity that most of this is obscured by some serious data errors.
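One way to keep these placeholder dates out of a plot is to treat them as missing before computing any lag. The sketch below is an assumption on my part rather than anything GBIF or the MCZ does: the set of placeholder dates contains only the one value seen above, and the function name is made up for illustration.

```python
from datetime import date
from typing import Optional

# Assumption: 1700-01-01 is the only placeholder worth filtering,
# based on the truncated MCZ "unknown" range described above.
PLACEHOLDER_DATES = {date(1700, 1, 1)}


def usable_collection_year(event_date: str) -> Optional[int]:
    """Return the year of an ISO date string, or None if the date is
    unparseable or is a known placeholder for "unknown"."""
    try:
        parsed = date.fromisoformat(event_date[:10])
    except ValueError:
        return None
    if parsed in PLACEHOLDER_DATES:
        return None
    return parsed.year


print(usable_collection_year("1700-01-01"))  # placeholder, treated as missing
print(usable_collection_year("1965-06-15"))  # hypothetical ordinary date
```

This would also discard any specimen genuinely collected on 1 January 1700, which seems an acceptable price for a quick and dirty plot.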
For me the bigger lesson here is the power of visualisation to explore the data and to expose errors. This is why I was underwhelmed by the new charts GBIF is releasing. Plots of ever upward trends are ultimately not very useful. They don't give much insight into the data, nor do they help tackle interesting questions. I think we need a much richer set of visualisations to really understand the strengths and limitations of the data in GBIF.
Update
Investigating further, there are some other reasons for the "back to the future" types. For example, http://www.gbif.org/occurrence/188826624 (CAS 5506 from FishBase) was collected in 1933 and is recorded as a holotype, with the scientific name Cypselurus opisthopus (Bleeker, 1865). 1933 - 1865 = 68, so the taxon was named 68 years before it was collected(!).
A bit of investigation using BioNames, BioStor, and GBIF (http://www.gbif.org/occurrence/473244692, another record for CAS 5506) reveals that CAS 5506 is the holotype for Cypselurus crockeri, shown below in a plate from its original description (published in 1935):
Seale A (1935) The Templeton Crocker Expedition to western Polynesian and Melanesian islands, 1933. No. 27. Fishes. Proceedings of the California Academy of Sciences 21: 337–378. http://biostor.org/reference/59326
So, in fact this species was described shortly after its collection, with a lag of 1933 - 1935 = -2 years.
Apart from the duplication issue (FishBase has replicated some of the CAS dataset, sigh), the other problem is one of modelling the data. The CAS record has the original taxon name for which CAS 5506 is the type (Cypselurus crockeri), whereas the FishBase record has the currently accepted name for the taxon (Cypselurus opisthopus). These two different approaches have very different implications for the charts I'm making, and simply reinforce my feeling that the GBIF data is both fascinating and full of "gotchas!".
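A rough way to surface cases like CAS 5506 is to group records by catalogue number and report any specimen whose records disagree about the lag. The sketch below uses the two records discussed above. The field names are hypothetical, not GBIF's, and it assumes the publication year has already been extracted for each name.

```python
from collections import defaultdict

# The two records for CAS 5506 discussed above (field names are hypothetical).
records = [
    {"dataset": "CAS", "catalog": "CAS 5506",
     "name": "Cypselurus crockeri", "published": 1935, "collected": 1933},
    {"dataset": "FishBase", "catalog": "CAS 5506",
     "name": "Cypselurus opisthopus", "published": 1865, "collected": 1933},
]


def conflicting_lags(records):
    """Map catalogue number -> {dataset: lag} for specimens whose
    records give different collection-to-publication lags."""
    by_specimen = defaultdict(dict)
    for r in records:
        by_specimen[r["catalog"]][r["dataset"]] = r["collected"] - r["published"]
    return {
        catalog: lags
        for catalog, lags in by_specimen.items()
        if len(set(lags.values())) > 1
    }


print(conflicting_lags(records))
```

Matching on catalogue number alone is a simplification, since different institutions could reuse the same number.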