Quick notes on another example of data duplication in GBIF. I'm in the process of building a tool to map specimen codes to GBIF records, and came across the following example. Consider the specimen code "AM M.22320", which is the voucher for the sequence KJ532444 (GenBank gives the voucher as M22320, but the associated paper doi:10.1016/j.ympev.2014.03.009 clarifies that this specimen comes from the Australian Museum). Locating this specimen in GBIF I found a series of records that were identical except for the catalogNumbers, which looked like this: M.22320.001, M.22320.002, etc. What gives?
Initially I thought this might be a simple case of data duplication (maybe the suffixes represent different versions of the same record?). Then I managed to locate the records on the Australian Museum web site:
- M.22320.009 - Wet Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.008 - Skull Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.001 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.005 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.006 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.007 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.003 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.004 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.002 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
- M.22320.010 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990
It turns out we have 10 records for "M.22320", which include various preparations and tissue samples. The holotype specimen for Pteralopex taki (originally described in doi:10.1071/AM01145, see BioNames) has generated 10 different records, all of which have ended up in GBIF. Anyone using GBIF occurrence data and interpreting the number of occurrence records as a measure of how abundant an organism is at a given locality is clearly going to be misled by data like this.
One way to tackle this problem would be for GBIF (or the data provider) to cluster the records that represent the "same" specimen, so GBIF doesn't end up duplicating the same information (in this case, 10-fold). The Australian Museum records don't seem to specify a direct link between the 10 records. I then located the same records in OZCAM, the data provider that feeds GBIF. Here is the OZCAM record for "M.22320.001": http://ozcam.ala.org.au/occurrence/223d1549-1322-419e-8af4-649a4b145064. OZCAM doesn't say whether the record is a skull, a wet preparation, or a tissue sample. That information has been lost, and hence doesn't make it as far as GBIF.
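As a rough illustration of the clustering idea, here is a minimal Python sketch that groups catalogNumbers by stripping a trailing part suffix. It assumes that parts are always written as a dot followed by exactly three digits (as in "M.22320.001"). That is only a guess based on this one example, not a documented Australian Museum convention, and it would wrongly merge any collection whose ordinary catalogNumbers end in ".ddd". The function names are made up for this post.

```python
import re
from collections import defaultdict

# Assumption: a "part" suffix is a dot followed by exactly three digits
# at the end of the catalogNumber, e.g. "M.22320.001" -> "M.22320".
PART_SUFFIX = re.compile(r"^(?P<base>.+)\.(?P<part>\d{3})$")


def base_catalog_number(catalog_number):
    """Return the catalogNumber with any three-digit part suffix removed."""
    value = catalog_number.strip()
    match = PART_SUFFIX.match(value)
    return match.group("base") if match else value


def cluster_by_specimen(catalog_numbers):
    """Group catalogNumbers that share the same base (hypothetical helper)."""
    clusters = defaultdict(list)
    for number in catalog_numbers:
        clusters[base_catalog_number(number)].append(number)
    return dict(clusters)


# The ten part records listed above: M.22320.001 ... M.22320.010
records = [f"M.22320.{i:03d}" for i in range(1, 11)]
clusters = cluster_by_specimen(records)

print(len(clusters))             # 1 specimen
print(len(clusters["M.22320"]))  # 10 part records
```

Counting clusters rather than raw records would give one occurrence for this holotype instead of ten. Matching on catalogNumber alone is fragile, though; a real implementation would presumably also compare institution, locality, and date.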
Note that OZCAM has resolvable identifiers for each specimen in the form of UUIDs that are appended to "http://ozcam.ala.org.au/occurrence/". The corresponding UUIDs are included in the Darwin Core dump that OZCAM makes available to GBIF. Here they are for the parts of M.22320:
"223d1549-1322-419e-8af4-649a4b145064","M.22320.001",...
"c40a7eea-6e04-4be6-8dcb-4473402e48c4","M.22320.002",...
"21fcaea1-c645-49d9-9753-dbd9dd2bc64a","M.22320.003",...
"34ffd935-9fb4-44a5-acb8-2cd4df5ade62","M.22320.004",...
"03635fb8-f9ac-4c4c-898b-859cd42f1e26","M.22320.005",...
"a1c4dd5a-dc03-45cc-8971-931c739df8b2","M.22320.006",...
"71c91030-405c-4390-8ec3-42a5478a2fd8","M.22320.007",...
"0f1a9326-34d0-4fb2-b89a-9856bd9082f0","M.22320.008",...
"86270ef7-07f6-4395-84c7-66d5d497cc01","M.22320.009",...
But when GBIF parses the dump it ignores these UUIDs, which means the GBIF user can't easily go to the OZCAM site, which has a bunch of other useful information (compare http://ozcam.ala.org.au/occurrence/223d1549-1322-419e-8af4-649a4b145064 with http://www.gbif.org/occurrence/774916561/verbatim). It also means that GBIF has stripped out an identifier that we might make use of to unambiguously refer to each record (and, presumably, this UUID doesn't change between harvests of OZCAM data).
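To show how little work it would take to keep these identifiers, here is a small Python sketch that reads the first two columns of rows like those in the dump and rebuilds the resolvable OZCAM URL for each catalogNumber. It assumes the UUID is the first column and the catalogNumber the second, as in the excerpt above. The remaining columns are elided in that excerpt, so the sample data here simply omits them, and the function name is my own.

```python
import csv
import io

OZCAM_PREFIX = "http://ozcam.ala.org.au/occurrence/"

# First two columns only, copied from the excerpt above; the real dump
# has further columns that are not shown here.
sample = '''"223d1549-1322-419e-8af4-649a4b145064","M.22320.001"
"c40a7eea-6e04-4be6-8dcb-4473402e48c4","M.22320.002"
'''


def ozcam_urls(csv_text):
    """Map each catalogNumber to its OZCAM URL (hypothetical helper).

    Assumes column 0 is the UUID and column 1 is the catalogNumber.
    """
    urls = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        uuid, catalog_number = row[0], row[1]
        urls[catalog_number] = OZCAM_PREFIX + uuid
    return urls


for catalog_number, url in ozcam_urls(sample).items():
    print(catalog_number, url)
```

For "M.22320.001" this gives the same OZCAM URL quoted earlier in the post, so carrying the UUID through to GBIF would be enough to link each GBIF record back to its source.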
In summary, this is a bit of a mess: we have multiple records that are really just bits of the same specimen but which are not linked together by any data provider, and as the data is transmitted up the chain to GBIF, clues as to what is going on are stripped out. For a user like me who is trying to link the GenBank sequence to its voucher this is frustrating, and ultimately all rather avoidable if we took just a little more care in how we represent data about specimens, and how we treat that data as it gets transmitted between databases.