Thursday, January 09, 2014

Annotating GBIF: some thoughts

Given that it's the start of a new year, and I have a short window before teaching kicks off in earnest (and I have to revise my phyloinformatics course), I'm playing with a few GBIF-related ideas. One topic which comes up a lot is annotating and correcting errors. There has been some work in this area [1][2] but it strikes me as somewhat complicated. I'm wondering whether we couldn't try and keep things simple.

From my perspective there are a bunch of problems to tackle. The first is that occurrence data that ends up in GBIF may be incorrect, and it would be nice if GBIF users could (at the very least) flag those errors, and even better fix them if they have the relevant information. For example, it may be clear that a frog apparently in the middle of the ocean is there because latitude and longitude were swapped, and this could be easily fixed.
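To make the swapped-coordinates case concrete, here is a minimal sketch of how such an error might be flagged automatically. It assumes we have a rough bounding box for the region the record claims to come from; the function name `suggest_swap` and the box used in the example are purely illustrative, and a real check would need proper country or land polygons.

```python
def suggest_swap(lat, lon, bbox):
    """Return (lat, lon) with the values swapped if that fixes the record.

    bbox is a hypothetical (min_lat, min_lon, max_lat, max_lon) box for
    the region the record says it comes from. Returns None when the
    record is already inside the box, or when swapping does not help.
    """
    min_lat, min_lon, max_lat, max_lon = bbox

    def inside(la, lo):
        return min_lat <= la <= max_lat and min_lon <= lo <= max_lon

    if inside(lat, lon):
        return None  # nothing to fix
    # A swapped value is only plausible if the old longitude is a valid latitude.
    if -90 <= lon <= 90 and inside(lon, lat):
        return (lon, lat)
    return None


# Very rough box around Madagascar (an assumption for illustration only).
madagascar = (-26.0, 43.0, -11.5, 51.0)
print(suggest_swap(47.5, -18.9, madagascar))
```

A suggestion like this would only ever be a flag for a human (or the provider) to confirm, not an automatic correction.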

Another issue is that data on an occurrence may not be restricted to a single source. It's tempting to think, for example, that the museum housing a specimen has the authoritative data on that specimen, but this need not be the case. Sometimes museums either lack (or decide not to make available) data such as geographic coordinates, but this information is available from other sources (such as the primary literature, or GenBank, see e.g. Linking GBIF and GenBank). Speaking of GenBank, there is a lot of basic biodiversity data in GenBank (such as georeferenced voucher specimens) and it would be great to add that data to GBIF. One issue, however, is that some of the voucher specimens in GenBank will already be in GBIF, potentially creating duplicate records. Ideally each specimen would be represented just once in GBIF, but for a bunch of reasons this is tricky to do (for a start, few specimens have globally unique identifiers, see DOIs for specimens are here, but we're not quite there yet), hence GBIF has duplicate specimen records. So, we are going to have to live with multiple records for the "same" thing.

Lastly there is the ongoing bugbear that URLs for GBIF occurrences are not stable. This is frustrating in the extreme because it defeats any attempt to link these occurrences to other data (e.g., DNA sequences, the Biodiversity Heritage Library, etc.). If the URLs regularly break then there is little incentive to go to the trouble of creating links between different data bases, and biodiversity data will remain in separate silos.

So, we have three issues: user edits and corrections of data hosted by GBIF, multiple sources of data on the same occurrence, and a lack of persistent links to occurrences.

If we accept that the reality is we will always have duplicates, then the challenge becomes how to deal with them. Let's imagine that we have multiple instances of data on the same occurrence, and that we have some way of clustering those records together (e.g., using the specimen code, the Darwin Core Triple, additional taxonomic information, etc.). Given that we have multiple records, we may have multiple values for the same item, such as locality, taxon name, geo-coordinates, etc. One way to reconcile these is to use an approach developed for handling bibliographic metadata derived from citations, as described in [3] (PDF here). If you are building a bibliographic database from lists of literature cited, you need to cluster the citations that are sufficiently similar to be likely to be the same reference. You might also want to combine those records to yield a best estimate of the metadata for the actual reference (in other words, one author might have cited the article with an abbreviated journal name, another author might have cited only the first page, etc., but all might agree on the volume the article occurs in). Councill et al. use Bayesian belief networks to derive an estimate of the correct metadata.
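As a rough sketch of the clustering step, the simplest possible approach is to group records that share a normalised Darwin Core Triple (institutionCode, collectionCode, catalogNumber). The function names and the specimen codes in the example are made up for illustration, and exact matching on the triple is an assumption; real data would almost certainly need fuzzier matching and extra evidence such as taxon names.

```python
from collections import defaultdict


def triple_key(record):
    """Build a normalised Darwin Core Triple, e.g. 'MVZ:HERP:12345'."""
    parts = (
        record.get("institutionCode", ""),
        record.get("collectionCode", ""),
        record.get("catalogNumber", ""),
    )
    # Trim whitespace and ignore case so trivial variants still match.
    return ":".join(p.strip().upper() for p in parts)


def cluster_by_triple(records):
    """Group records (dicts of Darwin Core terms) that share a triple."""
    clusters = defaultdict(list)
    for record in records:
        clusters[triple_key(record)].append(record)
    return dict(clusters)


# Hypothetical records: the first two describe the same specimen.
records = [
    {"institutionCode": "MVZ", "collectionCode": "Herp", "catalogNumber": "12345"},
    {"institutionCode": "mvz ", "collectionCode": "HERP", "catalogNumber": "12345",
     "source": "GenBank"},
    {"institutionCode": "USNM", "collectionCode": "Herp", "catalogNumber": "999"},
]
for key, members in cluster_by_triple(records).items():
    print(key, len(members))
```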

What is nice about this approach is that you retain all the original data, and you can weight each source by some measure of its reliability (i.e., the "prior"). Hence, we could weight a user's edits based on some measure, such as the acceptance of other edits they've made or, say, their authority (a user who is the author of a taxonomic revision of a group might know quite a bit about the specimens belonging to those taxa). If a user edits a GBIF record (say, by adding latitude and longitude values) we could add that as a "new" record, linked to the original, and containing just the edited values (we could also enable the user to confirm that other values are correct).

So, what do we show regular users of GBIF if we have multiple records for the same occurrence? In effect we compute a "consensus" based on the multiple records, taking into account the prior probabilities that each source is reliable. What about the museums (or other "providers")? Well, they can grab all the other records (e.g., the user edits, the GenBank information, etc.) and use them to update their records, if they so choose. If they do so, the next time GBIF harvests their data, the GBIF version of that data is updated, and we can recompute the new "consensus". It would be nice to have some way of recording whether the other edits/records were accepted, so we can gauge the reliability of those sources (a user whose edits are consistently accepted gets "up voted"). The provider could explicitly tell GBIF which edits it accepted, or we could infer them by comparing the new and old versions.
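To give a feel for what computing a "consensus" could look like, here is a deliberately simple sketch. Note that this is a weighted vote per field, not the Bayesian belief network of Councill et al. [3]; the weights stand in for the source "priors", and the function name, weights, and field values are all assumptions for illustration.

```python
from collections import defaultdict


def consensus(weighted_records):
    """Pick, for each field, the value with the greatest total weight.

    weighted_records is a list of (weight, fields) pairs, where weight is
    a stand-in for how much we trust the source and fields maps a field
    name to that source's value. Missing values (None) cast no vote, so a
    user edit containing only coordinates does not affect other fields.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for weight, fields in weighted_records:
        for field, value in fields.items():
            if value is not None:
                scores[field][value] += weight
    return {field: max(votes, key=votes.get) for field, votes in scores.items()}


# Hypothetical cluster: a museum record lacking coordinates, a user edit,
# and a GenBank record that agrees with the user's latitude.
cluster = [
    (0.6, {"scientificName": "Boophis sp.", "decimalLatitude": None}),
    (0.3, {"decimalLatitude": -18.9}),
    (0.5, {"decimalLatitude": -18.9}),
]
print(consensus(cluster))
```

Because the original records are never modified, the consensus can simply be recomputed whenever a provider is re-harvested or a source's weight changes.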

To retain a version history we'd want to keep the new and old provider records. This could be done using timestamps: any record has a creation date and an expiry date. By default the expiry date is far in the future, but if a record is replaced its expiry date is set to that time, and it is ignored when indexing the data.
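The timestamp idea might look something like the sketch below. The class and function names are hypothetical, and an in-memory list stands in for whatever store GBIF would actually use.

```python
from dataclasses import dataclass
from datetime import datetime

FAR_FUTURE = datetime.max  # default expiry: effectively "never"


@dataclass
class VersionedRecord:
    occurrence_id: str
    fields: dict
    created: datetime
    expires: datetime = FAR_FUTURE


def supersede(store, new_record):
    """Add new_record, expiring (not deleting) any live older version."""
    for old in store:
        if (old.occurrence_id == new_record.occurrence_id
                and old.expires > new_record.created):
            old.expires = new_record.created
    store.append(new_record)


def live_records(store, when):
    """Records that should be indexed at time `when`."""
    return [r for r in store if r.created <= when < r.expires]


store = []
supersede(store, VersionedRecord("occ-1", {"locality": "old"}, datetime(2013, 1, 1)))
supersede(store, VersionedRecord("occ-1", {"locality": "new"}, datetime(2014, 1, 9)))
print([r.fields for r in live_records(store, datetime(2014, 6, 1))])
```

Both versions stay in the store, so the state of the data at any earlier date can still be reconstructed.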

How does this relate to duplicates? Well, GBIF has a habit of deleting whole sets of data if it indexes data from a provider and that provider has done something foolish, such as changing the fields GBIF uses to identify the record (another reason why globally unique identifiers for specimens can't come soon enough). Instead of deleting the old records (and breaking any links to those records) GBIF could simply set their expiry date but keep them hanging around. They would not be used to create consensus records for an occurrence, but if someone used a link that had a now-deleted occurrence id they could be redirected to the current cluster that corresponds to that old id, and hence the links would be maintained (albeit pointing to possibly edited data).
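A minimal sketch of that redirect, assuming we keep a lookup from every occurrence id ever issued (expired or not) to its cluster. The function name, the ids, and the two dictionaries are hypothetical stand-ins for whatever index GBIF would maintain.

```python
def resolve(occurrence_id, id_to_cluster, live_ids_by_cluster):
    """Resolve any occurrence id, even an expired one, to its cluster.

    id_to_cluster maps every id ever issued to a cluster id and is never
    pruned. live_ids_by_cluster lists the ids currently being indexed.
    Returns None only for ids that were never issued.
    """
    cluster_id = id_to_cluster.get(occurrence_id)
    if cluster_id is None:
        return None
    live = live_ids_by_cluster.get(cluster_id, [])
    return {
        "cluster": cluster_id,
        "live_ids": live,
        "redirected": occurrence_id not in live,
    }


# Hypothetical: the provider changed its identifiers, so occ-1 expired
# and the same specimen was re-indexed as occ-7.
id_to_cluster = {"occ-1": "cluster-A", "occ-7": "cluster-A"}
live_ids_by_cluster = {"cluster-A": ["occ-7"]}
print(resolve("occ-1", id_to_cluster, live_ids_by_cluster))
```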

This is still a bit half-baked, but I think the challenge GBIF faces is how to make the best of messy data which may lack a single definitive source. The ability for users to correct GBIF-hosted data would be a big step forward, as would the addition of data from GenBank and the primary literature (the latter has the advantage that in many cases it will presumably have been scrutinised by experts). The trick is to make this simple enough that there is a realistic chance of it being implemented.


[1] Wang, Z., Dong, H., Kelly, M., Macklin, J. A., Morris, P. J., & Morris, R. A. (2009). Filtered-Push: A Map-Reduce Platform for Collaborative Taxonomic Data Management. 2009 WRI World Congress on Computer Science and Information Engineering (pp. 731–735). Institute of Electrical and Electronics Engineers. doi:10.1109/CSIE.2009.948

[2] Morris, R. A., Dou, L., Hanken, J., Kelly, M., Lowery, D. B., Ludäscher, B., Macklin, J. A., et al. (2013). Semantic Annotation of Mutable Data. (I. N. Sarkar, Ed.) PLoS ONE, 8(11), e76093. doi:10.1371/journal.pone.0076093

[3] Councill, I. G., Li, H., Zhuang, Z., Debnath, S., Bolelli, L., Lee, W. C., Sivasubramaniam, A., et al. (2006). Learning metadata from the evidence in an on-line citation matching scheme. Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’06 (p. 276). Association for Computing Machinery. doi:10.1145/1141753.1141817