
Thursday, March 13, 2014

Publishing biodiversity data directly from GitHub to GBIF

[Google Earth image]
Today I managed to publish some data from a GitHub repository directly to GBIF. Within a few minutes (and with Tim Robertson on hand via Skype to debug a few glitches) the data was automatically indexed by GBIF and its maps updated. You can see the data I uploaded here.

The data I uploaded came from this paper:

Shapiro, L. H., Strazanac, J. S., & Roderick, G. K. (2006, October). Molecular phylogeny of Banza (Orthoptera: Tettigoniidae), the endemic katydids of the Hawaiian Archipelago. Molecular Phylogenetics and Evolution. Elsevier BV. doi:10.1016/j.ympev.2006.04.006
This is the data I used to build the geophylogeny for Banza using Google Earth. Prior to uploading this data, GBIF had no georeferenced localities for these katydids; now it has 21 occurrences:

[GBIF map showing the 21 Banza occurrences]

How it works

I give details of how I did this in the GitHub repository for the data. In brief, I took data from the appendix of the Shapiro et al. paper and created a Darwin Core Archive in a repository in GitHub. Mostly this involved messing with Excel to format the data. I used GBIF's registry API to create a dataset record, pointed it at the GitHub repository, and let GBIF do the rest. There were a few little hiccups, such as needing to tweak the meta.xml file that describes the data, and GBIF's assumption that specimens are identified by the infamous "Darwin Core Triplet", which meant I had to invent one for each occurrence, but other than that it was pretty straightforward.
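
If you want to try this yourself, the sketch below shows the two registry calls involved, using Python's requests library against the current GBIF API (api.gbif.org/v1). The credentials and the organisation/installation UUIDs are placeholders, and the exact set of required fields may differ from what's shown:

    import requests

    API = "https://api.gbif.org/v1"
    AUTH = ("username", "password")  # placeholder GBIF account with registry permissions

    # 1. Create the dataset record in the GBIF registry
    dataset = {
        "title": "Geocoded localities for Banza (Orthoptera: Tettigoniidae)",
        "type": "OCCURRENCE",
        "publishingOrganizationKey": "<organization-uuid>",  # placeholder
        "installationKey": "<installation-uuid>",            # placeholder
    }
    r = requests.post(API + "/dataset", json=dataset, auth=AUTH)
    r.raise_for_status()
    dataset_key = r.json()  # the registry returns the new dataset's UUID

    # 2. Add an endpoint pointing at the Darwin Core Archive on GitHub
    endpoint = {
        "type": "DWC_ARCHIVE",
        "url": "https://github.com/<user>/<repo>/raw/master/dwca.zip",  # placeholder
    }
    requests.post(API + "/dataset/%s/endpoint" % dataset_key,
                  json=endpoint, auth=AUTH).raise_for_status()
    # GBIF's crawler then fetches the archive and indexes the occurrences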

I've talked about using GitHub to help clean up Darwin Core Archives from GBIF, and VertNet are using GitHub as an issue tracker, but what I've done here differs in one crucial way. I'm not grabbing a file from GBIF and showing that it is broken (with no way to get those fixes back to GBIF), nor am I posting bug reports for data hosted elsewhere and hoping that someone will fix it (as VertNet does). Instead, I'm putting data on GitHub and having GBIF harvest that data directly. This means I can edit the data, rebuild the Darwin Core Archive file, push it to GitHub, and GBIF will reindex it and update the data on the GBIF portal.

Beyond nodes

GBIF's default publishing model is a federated one. Data providers in countries (such as museums and herbaria) digitise their data and make it available to national aggregators ("nodes"), which typically host a portal with information about the biodiversity of that nation (the Atlas of Living Australia is perhaps the most impressive example). These nodes then make the data available to GBIF, which provides a global portal to the world's biodiversity data (as opposed to national-level access provided by nodes).

This works well if you assume that most biodiversity data is held by national natural history collections, but this is debatable. There are other projects, some of them large and not necessarily "national", that have valuable data. These projects can join GBIF and publish their data. But what about all the data held in other databases (perhaps not conventionally thought of as biodiversity databases), or the huge amount of information in the published literature? How does that get into GBIF? People like me data mine the literature for information on specimens and localities (such as this map of localities mentioned in articles in BioStor). How do we get that data into GBIF?

[Map of localities mentioned in articles in BioStor]

Data publishing

Being able to publish data directly to GBIF makes putting the effort into publishing data seem less onerous, because I can see it appear in GBIF within minutes. Adding 21 records of katydids is clearly a drop in the ocean, but there is potentially vastly more data waiting to be mined. Managing the data on GitHub also makes the whole process of data cleaning and editing transparent. As ever, there are a couple of things that still need to be tackled.

It's who you know

I've been able to do this because I have links with GBIF, and they have made the (hopefully reasonable) assumption that I'm not going to publish just any old crap to GBIF. I still had to get "endorsed" by the UK node (being the chair of the GBIF Science Committee probably helped), and I'm lucky that Tim Robertson was online at the time and guided me through the process. None of this is terribly scalable. It would be nice if we had a way to open up GBIF to direct publishing, but with a review process built in (even if it's post-review, so that data may have to be pulled if it becomes clear it's problematic). Perhaps this could be managed via GitHub: data could be uploaded and managed there, and GBIF could then choose to pull a repository so that the data appears on GBIF. Another model is something like the Biodiversity Data Journal, but that doesn't (as far as I know) have a direct feed into GBIF.

Whichever approach we take, we need a simple, frictionless way to get data into GBIF, especially if we want to tackle the obvious geographic and taxonomic biases in the data GBIF currently has.

DOIs please

It would be great if I could get a DOI for this data set. I had toyed with putting it on Figshare, which would give me a DOI, but that just puts an additional layer between GitHub and GBIF. Ideally, instead of (or as well as) the UUID I get from GBIF to identify the dataset, I'd get a DOI that others can cite, and which would appear on my ORCID profile. I'd also want a way to link the data DOI to the DOI for the source paper (doi:10.1016/j.ympev.2006.04.006), so that citations of the data can pass some of that "link love" to the original authors. So, GBIF needs to mint DOIs for datasets.

Summary

The ability to upload data to GitHub and then have that harvested by GBIF is really exciting. We get great tools for managing changes in data, with a simple publication process (OK, simple if you know Tim, and can speak REST to the GBIF API). But we are getting closer to easy publishing and, just as importantly, easy editing and correcting data.




Friday, April 15, 2016

GBIF and impact: CrossRef, FundRef, and Altmetric

For anyone doing research or involved in scientific infrastructure, demonstrating the "impact" of those activities is becoming increasingly important. This has fostered a growth industry in "altmetrics", tools to track how research gathers attention outside academia (of course, we can argue whether attention is the same as impact).

For an organisation such as GBIF there's a clear need to show that it has impact on the field of biodiversity (and beyond), especially to its funders (which are ultimately national governments). To do this GBIF needs to track how its data is used by the research communities, both to do science and to inform policy. This is hard to do, especially if there's a limited culture of data citation. It occurs to me that another way to tackle this problem is to invert it by looking not at the impact of GBIF, but at GBIF as a source of impact.

For a moment let's replace GBIF with Wikipedia. We can ask "what is the impact of Wikipedia on the research community?" For example, Wikipedia is the 8th largest referrer of DOIs, which means that Wikipedia is a major source of traffic to academic publishing sites. All those Wikipedia pages which cite the primary literature are driving traffic to those articles.

Conversely, if we regard Wikipedia as important we can use citations of articles in Wikipedia pages as a measure of a researcher's impact. For example, according to Impact story I am "Wikitastic" as 11 Wikipedia pages cite articles that I am an author of (authorship is discovered by using my ORCID 0000-0002-7101-9767).

Likewise, Altmetric tracks citations on Wikipedia, so that a paper like the one below may have minimal social media impact but has the gray donut ring signifying that it's been cited on Wikipedia.

JENKINS, P. D., & ROBINSON, M. F. (2002, June). Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae). Bulletin of The Natural History Museum. Zoology Series. Cambridge University Press (CUP). doi:10.1017/S0968047002000018

Hence, we can look at Wikipedia in two different ways. The first is to ask "what is the impact of Wikipedia?", the second is to assume that Wikipedia has impact, and then use that as one measure of the impact of researchers (how "Wikitastic" you are).

So, let's go back to GBIF. Let's leave aside the question of whether GBIF has impact and imagine that we can use GBIF as a measure of impact ("GBIFtastic", sorry, that was unforgivable).

Example 1: From DOI to FundRef to GBIF

In a previous post I discussed the lack of mosquito data in GBIF and how I plugged this gap by using open data cited by a paper in eLife. This paper has the DOI 10.7554/elife.08347, and if I plug that into CrossRef's search engine I can get back some information on the funders of that paper:

Research funded by Sir Richard Southwood Graduate Scholarship | Rhodes Scholarships | National Institutes of Health (RAPIDD program, R01-AI069341, R01-AI091980, R01-GM08322, N01-A1-25489) | Wellcome Trust (#095066, Vecnet, #099872) | National Aeronautics and Space Administration (#NNX15AF36G) | Biotechnology and Biological Sciences Research Council | Bill and Melinda Gates Foundation (#OPP1053338, #OPP52250) | Studienstiftung des Deutschen Volkes | Directorate-General for Research and Innovation (#21803) | European Centre for Disease Prevention and Control (ECDC/09/018)

Now, this gives me a connection between funding agencies, a paper they funded, and the data in GBIF. For example, the Bill and Melinda Gates Foundation (doi:10.13039/100000865) funded doi:10.7554/elife.08347 which generated data in GBIF doi:10.15468/7apj8n.

I suspect that the Bill and Melinda Gates Foundation don't know that they've funded data gathering that has ended up in GBIF, but I suspect they'd be interested. Especially if that could be quantified (even better if we can demonstrate reuse). The process of linking funders to data can be largely automated, especially as more and more papers are now automatically linked to funder information. The link between publications and data in GBIF can be harder to establish, but at least one publisher (Pensoft) has established a direct feed from publication to GBIF.
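
As a sketch of how this funder discovery can be automated, the following queries CrossRef's public REST API for the funders of the eLife paper (the funder entries carry Open Funder Registry DOIs and award numbers where CrossRef has them):

    import requests

    # Fetch funder metadata for the eLife paper from CrossRef
    doi = "10.7554/elife.08347"
    work = requests.get("https://api.crossref.org/works/" + doi).json()["message"]

    for funder in work.get("funder", []):
        # "DOI" is the funder's Open Funder Registry identifier, when present
        print(funder.get("DOI"), funder["name"], funder.get("award", []))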

So, what if GBIF could computationally discover the funders of the data it holds, and could then communicate that to the funders? I think there's scope here for funders to take an interest in GBIF and its role in expanding the reuse (and hence impact) of data that funders have paid for. Demonstrating to governments that national funding agencies are supporting research that generates data that ends up in GBIF may help make the case that GBIF is worth supporting.

Example 2: GBIF as altmetric source

The little Altmetric donuts that we see on papers require sources of data, such as Twitter, Wikipedia, blogs, etc. For example, the Plant List dataset I recently put into GBIF has a DOI (doi:10.15468/btkum2) and this has received some attention, so it has an Altmetric donut (wouldn't it be nice if GBIF showed these on dataset pages?):

What if GBIF itself became a source that Altmetric scanned when measuring impact? What if having your papers mentioned in GBIF (for example, as a source of distributional data or a taxonomic name) contributed to the visible impact of that work? Wouldn't that encourage people to mobilise their data? Wouldn't that help people discover the wider conversation about the data and associated publications? Wouldn't that help generate more impact for papers that might otherwise gather less attention?

Summary

I realise that I've somewhat avoided the question of the impact of GBIF itself, which is something that also needs to be tackled (and this is one reason why GBIF assigns DOIs to datasets and downloads to support data citation), but I think that may be only a part of the bigger picture. If we assume GBIF is impactful to start with, then I think we can start to think how GBIF can help persuade researchers and funders that contributing to GBIF is a good thing.

Monday, June 04, 2018

Towards a biodiversity token: Bitcoin, FinTech, and a radical suggestion for the GBIF Challenge

First off, let me say that what follows is a lot of arm waving to try and obscure how little I understand what I'm talking about. I'm going to sketch out what I think is a "radical" idea for a GBIF Challenge entry.

TL;DR GBIF should issue its own cryptocurrency and use that to fund the development of the GBIF network by charging for downloads of cleaned, processed data (original provider data remains free). People can buy subscriptions to get access to data, and/or purchase GBIF currency as a contribution or investment. Proceeds from the purchase of cleaned data are divided between GBIF (to fund the portal), the data providers (to reward them for making data available), and the GBIF nodes in countries included in the geographic coverage of the data (to help them build their biodiversity infrastructure). The challenge entry would involve modelling this idea and conducting simulations to test its efficacy.

The motivation for this idea comes from several sources:

1. GBIF is (under-)funded by direct contributions from governments, hence each year it essentially "begs" for money. Several rich countries (such as the United Kingdom) struggle to pay the fairly paltry sums involved. Part of the problem is that they are taking something of demonstrable value (money) and giving it to an organisation (GBIF) which has no demonstrable financial value. Hence the argument for funding GBIF is basically "it's the right thing to do". This is not really a tenable or sustainable model.

2. Many web sites provide information for "free" in that the visitor doesn't pay any money. Instead the visitor views ads and, whether they are aware of it or not, hands over large amounts of data about themselves and their behaviour (think of the recent scandal involving Facebook).

3. Some people are rebelling against "free with ads" by seeking other ways to fund the web. For example, the Brave web browser enables you to buy BAT (Basic Attention Tokens, based on Ethereum). You can choose to send BAT to web sites that you visit (and hence find valuable). Those sites don't need to harvest your data or bombard you with ads to receive an income.

4. Cryptocurrency is being widely explored as a way to raise funding for new ventures. Many of these are tech-based, but there are some interesting developments in conservation and climate change, such as Veridium which offsets carbon emissions. There are links between efforts like Veridium and carbon offset programmes such as the Rimba Raya Biodiversity Reserve, so you can go from cryptocurrency to trees.

5. The rather ugly, somewhat patronising furore that erupted when Rwanda decided that the best way to increase its foreign currency earnings (as a step towards ultimately freeing itself from dependency on development aid) was to sign a sponsorship deal with Arsenal football club.

Now, imagine a situation where GBIF has a cryptocurrency token (e.g., the "GBIF coin"). Anyone, whether a country, an organisation, or an individual, can buy GBIF coins. If you want to download GBIF data, you will need to pay in GBIF coins, either per-download or via a monthly subscription. The proceeds from each download are split in a way that supports the GBIF network as a whole. For example, imagine GBIF itself gets 30% (like Apple's App Store). The remaining 70% gets split between (a) the data providers and (b) the GBIF nodes in countries included in the data download. For example, almost all the data on a country such as Rwanda does not come from Rwanda itself, but from other countries. You want to reward anyone who makes data available, but you also want to support the development of a biodiversity data infrastructure in Rwanda (or any other country), so part of the proceeds go to the GBIF node in Rwanda.

Now, an immediate issue (apart from the merits or otherwise of blockchains and cryptocurrency) is that I'm advocating charging for access to data, which seems antithetical to open access. To be clear, I think open access is crucial. I'm suggesting that we distinguish between two classes of data. The first is the data as it is provided to GBIF. That is almost always open data under a CC0 license, and that remains free. But if you want it for free it is served as it is received. In other words, for free access to data GBIF is essentially a dumb repository (like, say, Dryad). The data is there, you can search the metadata for each dataset, so essentially you get something like the current dataset search.

The other thing GBIF does is that it processes the data, cleaning it, reconciling names and locations, and indexing it, so that if you want to search for a given species, GBIF summarises the data across all the datasets and (often) presents you with a better result than if you'd downloaded all the original data and simply merged it together yourself. This is a valuable service, and it's one of the reasons why GBIF costs money to run. So imagine that we do something like this:

  1. It is free to browse GBIF as a person and explore the data
  2. It is free to download the raw data provided by any data publisher.
  3. It costs to download cleaned data that corresponds to a specific query, e.g. all records for a particular taxon, geographic area, etc.
  4. Payment for access to cleaned data is via the GBIF coin.
  5. The cost is small, on the scale of buying a music track or subscribing to Spotify.

Now, I don't expect GBIF to embrace this idea anytime soon. By nature it's a conservative, risk-averse organisation. But I think something like this idea deserves serious attention, ideally from people with a much better understanding of the issues than my own "I saw this on Twitter therefore it must be cool" level. One way to move forward would be to model how such a system would work, based for example on data on web site visits and data downloads on the current GBIF portal. I suspect models could be built to give some idea of whether such an approach would be financially viable. It occurs to me that something like this would make a great GBIF Challenge entry, particularly as it gives a license for thinking the unthinkable with no risk to GBIF itself.
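
To make the arithmetic concrete, here is a toy model of the proposed split; the 30% cut and the even division of the remainder between providers and nodes are the illustrative assumptions from above, not a worked-out pricing scheme:

    GBIF_CUT = 0.30  # assumed, like Apple's App Store

    def split_proceeds(price, provider_shares, node_countries):
        """provider_shares maps each provider to its fraction of the download."""
        pool = price * (1 - GBIF_CUT)
        provider_pool = node_pool = pool / 2  # assumed 50:50 split of the remainder
        payouts = {"GBIF": price * GBIF_CUT}
        for provider, share in provider_shares.items():
            payouts[provider] = provider_pool * share
        for country in node_countries:
            payouts["node:" + country] = node_pool / len(node_countries)
        return payouts

    # A download of Rwandan records supplied mostly by foreign museums
    print(split_proceeds(1.0, {"Museum A": 0.8, "Museum B": 0.2}, ["RW"]))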

Thursday, August 14, 2014

Seven percent of GBIF data is usable - quick thoughts on Hjarding et al. 2014

Update: Angelique Hjarding and her co-authors have responded in a guest post on iPhylo.

The quality and fitness for use of GBIF-mobilised data is a topic of interest to anyone that uses GBIF data. As an example, a recent paper on African chameleons comes to some rather alarming conclusions concerning the utility of GBIF data:

Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427

Here's the abstract (unfortunately the paper is behind a paywall):

The IUCN Red List of Threatened Species uses geographical distribution as a key criterion in assessing the conservation status of species. Accurate knowledge of a species’ distribution is therefore essential to ensure the correct categorization is applied. Here we compare the geographical distribution of 35 species of chameleons endemic to East Africa, using data from the Global Biodiversity Information Facility (GBIF) and data compiled by a taxonomic expert. Data screening showed 99.9% of GBIF records used outdated taxonomy and 20% had no locality coordinates. Conversely the expert dataset used 100% up-to-date taxonomy and only seven records (3%) had no coordinates. Both datasets were used to generate range maps for each species, which were then used in preliminary Red List categorization. There was disparity in the categories of 10 species, with eight being assigned a lower threat category based on GBIF data compared with expert data, and the other two assigned a higher category. Our results suggest that before conducting desktop assessments of the threatened status of species, aggregated museum locality data should be vetted against current taxonomy and localities should be verified. We conclude that available online databases are not an adequate substitute for taxonomic experts in assessing the threatened status of species and that Red List assessments may be compromised unless this extra step of verification is carried out.

The authors used two data sets, one from GBIF, the other provided by an expert to compute the conservation status for each chameleon species endemic to Kenya and/or Tanzania. After screening the GBIF data for taxonomic and geographic issues, a mere 7% of the data remained - 93% of the 2304 records downloaded from GBIF were discarded.

This study raises a number of questions, some of which I will touch on here. Before doing so, it's worth noting that neither of the two data sets used in this study (the data downloaded from GBIF, and the expert data set assembled by Colin Tilbury) is provided by the authors, so our ability to further explore the results is limited. This is a pity, especially now that citable data repositories such as Dryad and Figshare are available. The value of this paper would have been enhanced if both datasets had been archived.

Below is Table 1 from the paper, "Museums from which locality records for East African chameleons were obtained for the expert and GBIF datasets":

Museum | Expert dataset | GBIF
Afrika Museum, The Netherlands | x |
American Museum of Natural History, USA | | x
Bishop Museum, USA | | x
British Museum of Natural History, UK | x |
Brussels Museum of Natural Sciences, Belgium | x |
California Academy of Sciences, USA | | x
Ditsong Museum, South Africa | x | x
Los Angeles County Museum of Natural History, USA | | x
Museum für Naturkunde, Germany | | x
Museum of Comparative Zoology (Harvard University), USA | | x
Naturhistorisches Museum Wien, Austria | x |
Smithsonian Institution, USA | | x
South African Museum, South Africa | x |
Trento Museum of Natural Sciences, Italy | x |
University of Dar es Salaam, Tanzania | x |
Zoological Research Museum Alexander Koenig, Germany | x |


It is striking that there is virtually no overlap between the data sources available to GBIF and the sources used by the expert. Some of the museums have no presence in GBIF, including some major collections (I'm looking at you, The Natural History Museum), while others contribute to GBIF, but not their herpetology specimens. So, GBIF has some work to do in mobilising more data (Why is this data not in GBIF? What are the impediments to that happening?). Then there are museums that have data in GBIF, but not in a form useful for this study. For example, the American Museum of Natural History has 327,622 herpetology specimens in GBIF, but not one of these is georeferenced! Given that there are georeferenced records in GenBank for AMNH specimens, I suspect that the AMNH has deliberately not made geographic coordinates available, which raises the obvious question: why?

GBIF coverage


I had a quick look at GBIF to get some idea of the geographic coverage of the relevant herpetology collections (or animal collections if herps weren't separated out). Below are maps for some of these collections. The AMNH is empty, as is the smaller Zoological Research Museum Alexander Koenig collection (which supplied some of the expert data).

American Museum of Natural History, USA

Bishop Museum, USA

California Academy of Sciences, USA

Ditsong Museum, South Africa

Los Angeles County Museum of Natural History, USA

Museum für Naturkunde, Germany

Museum of Comparative Zoology (Harvard University), USA

Smithsonian Institution, USA

Zoological Research Museum Alexander Koenig, Germany


Some collections are relevant, such as the California Academy of Sciences, but a number of the collections in GBIF simply don't have georeferenced data on chameleons. Then there are several museums that are listed as sources for the expert database and which contribute to GBIF, but haven't digitised their herp collections, or haven't made these available to GBIF.

Taxonomy


The other issue encountered by Hjarding et al. (2014) is that the GBIF taxonomy for chameleons is out of date (2302 of the 2304 GBIF-sourced records needed their taxonomy updated). Chameleons are a fairly small group, and it's not as if there are hundreds of new species being discovered each year (see the timeline in BioNames; 2006 was a bumper year with 12 new taxonomic names added). But there has been a lot of recent phylogenetic work which has clarified relationships, and as a result species get shuffled around different genera, resulting in a plethora of synonyms. GBIF's taxonomy has lagged behind current research, and also manages to horribly mangle the chameleon taxonomy it does have. For example, the genus Trioceros is not even placed within the chameleon family Chamaeleonidae but is simply listed as a reptile, which means anyone searching for data on the family Chamaeleonidae will miss all the Trioceros species.

Summary


The use case for this study seems one of the most basic that GBIF should be able to meet - given some distributions of organisms, compute an assessment of their conservation status. That GBIF-mobilised data is so patently not up to the task in this case is cause for concern.

However, I don't see this as simply a case of expert data versus GBIF data; I think it's more complicated than that. A big issue here is data availability, and also the extent of data release (assuming that the AMNH is actively withholding geographic coordinates for some, if not most, of its specimens). GBIF should be asking those museums that provide data why they've not made georeferenced data available, and if it's because the museums simply haven't been able to do this, then how can it help that process? It should also be asking why museums which are part of GBIF haven't mobilised their herpetology data, and again, what can it do to help? Lastly, in an age of rapid taxonomic change driven by phylogenetic analysis, GBIF needs to overhaul the glacial pace at which it incorporates new taxonomic information.

Tuesday, February 21, 2012

Linking GBIF and Genbank

As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy. Specimen codes are not unique, they are written in all sorts of ways, and there are multiple codes for the same specimen (GenBank sequences may be associated with museum catalogue entries, or with field or collector numbers).

So why undertake what is fast looking like a hopeless task? There are several reasons:
  1. GBIF occurrences have a unique URL which we could potentially use as a unique, resolvable identifier for the corresponding specimen.
  2. Linking GenBank to GBIF would make it possible for GBIF to list sequences associated with a specimen, as well as the associated publication, which means we could demonstrate the "impact" of a specimen. In the simplest terms this could be the number of sequences and publications that use data from the specimen, more sophisticated approaches could use PageRank-like measures, see hdl:10101/npre.2008.1760.1.
  3. Having a unique identifier that is shared across different databases makes it easier to combine data from different sources. For example, if a sequence in GenBank lacks geographic coordinates but the voucher specimen in GBIF is georeferenced, we can use that information to locate the sequence in geographic space (and hence build geophylogenies or add spatial indexes to databases such as TreeBASE). Conversely, if the GenBank sequence is georeferenced but the GBIF record isn't, we can update the GBIF record and possibly expand the range of the corresponding taxon (this was part of the motivation behind hdl:10101/npre.2009.3173.1).
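
As a flavour of what the matching involves, here is a deliberately naive sketch that parses a voucher string and looks it up with today's GBIF occurrence search API; the single regular expression stands in for the many patterns real voucher codes would need:

    import re
    import requests

    # Naive "Darwin Core triplet" parser: institution code, optional collection
    # code, catalogue number. Real voucher strings need many more patterns.
    TRIPLET = re.compile(r"^([A-Z]+)[:\s]+([A-Za-z]*)\s*([\w\-\.]+)$")

    def find_occurrence(voucher):
        m = TRIPLET.match(voucher.strip())
        if not m:
            return None
        institution, _, catalogue = m.groups()
        r = requests.get("https://api.gbif.org/v1/occurrence/search",
                         params={"institutionCode": institution,
                                 "catalogNumber": catalogue})
        results = r.json()["results"]
        return results[0]["key"] if results else None

    print(find_occurrence("USNM 514547"))  # the voucher for EU443175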

As an example, below is the GBIF 1° density map for the frog Pristimantis ridens, with the phylogeny from Wang et al. (Phylogeography of the Pygmy Rain Frog (Pristimantis ridens) across the lowland wet forests of isthmian Central America, http://dx.doi.org/10.1016/j.ympev.2008.02.021) layered over it. I created the KML tree from the corresponding tree in TreeBASE using the tool I described earlier. You can grab the KML for the tree here.

[GBIF 1° density map for Pristimantis ridens with the KML tree overlaid]

As we'd expect, there is a lot of overlap between the two sources of data. If we investigate further, there are records that are in fact based on the same specimen. For example, if we download the GBIF KML file with individual placemarks, we see that in the northern part of the range there are 15 GBIF occurrences that map onto the same point as one of the terminal taxa in the tree.

[Google Earth view: 15 GBIF occurrences coincide with a terminal taxon in the tree]

One of these 15 GBIF records (http://data.gbif.org/occurrences/244335848) is for specimen USNM 514547, which is the voucher specimen for EU443175. This gives us a link between the record in GBIF and the record in GenBank. It also gives us a URI we can use for the specimen http://data.gbif.org/occurrences/244335848 instead of the unresolvable and potentially ambiguous USNM 514547.

If we view the geophylogeny from a different vantage point we see numerous localities that don't have occurrences in GBIF.

[Google Earth view: tree localities with no corresponding GBIF occurrences]

Close inspection reveals that some of the specimens listed in the Wang et al. paper are actually in GBIF, but lack geographic coordinates. For example the OTU "Pristimantis ridens Nusagandi AJC 0211" has the voucher specimen FMNH 257697. This specimen is in GBIF as http://data.gbif.org/occurrences/57919777/, but without coordinates, so it doesn't appear on the GBIF map. However, both the Wang et al. paper and the GenBank record for the sequence from this specimen EU443164 give the latitude and longitude. In this example, GBIF gives us a unique identifier for the specimen, and GenBank provides data on location that GBIF lacks.

Part of GBIF's success is due to the relative ease of integrating data by taxonomic names (despite the problems caused by synonyms, homonyms, misspellings, etc.) or by spatial coordinates (which immediately enable integration with environmental data). But if we want to integrate at deeper levels, then specimen records are the glue that connects GBIF (and its contributing data sources) to sequence databases, phylogenies, and the taxonomic literature (via lists of material examined). This will not be easy, certainly for legacy data that cites ambiguous specimen codes, but I would argue that the potential rewards are great.

Thursday, October 26, 2023

Where are the plant type specimens? Mapping JSTOR Global Plants to GBIF

How to cite: Page, R. (2023). Where are the plant type specimens? Mapping JSTOR Global Plants to GBIF. https://doi.org/10.59350/m59qn-22v52

This blog post documents my attempts to create links between two major resources for plant taxonomy: JSTOR’s Global Plants and GBIF, specifically between type specimens in JSTOR and the corresponding occurrence in GBIF. The TL;DR is that I have tried to map 1,354,861 records for type specimens from JSTOR to the equivalent record in GBIF, and managed to find 903,945 (67%) matches.

Why do this?

Why do this? Partly because a collaborator asked me, but I’ve long been interested in JSTOR’s Global Plants. This was a massive project to digitise plant type specimens all around the world, generating millions of images of herbarium sheets. It also resulted in a standardised way to refer to a specimen, namely its barcode, which comprises the herbarium code and a number (typically padded to eight digits). These barcodes are converted into JSTOR URLs, so that E00279162 becomes https://plants.jstor.org/stable/10.5555/al.ap.specimen.e00279162. These same barcodes have become the basis of efforts to create stable identifiers for plant specimens, for example https://data.rbge.org.uk/herb/E00279162.
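
A small sketch of the barcode conventions just described; the split into herbarium code plus zero-padded number follows the RBGE example above and is not guaranteed to hold for every herbarium:

    import re

    def split_barcode(barcode):
        # herbarium code followed by a zero-padded number
        m = re.match(r"^([A-Z]+)(\d+)$", barcode)
        return m.groups() if m else None

    def jstor_url(barcode):
        return ("https://plants.jstor.org/stable/10.5555/al.ap.specimen."
                + barcode.lower())

    def rbge_stable_id(barcode):
        return "https://data.rbge.org.uk/herb/" + barcode

    print(split_barcode("E00279162"))   # ('E', '00279162')
    print(jstor_url("E00279162"))       # the JSTOR URL shown above
    print(rbge_stable_id("E00279162"))  # the RBGE stable identifier shown above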

JSTOR created an elegant interface to these specimens, complete with links to literature on JSTOR, BHL, and links to taxon pages on GBIF and elsewhere. It also added the ability to comment on individual specimens using Disqus.

However, JSTOR Global Plants is not open. If you click on a thumbnail image of a herbarium sheet you hit a paywall.

In contrast data in GBIF is open. The table below is a simplified comparison of JSTOR and GBIF.

Feature | JSTOR | GBIF
Open or paywall | Paywall | Open
Consistent identifier | Yes | No
Images | All specimens | Some specimens
Types linked to original name | Yes | Sometimes
Community annotation | Yes | No
Can download the data | No | Yes
API | No | Yes

JSTOR offers a consistent identifier (the barcode), an image, has the type linked to the original name, and community annotation. But there is a paywall, and no way to download data. GBIF is open, enables both bulk download and API access, but often lacks images, and as we shall see below, the identifiers for specimens are a hot mess.

The “Types linked to original name” feature concerns whether the type specimen is connected to the appropriate name. A type is (usually) the type specimen for a single taxonomic name. For example, E00279162 is the type for Achasma subterraneum Holttum. This name is now regarded as a synonym of Etlingera subterranea (Holttum) R. M. Sm. following the transfer to the genus Etlingera. But E00279162 is not a type for the name Etlingera subterranea. JSTOR makes this clear by stating that the type is stored under Etlingera subterranea but is the type for Achasma subterraneum. However, this information does not make it to GBIF, which tells us that E00279162 is a type for Etlingera subterranea and that it knows of no type specimens for Achasma subterraneum. Hence querying GBIF for type specimens is potentially fraught with error.

Hence JSTOR often has cleaner and more accurate data. But it is behind a paywall. So I set out to get a list of all the type specimens that JSTOR has, and to try to match those to GBIF. This would give me a sense of how much content behind JSTOR's paywall is freely available in GBIF, as well as how much content JSTOR has that is absent from GBIF. I also wanted to use JSTOR's reference to the original plant name to get around GBIF's tendency to link types to the wrong name.

Challenges

Mapping JSTOR barcodes to records in GBIF proved challenging. In an ideal world specimens would have a single identifier that everyone would use when citing or otherwise referring to that specimen. Of course this is not the case. There are all manner of identifiers, ranging from barcodes, collector names and numbers, local database keys (integers, UUIDs, and anything in between). Some identifiers include version codes. All of this greatly complicates linking barcodes to GBIF records. I made extensive use of my Material examined tool that attempts to translate specimen codes into GBIF records. Under the hood this means lots of regular expressions, and I spent a lot of time adding code to handle all the different ways herbaria manage to mangle barcodes.
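
Here is a toy version of that normalisation step; the mangling patterns handled are illustrative, and the real tool accumulates many more special cases:

    import re

    # A few ways herbaria mangle barcodes, normalised back to
    # herbarium code + digits zero-padded to a fixed width
    EXAMPLES = ["E 00279162", "e00279162", "E-00279162", "E00279162.1"]

    def normalise(raw, width=8):
        s = re.sub(r"[\s\-]", "", raw.upper())  # drop spaces and hyphens
        s = re.sub(r"\.\d+$", "", s)            # drop trailing version suffixes
        m = re.match(r"^([A-Z]+)0*(\d+)$", s)
        if not m:
            return None
        herb, number = m.groups()
        return herb + number.zfill(width)

    print([normalise(x) for x in EXAMPLES])  # all four become 'E00279162'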

In some cases JSTOR barcodes are absent from the specimen information in the GBIF occurrence record itself but are hidden in metadata for the image (such as the URL to the image). My “Material examined” tool uses the GBIF API, and that doesn’t enable searches for parts of image URLs. Hence for some herbaria I had to download the archive, extract media URLs and look for barcodes. In the process I encountered a subtle bug in Safari that truncated downloads, see Downloads failing to include all files in the archive.

Some herbaria have data in both JSTOR and GBIF, but no identifiers in common (other than collector names and numbers, which would require approximate string matching). But in some cases the herbaria have their own web sites which mention the JSTOR barcodes, as well as the identifiers those herbaria do share with GBIF. In these cases I would attempt to scrape the herbaria web sites, extract the barcode and original identifier, then find the original identifier in GBIF.

Another observation is that in some cases the imagery in JSTOR is not the same as in GBIF. For example, LISC002383 and 813346859 are the same specimen but the images are different. Why are the images provided to JSTOR not being provided to GBIF?

In the process of making this mapping it became clear that there are herbaria that aren’t in GBIF, for example Singapore (SING) is not in GBIF but instead is hosted at Oxford University (!) at https://herbaria.plants.ox.ac.uk/bol/sing. There seem to be a number of herbaria that have content in JSTOR but not GBIF, hence GBIF has gaps in its coverage of type specimens.

Interestingly, JSTOR rarely seems to be a destination for links. An exception is the Paris museum; for example, specimen MPU015018 has a link to the JSTOR record for the same specimen, MPU015018.

Matching taxonomic names

As a check on matching JSTOR to GBIF I would also check that the taxonomic names associated with the two records are the same. The challenge here is that the names may have changed. Ideally both JSTOR and GBIF would have either a history of name changes, or at least the original name the specimen was associated with (i.e., the name for which the specimen is the type). And of course, this isn't the case. So I relied on a series of name comparisons, such as "are the names the same?", "if the names are different, are the specific epithets the same?", and "if the specific epithets are different, are the generic names the same?". Because the spelling of species names can change depending on the gender of the genus, I also used some stemming rules to catch names that were the same even if their endings were different.
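
A sketch of that comparison cascade; the list of gender endings is a crude simplification of the stemming rules actually needed:

    # Gender endings to strip; a crude simplification of real stemming rules
    ENDINGS = sorted(["us", "a", "um", "is", "e", "os", "on"], key=len, reverse=True)

    def stem(epithet):
        for ending in ENDINGS:
            if epithet.endswith(ending):
                return epithet[:-len(ending)]
        return epithet

    def compare(name1, name2):
        if name1 == name2:
            return "identical"
        (g1, s1), (g2, s2) = name1.split()[:2], name2.split()[:2]
        if s1 == s2 or stem(s1) == stem(s2):
            return "same epithet, different genus" if g1 != g2 else "same species"
        if g1 == g2:
            return "same genus, different epithet"
        return "different"

    # The example from above: a transfer from Achasma to Etlingera
    print(compare("Achasma subterraneum", "Etlingera subterranea"))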

This approach will still miss some matches, such as hybrid names, and cases where a specimen is stored under a completely different name (e.g., the original name is a heterotypic synonym of a different name).

Mapping

The mapping made so far is available on GitHub https://github.com/rdmpage/jstor-plant-specimens and Zenodo https://doi.org/10.5281/zenodo.10044359.

At the time of writing I have retrieved 1,354,861 records for type specimens from JSTOR, of which 903,945 (67%) have been matched to GBIF.

This has been a sobering lesson in just how far we are from being able to treat specimens as citable things: we simply don't have decent identifiers for them. JSTOR made a lot of progress, but that has been hampered by being behind a paywall, and by the fact that many of these identifiers are being lost or mangled by the time they make their way into GBIF, which is arguably where most people get information on specimens.

There’s an argument that it would be great to get JSTOR Global Plants into GBIF. It would certainly add a lot of extra images, and also provide a presence for a number of smaller herbaria that aren’t in GBIF. I think there’s also a case to be made for having a GBIF hosted portal for plant type specimens, to help make these valuable objects more visible and discoverable.

Below is a barchart of the top 50 herbaria ranked by number of type specimens in JSTOR, showing the numbers of specimens mapped to GBIF (red) and those not found (blue).

Reading

  • Boyle, B., Hopkins, N., Lu, Z. et al. The taxonomic name resolution service: an online tool for automated standardization of plant names. BMC Bioinformatics 14, 16 (2013). https://doi.org/10.1186/1471-2105-14-16

  • CETAF Stable Identifiers (CSI)

  • CETAF Specimen URI Tester

  • Holttum, R. E. (1950). The Zingiberaceae of the Malay Peninsula. Gardens’ Bulletin, Singapore, 13(1), 1-249. https://biostor.org/reference/163926

  • Hyam, R.D., Drinkwater, R.E. & Harris, D.J. Stable citations for herbarium specimens on the internet: an illustration from a taxonomic revision of Duboscia (Malvaceae) Phytotaxa 73: 17–30 (2012). https://doi.org/10.11646/phytotaxa.73.1.4

  • Rees T (2014) Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases. PLoS ONE 9(9): e107510. https://doi.org/10.1371/journal.pone.0107510

  • Ryan D (2018) Global Plants: A Model of International Collaboration . Biodiversity Information Science and Standards 2: e28233. https://doi.org/10.3897/biss.2.28233

  • Ryan, D. (2013), THE GLOBAL PLANTS INITIATIVE CELEBRATES ITS ACHIEVEMENTS AND PLANS FOR THE FUTURE. Taxon, 62: 417-418. https://doi.org/10.12705/622.26

  • (2016), Global Plants Sustainability: The Past, The Present and The Future. Taxon, 65: 1465-1466. https://doi.org/10.12705/656.38

  • Smith, G.F. and Figueiredo, E. (2013), Type specimens online: What is available, what is not, and how to proceed; Reflections based on an analysis of the images of type specimens of southern African Polygala (Polygalaceae) accessible on the worldwide web. Taxon, 62: 801-806. https://doi.org/10.12705/624.5

  • Smith, R. M. (1986). New combinations in Etlingera Giseke (Zingiberaceae). Notes from the Royal Botanic Garden Edinburgh, 43(2), 243-254.

  • Anna Svensson; Global Plants and Digital Letters: Epistemological Implications of Digitising the Directors’ Correspondence at the Royal Botanic Gardens, Kew. Environmental Humanities 1 May 2015; 6 (1): 73–102. doi: https://doi.org/10.1215/22011919-3615907


Friday, April 15, 2016

The Zika virus, GBIF, and the missing mosquitoes

One of GBIF's goals is to provide up-to-date, comprehensive data on the distribution of species. Although GBIF's taxonomic and geographic scope is global, not all species are equal, in the sense that the need for information on some species is potentially much more pressing. An example is mosquitoes of the genus Aedes, such as the species A. aegypti and A. albopictus that spread the Zika virus.

Over the last few days I discovered how poor GBIF's coverage of these two vectors is, and found a way to fix that gap quickly. Like many things I work on, I stumbled across the problem by accident. GBIF has released a report on whether GBIF data are fit for modelling species distributions. The publicity material included a psychedelic image showing a map for Aedes aegypti from a recent eLife paper by Kraemer et al. (The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus, http://doi.org/10.7554/elife.08347).

[Detail from the global Aedes aegypti distribution map of Kraemer et al. 2015]

Curious, I read the paper and the phrase "GBIF" occurs only once in the text:

we selected 10,000 occurrence records of Aedes species from the Global Biodiversity Information Facility (http://www.gbif.org), omitting all records of Ae. aegypti and Ae. albopictus. This dataset is intended to reflect biases in mosquito reporting in areas which are suitable for Aedes mosquitoes.

So, GBIF data on these two mosquitoes wasn't used. A quick look at what GBIF had for Aedes albopictus shows why GBIF data played such a small role:

[GBIF occurrence map for Aedes albopictus before the upload]

Compare this with the data shown in the Scientific Data paper (http://doi.org/10.1038/sdata.2015.35) on the data that underpins the eLife paper.

[Figure 3 from Kraemer et al. 2015 (Scientific Data), showing the compiled occurrence data]

Note the striking lack of any GBIF records from Brazil. Fortunately the data collected by Kraemer et al. are freely available in Dryad (http://doi.org/10.5061/dryad.47v3c), so I grabbed the files, fussed about with them a bit (https://github.com/rdmpage/global-distribution-arbovirus-vectors) to get them into the format required by GBIF, and uploaded them. Below is the data for Aedes albopictus in GBIF:

[GBIF occurrence map for Aedes albopictus after the upload]

This is looking more like it! If you are more interested in Aedes aegypti then that data is also available.
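
For anyone curious what the fussing about involved, the sketch below shows the kind of reshaping needed to turn a coordinates file into a Darwin Core occurrence table; the input column names are assumptions, so check them against the actual Dryad files:

    import csv

    # Reshape a coordinates file into a Darwin Core occurrence table
    with open("aegypti.csv") as src, open("occurrence.txt", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, delimiter="\t", fieldnames=[
            "occurrenceID", "scientificName", "decimalLatitude",
            "decimalLongitude", "basisOfRecord"])
        writer.writeheader()
        for i, row in enumerate(reader):
            writer.writerow({
                "occurrenceID": "aegypti-%d" % i,  # invented: GBIF needs an id per record
                "scientificName": "Aedes aegypti",
                "decimalLatitude": row["Y"],       # assumed column name
                "decimalLongitude": row["X"],      # assumed column name
                "basisOfRecord": "HUMAN_OBSERVATION",
            })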

Questions

This example raises a number of questions:

  1. How come GBIF had such poor data to start with? If GBIF is going to be relevant to people who need biodiversity data, in some cases urgently, then there's an argument to be made that GBIF should be targeting species such as disease vectors that are likely to be in demand in the future.
  2. Why wasn't the latest data in GBIF? One reason GBIF's data was poor is that the relevant data was widely scattered in the literature (Kraemer et al. list over 1000 papers that they looked at, not including the unpublished sources). This clearly requires a lot of effort to assemble. But once assembled, why wasn't it deposited in GBIF? Is it a case of researchers not thinking this would be a useful thing to do, or not knowing how to do it?
  3. What about all the other data out there? This particular example was prompted by me wondering what that hideous image on the GBIF post was, reading the eLife article, wondering where the data was, and having sufficient access to GBIF to simply upload the data. This is clearly not a scalable approach. How can we improve this process? Can we automate harvesting relevant data from repositories such as Dryad so that this data gets fed into GBIF automatically? Should GBIF become a data repository itself so authors can store their data there? And how do we retrospectively harvest all the rest of the data languishing in the scientific literature?

Side note

One aspect of the Kraemer et al. data I've not focussed on is that it is derived from the literature, most of it unpublished, but some in the primary literature (the list of papers is missing from the Dryad repository, but I obtained a copy from Moritz Kraemer (@MOUGK) and it's now on GitHub). This means we can link individual occurrence records back to the evidence for that occurrence (i.e., the paper that made the assertion that this species of mosquito is found at this locality). This means we can (a) provide provenance for the data, and (b) provide credit to the authors of that observation. I hope to explore this topic in a subsequent blog post.

References

Kraemer, M. U. G., Sinka, M. E., Duda, K. A., Mylne, A., Shearer, F. M., Brady, O. J., … Hay, S. I. (2015, July 7). The global compendium of Aedes aegypti and Ae. albopictus occurrence. Scientific Data. Nature Publishing Group. http://doi.org/10.1038/sdata.2015.35

Kraemer, Moritz U. G., Sinka, Marianne E., Duda, Kirsten A., Mylne, Adrian, Shearer, Freya M., Brady, Oliver J., … Hay, Simon I. (2015). Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence. Dryad Digital Repository. http://doi.org/10.5061/dryad.47v3c

Kraemer, M. U., Sinka, M. E., Duda, K. A., Mylne, A. Q., Shearer, F. M., Barker, C. M., … Hay, S. I. (2015, June 30). The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus. eLife. eLife Sciences Organisation, Ltd. http://doi.org/10.7554/elife.08347

Wednesday, January 28, 2015

Annotating GBIF, from datasets to nanopublications

Below I sketch what I believe is a straightforward way GBIF could tackle the issue of annotating and cleaning its data. It continues a series of posts Annotating GBIF: some thoughts, Rethinking annotating biodiversity data, and More on annotating biodiversity data: beyond sticky notes and wikis on this topic.

Let's simplify things a little and state that GBIF at present is essentially an aggregation of Darwin Core Archive files. These are for the most part simply CSV tables (spreadsheets) with some associated administrivia (AKA metadata). GBIF consumes Darwin Core Archives, does some post-processing to clean things up a little, then indexes the contents on key fields such as catalogue number, taxon name, and geographic coordinates.

What I'm proposing is that we make use of this infrastructure, in that any annotation is itself a Darwin Core Archive file that GBIF ingests. I envisage three typical use cases:

  1. A user downloads some GBIF data, cleans it for their purposes (e.g., by updating taxonomic names, adding some georeferencing, etc.) then uploads the edited data to GBIF as a Darwin Core Archive. This edited file gets a DOI (unless the user has got one already, say by storing the data in a digital archive like Zenodo).
  2. A user takes some GBIF data and enhances it by adding links to, for example, sequences in GenBank for which the GBIF occurrences are voucher specimens, or references which cite those occurrences. The enhanced data set is uploaded to GBIF as a Darwin Core Archive and, as above, gets a DOI.
  3. A user edits an individual GBIF record, say using an interface like this. The result is stored as a Darwin Core Archive with a single row (corresponding to the edited occurrence), and gets a DOI (this is a nanopublication, of which more later).

Note that I'm ignoring the other type of annotation, which is to simply say "there is a problem with this record". This annotation doesn't add data, but instead flags an issue. GBIF has a mechanism for doing this already, albeit one that is deeply unsatisfactory and isn't integrated with the portal (you can't tell whether anyone has raised an issue for a record).

Note also that at this stage we've done nothing that GBIF doesn't already do, or isn't about to do (e.g., minting DOIs for datasets). Now, there is one inevitable consequence of this approach, namely that we will have more than one record for the same occurrence, the original one in GBIF, and the edited record. But, we are in this situation already. GBIF has duplicate records, lots of them.

Duplication

As an example, consider the following two occurrences for Psilogramma menephron:

occurrence | taxon | longitude | latitude | catalogue number | sequence
887386322 | Psilogramma menephron Cramer, 1780 | 145.86301 | -17.44 | BC ZSM Lep 01337 |
1009633027 | Psilogramma menephron Cramer, 1780 | 145.86 | -17.44 | KJ168695 | KJ168695

These two occurrences come from the Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data and the Geographically tagged INSDC sequences datasets, respectively. They are for the same occurrence (you can verify this by looking at the metadata for the sequence KJ168695, where the specimen_voucher field is "BC ZSM Lep 01337").

What do we do about this? One approach would be to group all such occurrences into clusters that represent the same thing. We are then in a position to do some interesting things, such as compare different estimates of the same values. In the example above, there is clearly a difference in precision of geographic locality between the two datasets. There are some nice techniques available for synthesising multiple estimates of the same value (e.g., Bayesian belief networks), so we could provide for each cluster a summary of the possible values for each field. We can also use these methods to build up a picture of the reliability of different sources of annotation.
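
A minimal sketch of such clustering, using the two Psilogramma records above; here the INSDC record's specimen_voucher has already been mapped to the shared catalogue number, a normalisation step a real implementation would have to do first:

    from collections import defaultdict

    def cluster_by_voucher(records):
        clusters = defaultdict(list)
        for rec in records:
            clusters[rec["catalogNumber"]].append(rec)
        return clusters

    def coordinate_spread(members):
        lats = [r["decimalLatitude"] for r in members]
        lons = [r["decimalLongitude"] for r in members]
        return max(lats) - min(lats), max(lons) - min(lons)

    records = [
        {"occurrence": "887386322", "catalogNumber": "BC ZSM Lep 01337",
         "decimalLatitude": -17.44, "decimalLongitude": 145.86301},
        {"occurrence": "1009633027", "catalogNumber": "BC ZSM Lep 01337",
         "decimalLatitude": -17.44, "decimalLongitude": 145.86},
    ]
    for voucher, members in cluster_by_voucher(records).items():
        # same specimen, two records, differing only in coordinate precision
        print(voucher, len(members), coordinate_spread(members))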

In a sense, we can regard one record (1009633027) as adding an annotation to the other (887386322), namely adding the DNA sequence KJ168695 (in Darwin Core parlance, "associatedSequences=[KJ168695]").

But the key point here is that GBIF will have to at some point address the issue of massive duplication of data, and in doing so it will create an opportunity to solve the annotation problem as well.

Github and DOIs

In terms of practicalities, it's worth noting that we could use github to manage editing GBIF data, as I've explored in GBIF and Github: fixing broken Darwin Core Archives. Although github might not be ideal (there are some very cool alternatives being developed, such as dat, see also this interview with Max Ogden), it has the nice feature that you can publish a release and get a DOI via its integration with Zenodo. So people can work on datasets and create citable identifiers at the same time.

Nanopublications

If we consider that a Darwin Core Archive is basically a set of rows of data, then the minimal unit is a single row (corresponding to a single occurrence). This is the level at which some users will operate. They will see an error in GBIF and be able to edit the record (e.g., by adding georeferencing, an identification, etc.). One challenge is how to create incentives for doing this. One approach is to think in terms of nanopublications, which are:
A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
A nanopublication comprises three elements:
  1. The assertion: In this context the Darwin Core record would be the assertion. It might be a minimal record in that, say, it only listed the fields relevant to the annotation.
  2. The provenance: the evidence for the assertion. This might be the DOI of a publication that supports the annotation.
  3. The publication information: metadata for the nanopublication, including a way to cite the nanopublication (such as a DOI), and information on the author of the nanopublication. For example, the ORCID of the person annotating the GBIF record.

As an example, consider GBIF occurrence 668534424 for specimen FMNH 235034, which according to GBIF is a specimen of Rhacophorus reinwardtii. In a recent paper

Matsui, M., Shimada, T., & Sudin, A. (2013, August). A New Gliding Frog of the Genus Rhacophorus from Borneo . Current Herpetology. Herpetological Society of Japan. doi:10.5358/hsj.32.112
Matsui et al. assert that FMNH 235034 is actually Rhacophorus borneensis based on a phylogenetic analysis of a sequence (GQ204713) derived from that specimen. In which case, we could have something like this:
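
A minimal sketch of how such a nanopublication might look when flattened into Darwin Core-style fields; the three-part structure follows the nanopublication model, the identifiers are those given above, and field names outside Darwin Core ("supportedBy" etc.) are illustrative:

    # The three parts of the nanopublication, flattened into simple fields
    nanopub = {
        "assertion": {
            "occurrenceID": "668534424",               # the GBIF occurrence
            "catalogNumber": "FMNH 235034",
            "scientificName": "Rhacophorus borneensis",
            "associatedSequences": ["GQ204713"],
        },
        "provenance": {
            "supportedBy": "doi:10.5358/hsj.32.112",   # Matsui et al. 2013
        },
        "publicationInfo": {
            "author": "https://orcid.org/0000-0002-7101-9767",
            "identifier": "<DOI minted for this nanopublication>",
        },
    }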

The nanopublication standard is evolving, and has a lot of RDF baggage that we'd need to simplify to fit the Darwin Core model of a flat row of data, but you could imagine having a nanopublication which is a Darwin Core Archive that includes the provenance and publication information, and gets a citable identifier so that the person who created the nanopublication (in the example above I am the author of the nanopublication) can get credit for the work involved in creating the annotation. Using citable DOIs and ORCIDs to identify the nanopublication and its author embeds the nanopublication in the wider citation graph.

Note that nanopublications are not really any different from larger datasets; indeed we can think of a dataset of, say, 1000 rows as simply an aggregation of nanopublications. However, one difference is that I think GBIF would have to set up the infrastructure to manage the creation of nanopublications (which is basically collecting the user's input, adding their user id, saving it, and minting a DOI). Whereas users working with large datasets may well be happy to work with those on, say, github or some other data editing environment, people willing to edit single records are unlikely to want to mess with that complexity.

What about the original providers?

Under this model, the original data provider's contribution to GBIF isn't touched. A user adding an annotation amounts to adding a copy of the record, with some differences (corresponding to the user's edits). Now, the data provider may choose to accept those edits, in which case they can edit their own database using whatever system they have in place, and then the next time GBIF re-harvests the data, the original record in GBIF gets updated with the new data (this assumes that data providers have stable ids for their records). Under this approach we free ourselves from thinking about complicated messaging protocols between providers and aggregators, and we also free ourselves from having to wait until an edit is "approved" by a provider. Any annotation is available instantly.

Summary

My goal here is to sketch out what I think is a straightforward way to tackle annotation that makes use of what GBIF is already doing (aggregating Darwin Core Archives) or will have to do real soon now (cluster duplicates). The annotated and cleaned data can, of course, live anywhere (and I'm suggesting that it could live on github and be archived on Zenodo), so people who clean and edit data are not simply doing it for the good of GBIF, they are creating data sets that can be used and cited independently. Likewise, even if somebody goes to the trouble of fixing a single record in GBIF, they get a citable unit of work that will be linked to their academic profile (via ORCID).

Another aspect of this approach is that we don't actually need to wait for GBIF to do this. If we adopt Darwin Core Archive as the format for annotations, we can create annotations, mint DOIs, and build our own database of annotated data, with a view to being able to move that work to GBIF if and when GBIF is ready.

Saturday, June 02, 2012

Linking NCBI taxonomy to GBIF


In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that GBIF taxon ids are stable, unlike NCBI tax_ids, which are fairly persistent (NCBI publishes deletion/merge lists when it makes changes). Then there are the obvious problems with the GBIF taxonomy itself. But if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest way is going to be via GBIF.

The mapping is being made automatically, with some crude checks to try to avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.
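For anyone wanting to reproduce this kind of mapping today, GBIF's current species-match web service (which post-dates this post) makes the "crude checks" easy to express. A sketch, with the confidence threshold chosen arbitrarily:

import json
import urllib.parse
import urllib.request

def gbif_id_for_name(name):
    """Return a GBIF taxon id for a name, or None if the match looks risky."""
    url = ("https://api.gbif.org/v1/species/match?" +
           urllib.parse.urlencode({"name": name}))
    with urllib.request.urlopen(url) as response:
        match = json.load(response)
    # Insist on an exact, confident match to reduce the chance of grabbing a homonym.
    if match.get("matchType") != "EXACT" or match.get("confidence", 0) < 95:
        return None
    return match.get("usageKey")

print(gbif_id_for_name("Hyla japonica"))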

As an example, the page http://iphylo.org/linkout/Ncbi:109175 is for the frog Hyla japonica (NCBI tax_id 109175) and shows links to Wikipedia (http://en.wikipedia.org/wiki/Japanese_Tree_Frog) and to GBIF (http://data.gbif.org/species/2427601/). There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.

[Screenshot: GBIF occurrence density map for Hyla japonica]

So, we have a wiki page, how do we answer Rutger's original question: how to get GBIF occurrence records via web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the wiki. You can get this by clicking on the RDF icon near the bottom of the page, or go to http://iphylo.org/linkout/Special:ExportRDF/Ncbi:109175. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:

<rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>

So, arm yourself with XPath, a regular expression, or, if you are a serious RDF geek, break out SPARQL, and you can extract the GBIF taxon id for an NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, http://data.gbif.org/ws/rest/density/list?taxonconceptkey=2427601 will get you the squares shown in the screenshot above.
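Putting the steps together, here is one way to do the extraction in Python (a sketch; the data.gbif.org URLs it pulls out have since been retired, so treat this as illustrating the approach rather than as working code for today):

import re
import urllib.request
import xml.etree.ElementTree as ET

RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

def gbif_id_from_linkout(ncbi_tax_id):
    """Extract the GBIF taxon id from the RDF export for an NCBI taxon."""
    url = "http://iphylo.org/linkout/Special:ExportRDF/Ncbi:%s" % ncbi_tax_id
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    for see_also in tree.iter(RDFS + "seeAlso"):
        resource = see_also.get(RDF + "resource", "")
        m = re.match(r"http://data\.gbif\.org/species/(\d+)", resource)
        if m:
            return m.group(1)  # e.g. "2427601" for Hyla japonica (tax_id 109175)
    return None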

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Tuesday, October 15, 2013

What can Global Biodiversity Information Facility (GBIF) do for you?

I've recently been appointed Chair of the Science Committee of the Global Biodiversity Information Facility (GBIF) http://www.gbif.org [1]. The committee is a small group of people with a range of backgrounds, and one of our roles is to advise GBIF on matters scientific (e.g., what kinds of data should GBIF collect? what kinds of scientific questions should GBIF help answer?).

There have been formal surveys (see the papers in the journal "Biodiversity Informatics", https://journals.ku.edu/index.php/jbi/issue/view/370/showToc), meetings, and a "vision" statement (the "Global Biodiversity Informatics Outlook", http://www.biodiversityinformatics.org/). But there's always the chance that these fora may miss some points of view, so I'm keen to get feedback on what sort of things GBIF could do to improve the way it helps people tackle the scientific questions they are interested in.

For example, is there some fundamental limitation that GBIF has that prevents it being useful to you? Is there some feature/data type/geographic coverage/etc. that could be addressed that would make it more useful? Is there a role that GBIF should take on but hasn't? A useful analogy might be the central role GenBank plays in genomics: it is a place to archive your data (sequences), a repository of other people's data that you can access, and a research tool (e.g., BLAST searches to locate similar sequences). Is that the sort of thing you'd want from GBIF, or is it something entirely different?

I'd welcome any comments, suggestions, views, etc. Feel free to add them as comments to this blog, or email me (rdmpage at gmail.com).

I should stress that this is simply me trying to calibrate my perception of GBIF's role against what others think. Also note that if you have specific comments on things such as the GBIF web site, please use the feedback tab on the site (that way they will reach the people who can do something about them).

[1] For those unfamiliar with GBIF, its mission "is to make the world's biodiversity data freely and openly available via the Internet". At present the bulk of the data are observations of organisms (mostly multicellular eukaryotes, i.e., animals, plants and fungi) based on either museum collections or observations of living organisms. You can get an idea of the kind of science that uses GBIF-hosted data from this list of papers on Mendeley http://www.mendeley.com/groups/1068301/gbif-public-library/

Updates


Based on responses so far I'll compile a list below of suggestions/themes.

Annotation

  • Have the ability to annotate records (e.g., flag errors), and some mechanism by which those annotations get incorporated into GBIF and/or passed back to primary data providers.

Dashboard/gap analysis

  • For any search provide information on how complete and/or representative the data is likely to be (for example, are vertebrates over-represented, what is the extent of sampling in this area, etc.).

Geographic coverage

  • Fill big gaps in coverage (e.g., Russia, China, much of the tropics).

Genomics

  • Link GBIF occurrence records to sequences in GenBank

Provenance

  • Who identified the specimen?
  • Details on georeferencing (esp. if not GPS)

Data types

  • DNA sequences
  • abundance

Data sources

  • GenBank
  • Literature records (e.g., data mining published papers)
    Meier, R., & Dikow, T. (2004). Significance of Specimen Databases from Taxonomic Revisions for Estimating and Mapping the Global Species Diversity of Invertebrates and Repatriating Reliable Specimen Data. Conservation Biology, 18(2), 478–488. doi:10.1111/j.1523-1739.2004.00233.x
  • "Gray" literature, e.g. field books, reports

Identifiers

  • Lack of stable identifiers for occurrences
  • Contributors of specimen data not (yet) in an institution have to mint their own identifiers, with no way of linking those to any future identifier minted by the institution that will eventually house their collection

Interface

  • Being able to refine taxon search by geographic region
  • Search on any Darwin Core field
  • Wild card search
  • Support for GIS data formats
  • Search using arbitrary bounding polygons (e.g., draw a shape on a map)

Timeliness



Tuesday, August 19, 2014

Guest post: Response to the discussion on Red List assessments of East African chameleons

This is a guest post by Angelique Hjarding in response to discussion on this blog about the paper below.
Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427
Thank you for highlighting our recent publication and for the very interesting comments. We wanted to take the opportunity to address some of the issues brought up in your review and in reader comments.

One of the most important issues raised is the sharing of cleaned and vetted datasets. It has been suggested that the datasets used in our study be uploaded to a repository where they can be cited and shared. This is possible for the data downloaded from GBIF, as GBIF has already done the legwork of obtaining data-sharing agreements with the contributing organizations. So, as long as credit is properly given to the source of the data, publicly sharing data accessed through GBIF should be acceptable. At the time the manuscript was submitted for publication, we were unaware of sites such as http://figshare.com where the data could be stored and shared at no additional cost to the contributor. The GBIF-based dataset used in the study has now been made available in this way.
Angelique Hjarding. (2014). Endemic Chameleons of Kenya and Tanzania. Figshare. doi:10.6084/m9.figshare.1141858


It starts to get tricky when doing the same for the expert-vetted data. This dataset consists primarily of data gathered by the expert from museum records and published literature. So in this case it is not a question of why the expert doesn't share. The question is why the museum data and any additional literature records are not on GBIF already. As has been pointed out in our analysis (and confirmed by Rod), most of these museums do not currently have data-sharing agreements with GBIF. Therefore, the expert who compiled the data does not have the museums' permission to share their data second hand. Bottom line: all of the data used in this study that was not accessed through GBIF is currently available from the sources directly, for anyone who wants to take the time to contact the museums for permission to use their data for research and to compile it.

We also do not believe that museums which have not yet shared their data with forums such as GBIF are to blame. Mobilisation of data is an enormous task, and near impossible if funds and staff are not available. With regard to the particular comment on the lack of data sharing by NHML and other museums, we need to recognise what the task at hand would mean, and rather address ways such a monumental, and valuable, collection could be mobilised. A further issue should be raised around literature records that are not necessarily encapsulated in museum collections, but are buried in old and obscure manuscripts. To our knowledge, there is no way to mobilise those records either, because they are not attached to a specimen. Further, because there are no specimens, extreme care must be taken if such records were to be mobilised, in order to ensure quality control. Again, expert knowledge would be highly beneficial, yet these things take time and require funds.

Another issue that was raised is why we didn't go directly to GBIF to fix the records. The point of our research was not to clean and update GBIF/museum data but to evaluate the effect of expert vetting and museum data mobilisation in an applied conservation setting. As has been pointed out, the lead author was working at GBIF during the course of the research. An effort was made to provide a checklist of the updated taxonomy to GBIF at the time, but there was no GBIF mechanism for providing updates. This appears to still be the case. In addition, two GBIF staff provided comments on the paper and were acknowledged for their input. We are happy to provide an updated taxonomy to help improve the data quality, should some submission tool for updates be made available.

Finally, we would like to address the question: why use GBIF data if we know it needs some work before it can be used? We believe this is a very important debate for at least two reasons. First, when data is made public, we believe many researchers work under the assumption that it is ready for use with minimal further work: that the taxonomy is up to date, that the records are in the right place, and that the records relate to the name attached to them. Many of the papers that have used GBIF data have undertaken broad-scale macroecological analyses where, perhaps, the errors we have shown matter little. But some of these synthetic studies have also proposed that their results can be used for decision making by companies, which starts to raise concerns, especially if a company wants to know the exact species that its activities could impact. As we have shown, for chameleons at least, such advice would be hard to provide using the raw GBIF data.

Second, we are aware that there is another group of researchers using GBIF data who "know that to use GBIF's data you need to do a certain amount of previous work and run some tests, and if the data does not pass the tests, you don't use it." We are not sure of the tests that are run, and it would be useful to have these spelled out for broader debate and potentially the development of some agreed protocols for data cleaning for various uses.
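As a starting point for that debate, here is a sketch of some obvious candidate tests (field names follow Darwin Core; the selection of tests and the decision to exclude rather than "fix" failing records are illustrative choices, not an agreed protocol):

def basic_problems(rec):
    """Return a list of reasons to set a GBIF record aside, or [] if it passes."""
    problems = []
    lat = rec.get("decimalLatitude")
    lon = rec.get("decimalLongitude")
    if lat is None or lon is None:
        problems.append("not georeferenced")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        problems.append("coordinates out of range")
    elif lat == 0 and lon == 0:
        problems.append("suspicious 0,0 coordinates")
    if not rec.get("scientificName"):
        problems.append("no scientific name")
    if not rec.get("country") and not rec.get("countryCode"):
        problems.append("no country to cross-check coordinates against")
    return problems

# Records failing any test would be excluded from analysis rather than silently "fixed".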

Our underlying reason for writing the paper was not to enter into a debate about which data are better, GBIF's or an expert-compiled dataset. We are extremely pleased that GBIF data exist and are freely available for the use of all. This certainly has to be part of the future of 'better data for better decisions', but we are concerned that we should not just accept that the data is the best we can get; we should instead look for ways to improve it, for all kinds of purposes. As such, we would like to suggest that the discussion focus some energy on ways to address the shortcomings of the present system, and also that the community who would benefit from the data address ways to assist the data holders in mobilising their information, in terms of accessing the resources required to digitise data, make it available, and maintain updated taxonomy for their holdings. In an era of declining funding for museum-based taxonomy in many parts of the world, this is certainly a challenge that needs to be addressed.

We welcome further discussion as this is a very important topic, not only for conservation but also in terms of improved access to biodiversity knowledge, which is critical for many reasons.

Angelique Hjarding http://orcid.org/0000-0002-9279-4893
Krystal Tolley
Neil Burgess

Monday, September 18, 2017

Guest post: Our taxonomy is not your taxonomy

The following is a guest post by Bob Mesibov.

Do you know the party game "Telephone", also known as "Chinese Whispers"? The first player whispers a message in the ear of the next player, who passes the message in the same way to a third player, and so on. When the last player has heard the whispered message, the starting and finishing versions of the message are spoken out loud. The two versions are rarely the same. Information is usually lost, added or modified as the message is passed from player to player, and the changes are often pretty funny.

I recently compared ca 100 000 beetle records as they appear in the Museums Victoria (NMV) database and in Darwin Core downloads from the Atlas of Living Australia (ALA) and the Global Biodiversity Information Facility (GBIF). NMV has its records aggregated by ALA, and ALA passes its records to GBIF. The "Telephone" effect in the NMV to ALA to GBIF comparison was large and not particularly funny.

Many of the data changes occur in beetle names. ALA checks the NMV-supplied names against a look-up table called the National Species List, which in this case derives from the Australian Faunal Directory (AFD). If no match is found, ALA generalises the record to the next higher supplied taxon, which it also checks against the AFD. ALA also replaces supplied names if they are synonyms of an accepted name in the AFD.

GBIF does the same in turn with the names it gets from ALA. I'm not 100% sure what GBIF uses as a beetle look-up table or tables, but in many other cases its GBIF Backbone Taxonomy mirrors the Catalogue of Life.
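My guess at the look-up logic involved is something like the sketch below (a reconstruction of the behaviour described above, not ALA's or GBIF's actual code): try the supplied name against the look-up table, follow synonymy to an accepted name, and generalise to the next higher supplied taxon when nothing matches.

def resolve(supplied_name, lookup, synonym_of, higher_taxon):
    """Return the accepted name a record would be filed under, or None."""
    name = supplied_name
    while name is not None:
        if name in lookup:
            seen = set()
            # Follow chains of synonymy (like the Schaufussia -> Tyrus -> Rytus
            # example below), guarding against circular entries in the table.
            while name in synonym_of and name not in seen:
                seen.add(name)
                name = synonym_of[name]
            return name
        # No match: generalise the record to the next higher supplied taxon.
        name = higher_taxon.get(name)
    return None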

To give you some idea of the magnitude of the changes: of ca 85 000 NMV records supplied with a genus+species combination, about one in five finished up in GBIF with a different combination. The "taxonRank" changes are summarised in the overview below; note that replacement ALA and GBIF taxon names at the same rank are often different:

[Table: summary of "taxonRank" changes (records generalised to higher ranks) from NMV to ALA to GBIF]

Of the species that escaped generalisation to a higher taxon, there are 42 names with genus triples: three different genus names for the same taxon in NMV, ALA and GBIF.

Just one example: a paratype of the staphylinid Schaufussia mona Wilson, 1926 is held in NMV. The record is listed under Rytus howittii (King, 1866) in the ALA Darwin Core download, because AFD lists Schaufussia mona as a junior subjective synonym of Tyrus howitti King, 1866, and Tyrus howittii in AFD is in turn listed as a synonym of Rytus howittii (King, 1866). The record appears in GBIF under Tyraphus howitti (King, 1865), with Rytus howittii (King, 1866) listed as a synonym. In AFD, Rytus howittii is in the tribe Tyrini, while Tyraphus howitti is a different species in the tribe Pselaphini.

ALA gives "typeStatus" as "paratype" for this record, but the specimen is not a paratype of Rytus howittii. In the GBIF download, the "typeStatus" field is blank for all records. I understand this may change in future. If it does, I hope the specimen doesn't become a paratype of Tyraphus howitti through copying from ALA.

There are lots of "Telephone" changes in non-taxonomic fields as well, including some geographical howlers. ALA says that a Kakadu National Park record is from Zambia and another Northern Territory record is from Mozambique, because ALA trusts the incorrect longitude provided by NMV more than it does the NMV-supplied locality text. GBIF blanks this locality text field, leaving the GBIF user with two African records for Australian specimens and no internal contradictions.

ALA trusts latitude/longitude to the extent of changing the "stateProvince" field for localities near Australian State borders, if a low-precision latitude/longitude places the occurrence a short distance away in an adjoining State.

Manglings are particularly numerous in the "recordedBy" field, where name strings are reformatted, not always successfully. Complex NMV strings suffer worst: for example, "C Oke; Charles John Gabriel" in NMV becomes "Oke, C.|null" in ALA, and "Ms Deb Malseed - Winda-Mara Aboriginal Corporation WMAC; Ms Simone Sailor - Winda-Mara Aboriginal Corporation WMAC" is reformatted in ALA as "null|null|null|null".
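The "Oke, C.|null" example suggests how this happens. A guess at the failure mode (my reconstruction, not ALA's actual code): split on the delimiter, try to parse each piece against a rigid "Initial Surname" pattern, and emit "null" for anything that doesn't fit.

import re

def naive_recorded_by(value):
    out = []
    for part in value.split(";"):
        # Only "Initial Surname" (two tokens) fits this pattern.
        m = re.fullmatch(r"\s*([A-Z][a-z]*\.?)\s+([A-Z][a-z]+)\s*", part)
        out.append("%s, %s." % (m.group(2), m.group(1)[0]) if m else "null")
    return "|".join(out)

print(naive_recorded_by("C Oke; Charles John Gabriel"))
# -> "Oke, C.|null"  ("Charles John Gabriel" has three words, so the pattern fails)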

Most of the "Telephone" effect in the NMV-ALA-GBIF comparison appears in the NMV-ALA stage. I contacted ALA by email and posted some of the issues on the ALA GitHub site; I haven't had a response and the issues are still open. I also contacted Tim Robertson at GBIF, who tells me that GBIF is working on the ALA-GBIF stage.

Can you get data as originally supplied by NMV to ALA, through ALA? Well, that's easy enough record-by-record on the ALA website, but not so easy (or not possible) for a multi-record download. Same with GBIF, but in this case the "original" data are the ALA versions.