
Friday, May 27, 2022

Round trip from identifiers to citations and back again

Note to self (basically rewriting last year's Finding citations of specimens).

Bibliographic data supports going from identifier to citation string and back again, so we can do a "round trip."

1.

Given a DOI we can get structured data with a simple HTTP fetch, then use a tool such as citation.js to convert that data into a human-readable string in a variety of formats.

Identifier → (HTTP with content negotiation) → Structured data → (CSL templates) → Human-readable string

For example: 10.7717/peerj-cs.214 → CSL-JSON → Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214
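A minimal sketch of this step in Python using only the standard library; the Accept header values come from the DOI content-negotiation service at doi.org:

```python
import urllib.request

# Ask doi.org for machine-readable metadata rather than a redirect to the
# publisher's landing page. "application/vnd.citationstyles.csl+json"
# returns CSL-JSON; "text/x-bibliography; style=apa" would return a
# formatted reference directly.
doi = "10.7717/peerj-cs.214"
req = urllib.request.Request(
    "https://doi.org/" + doi,
    headers={"Accept": "application/vnd.citationstyles.csl+json"})

# Actually fetching needs a network connection:
# import json
# csl = json.load(urllib.request.urlopen(req))
# print(csl["title"])
```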

2.

Going in the reverse direction (string to identifier) is a little more challenging. In the "old days" a typical strategy was to parse the citation string into structured data (see AnyStyle for a nice example of this), extract a tuple of (journal, volume, starting page), and use that to query CrossRef to see whether there was an article with that tuple, which gave us the DOI.

Human-readable string → (citation parser) → journal, volume, starting page → (OpenURL query) → Identifier

For example: Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. → (PeerJ Computer Science, 5, e214) → 10.7717/peerj-cs.214

3.

Another strategy is to take all the citation strings for each DOI, index them in a search engine, then use a simple search to find the best match to your citation string, and hence the DOI. This is what https://search.crossref.org does.

Human-readable string → (search) → Identifier

For example: Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. → 10.7717/peerj-cs.214
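The search-based approach can be sketched against the CrossRef REST API's bibliographic query; treat the parameter details here as something to check against the api.crossref.org documentation:

```python
from urllib.parse import urlencode

# Match a free-text citation string against CrossRef's index; the
# best-scoring work (if the match is good) gives us the DOI.
citation = ("Willighagen, L. G. (2019). Citation.js: a format-independent, "
            "modular bibliography tool for the browser and command line. "
            "PeerJ Computer Science, 5, e214.")
url = ("https://api.crossref.org/works?" +
       urlencode({"query.bibliographic": citation, "rows": 1}))

# Actually fetching needs a network connection:
# import json, urllib.request
# best = json.load(urllib.request.urlopen(url))["message"]["items"][0]
# print(best["DOI"])
```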

At the moment my work on material citations (i.e., lists of specimens in taxonomic papers) is focussing on 1 (generating citations from specimen data in GBIF) and 2 (parsing citations into structured data).

Friday, June 04, 2021

Thoughts on BHL, ALA, GBIF, and Plazi

If you compare the impact that BHL and Plazi have on GBIF, it's clear that BHL is almost invisible. Plazi has successfully carved out a niche in which it generates tens of thousands of datasets by text mining the taxonomic literature, whereas BHL is a participant in name only. It's not as if BHL lacks geographic data. I recently added a map display back to BioStor, where each dot is a pair of latitude and longitude coordinates mentioned in an article derived from BHL's scans.

This data has the potential to fill in gaps in our knowledge of species distributions. For example, the Atlas of Living Australia (ALA) shows the following map for the cladoceran (water flea) Simocephalus:

Compare this to the localities mentioned in just one paper on this genus:

Timms, B. V. (1989). Simocephalus Schoedler (Cladocera: Daphniidae) in tropical Australia. The Beagle, 6, 89–96. Retrieved from https://biostor.org/reference/241776

There are records in this paper for species that currently have no records at all in ALA (e.g., Simocephalus serrulatus):

As it stands BioStor simply extracts localities, it doesn't extract the full "material citation" from the text (that is, the specimen code, date collected, locality, etc. for each occurrence). If it did, it would then be in a position to contribute a large amount of data to ALA and GBIF (and elsewhere). Not only that, if it followed the Plazi model this contribution would be measurable (for example, in terms of numbers of records added, and numbers of data citations). Plazi makes some of its parsing tools available as web services (e.g., http://tb.plazi.org/GgWS/wss/test and https://github.com/gsautter/goldengate-webservices), so in principle we could parse BHL content and extract data in a form usable by ALA and GBIF.

Notes on Plazi web service

The endpoint is http://tb.plazi.org/GgWS/wss/invokeFunction and it accepts POST requests, e.g. data=Namibia%3A%2058%20km%20W%20of%20Kamanjab%20Rest%20Camp%20on%20road%20to%20Grootberg%20Pass%20%2819%C2%B038%2757%22S%2C%2014%C2%B024%2733%22E%29&functionName=GeoCoordinateTaggerNormalizing.webService&dataUrl=&dataFormat=TXT and returns XML.
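Decoded, that payload is just a locality string plus the function name. A sketch in Python that rebuilds the request body (the actual call needs a network connection, so it is left commented out):

```python
from urllib.parse import unquote, urlencode

# The example "data" parameter from above, URL-decoded for readability.
encoded = ("Namibia%3A%2058%20km%20W%20of%20Kamanjab%20Rest%20Camp%20on%20"
           "road%20to%20Grootberg%20Pass%20%2819%C2%B038%2757%22S%2C%20"
           "14%C2%B024%2733%22E%29")
text = unquote(encoded)

# Rebuild the POST body for the invokeFunction endpoint.
body = urlencode({
    "data": text,
    "functionName": "GeoCoordinateTaggerNormalizing.webService",
    "dataUrl": "",
    "dataFormat": "TXT",
})

# import urllib.request
# req = urllib.request.Request("http://tb.plazi.org/GgWS/wss/invokeFunction",
#                              data=body.encode("utf-8"), method="POST")
# xml = urllib.request.urlopen(req).read().decode("utf-8")
```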

Wednesday, October 21, 2020

GBIF Challenge success

Somewhat stunned by the fact that my DNA barcode browser I described earlier was one of the (minor) prizewinners in this year's GBIF Ebbe Nielsen Challenge. For details on the winner and other place getters see ShinyBIOMOD wins 2020 GBIF Ebbe Nielsen Challenge. Obviously I'm biased, but it's nice to see the challenge inspiring creativity in biodiversity informatics. Congratulations to everyone who took part.

My entry is live at https://dna-barcode-browser.herokuapp.com. I had a difficult time keeping it online over the summer due to meow attacks, but managed to sort that out. Turns out the cloud provider I used to host Elasticsearch switched from securing the server by default to making it completely open to anyone, and I'd missed that change.

Given that the project was a success, I'm tempted to revisit it and explore further ways to combine phylogenetic trees in a biodiversity browser.


Wednesday, July 22, 2020

DNA barcode browser

Motivated by the 2020 Ebbe Nielsen Challenge I've put together an interactive DNA barcode browser. The app is live at https://dna-barcode-browser.herokuapp.com.


A naturalist from the 19th century would find little in GBIF that they weren't familiar with. We have species in a Linnean hierarchy, their distributions plotted on a map. This method of summarising data is appropriate to much of the data in GBIF, but impoverishes the display of sequence data such as barcodes. Given a set of DNA barcodes we can compute a phylogeny for those sequences, and gain evidence for taxonomic groups, intraspecific genetic structure, etc. So I wanted to see if it was possible to make a simple tool to interactively explore barcode data. This means we need fast methods for searching for similar sequences and for building phylogenies. I've been experimenting with ways to do this for the last couple of years, but have only now managed to put something together. For more details, see the repository. There is also a quick introductory video.

Monday, March 23, 2020

Darwin Core Million promo: best and worst

The following is a guest post by Bob Mesibov.
There's still time (to 31 March) to enter a dataset in the 2020 Darwin Core Million, and by way of encouragement I'll celebrate here the best and worst Darwin Core datasets I've seen.
The two best are real stand-outs because both are collections of IPT resources rather than one-off wonders.


The first is published by the Peabody Museum of Natural History at Yale University. Their IPT website has 10 occurrence datasets totalling ca 1.6M records updated daily, and I've only found minor data issues in the Peabody offerings. A recent sample audit of the 151,138 records with 70 populated Darwin Core fields in the botany dataset (as of 2020-03-18) showed refreshingly clean data:
  • entries correctly assigned to DwC fields
  • no missing-but-expected entry gaps
  • consistent, widely accepted vocabularies and formatting in DwC fields
  • no duplicate records
  • no character encoding errors
  • no gremlin characters
  • no excess whitespace or fancy alternatives to simple ASCII characters
The dataset isn't perfect (occurrenceRemarks entries are truncated at 254 characters), but other errors are scarce and easily fixed, such as
  • 14 records with plant taxa mis-classified as animals
  • 4 records with dateIdentified earlier than eventDate
  • minor pseudo-duplication in several fields, e.g. "Anna Murray Vail; Elizabeth G. Britton" and "Anne Murray Vail; Elizabeth G. Britton" in recordedBy
  • minor content errors in some entries, e.g. "tissue frozen; tissue frozen" and "|" (with no other characters in the entry).
I doubt if it would take more than an hour to fix all the Peabody Museum issues besides the truncation one, which for an IPT dataset with 10.5M data items is outstanding. There are even fields in which the Museum has gone beyond what most data users would expect. Entries in vernacularName, for example, are semicolon-separated hierarchies of common names: "dwarf snapdragon; angiosperms; tracheophytes; plants" for Chaenorhinum minus.

The second IPT resource worth commending comes from GBIF Portugal and consists of 108 checklist, occurrence record and sampling event datasets. As with the Peabody resource, the datasets are consistently clean with only minor (and scattered) structural, format or content issues.

The problems appearing most often in these datasets are "double-encoding" errors in Portuguese words and no-break spaces in place of plain spaces, and for both of these we can probably blame the use of Windows programs (like Excel) at the contributing institutions. An example of double-encoding: the Portuguese "prôximo" is first encoded in UTF-8 as a 2-byte character, then read by a Windows program as two separate single-byte characters, then converted back to UTF-8, resulting in the gibberish "prÃ´ximo". A large proportion of the no-break spaces in the Portuguese datasets unfortunately occur in taxon name strings, which don't parse correctly and which GBIF won't taxon-match.
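The round trip is easy to reproduce; a short demonstration in Python, using latin-1 as a stand-in for the Windows code page:

```python
# Correct UTF-8 text, mistakenly re-read as single-byte characters:
# the classic double-encoding ("mojibake") error.
original = "prôximo"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # prÃ´ximo

# The damage is reversible as long as no bytes were lost:
repaired = garbled.encode("latin-1").decode("utf-8")
```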

And the worst dataset? I've seen some pretty dreadful examples from around the world, but the UK's Natural History Museum sits at the top of my list of delinquent providers. The NHM offers several million records and a disappointingly high proportion of these have very serious data quality problems. These include invalid and inappropriate entries, disagreements between fields and missing-but-expected blanks.

Ironically, the NHM's data portal allows the visitor to select and examine/download records with any one of a number of GBIF issues, like "taxon_match_none". Further, for each record the data portal reports "GBIF quality indicators", as shown in this screenshot:



Clicking on that indicator box gives the portal visitor a list of the things that GBIF found wrong with the record (a list that overlaps incompletely with the list I can find with a data audit). I'm sure the NHM sees this facility differently, but to me it nicely demonstrates that NHM has prioritised Web development over data management. The message I get is
"We know there's a lot wrong with our data, but we're not going to fix anything. Instead, we're going to hand our mess as-is to any data users out there, with cleverly designed pointers to our many failures. Suck it up, people."
In isolation NHM might be seen as doing what it can with the resources it has. In a broader context the publication of multitudes of defective records by NHM is scandalous. Institutions with smaller budgets and fewer staff do a lot better with their data — see above.

Coronavirus

If your institution is closed and you have spare work-from-home time, consider doing some data cleaning. For those not afraid of the command line, I've archived the websites A Data Cleaner's Cookbook (version 2) and its companion blog BASHing data (first 100 posts) in Zenodo with local links between the two, so that the two resources can be downloaded and used offline in any Web browser.

Tuesday, March 03, 2020

The 2020 Darwin Core Million

The following is a guest post by Bob Mesibov.

You're feeling pretty good about your institution's collections data. After carefully tucking all the data items into their correct Darwin Core fields, you uploaded the occurrence records to GBIF, the Atlas of Living Australia (ALA) or another aggregator, and you got back a great report:

  • all your scientific names were in the aggregator's taxonomic backbone
  • all your coordinates were in the countries you said they were
  • all your dates were OK (and in ISO 8601 format!)
  • all your recorders and identifiers were properly named
  • no key data items were missing

OK, ready for the next challenge for your data? Ready for the 2020 Darwin Core Million?

How it works

From the dataset you uploaded to the aggregator, select about one million data items. That could be, say, 50000 records in 20 populated Darwin Core fields, or 20000 records in 50 populated Darwin Core fields, or something in between. Send me the data for auditing before 31 March 2020 as a zipped plain-text file by email to robert.mesibov@gmail.com, together with a DOI or other identifier for their online, aggregated presence.

I'll audit datasets in the order I receive them. If I can't find any data quality problems in your dataset, I'll pay your institution AUD$150 and declare your institution the winner of the 2020 Darwin Core Million here on iPhylo. (One winner only; datasets received after the first problem-free dataset won't be checked.)

If I find data quality problems, I'll let you know by email. If you want to learn what the problems are, I'll send you a report detailing what should be fixed and you'll pay me AUD$150. At 0.3-0.75c/record, that's a bargain compared to commercial data-checking rates. And it would be really good to hear, later on, that those problems had indeed been fixed and corrected data had been uploaded to the aggregator.

What I look for

For a list of data quality problems, see this page in my Data Cleaner's Cookbook. The key problems are:

  • duplicate records
  • invalid data items
  • data items in the wrong fields
  • data items inappropriate for their field
  • truncated data items
  • records with items in one field disagreeing with items in another
  • character encoding errors
  • wildly erroneous dates or coordinates
  • incorrect or inconsistent formatting of dates, names and other data items

If you think some of this is just nit-picking, you're probably thinking of your data items as things for humans to read and interpret. But these are digital data items intended for parsing and managing by computers. "Western Hill" might not be the same as "Western Hill" in processing, for example, because the second item might have a no-break space between the words instead of a plain space. Another example: humans see these 22 variations on collector names as "the same", but computers don't.
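The "Western Hill" point is easy to demonstrate; a quick check in Python:

```python
plain = "Western Hill"         # U+0020, an ordinary space
nobreak = "Western\u00a0Hill"  # U+00A0, a no-break space

# Identical to the eye, unequal to a program.
print(plain == nobreak)  # False

# Normalising the whitespace makes them match again.
print(plain == nobreak.replace("\u00a0", " "))  # True
```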

You might also be thinking that data quality is all about data correctness. Is Western Hill really at those coordinates? Is the specimen ID correct? Is the barely legible collector name on the specimen label correctly interpreted? But it's possible to have entirely correct digital data that can't be processed by an application, or moved between applications, because the data suffer from one or more of the problems listed above.

I think my money is safe

The problems I look for are all easily found and fixed. However, as mentioned in a previous iPhylo post, the quality of the many institutional datasets that I've sample-audited ranges from mostly OK to pretty awful. I've also audited more than 100 datasets (many with multiple data tables) for Pensoft Publishers, and the occurrence records among them were never error-free. Some of those errors had vanished when the records had been uploaded to GBIF, because GBIF simply deleted the offending data items during processing (GBIF, bless 'em, also publish the original data items).

Neither institutions nor aggregators seem to treat occurrence records with the same regard for detail that you find in real scientific data, the kind that appear in tables in scientific journal articles. A comparison with enterprise data is even more discouraging. I'm not aware of any large museum or herbarium with a Curator of Data on the payroll, probably because no institution's income depends on the quality of the institution's data, and because collection records don't get audited the way company records do, for tax, insurance and good-governance purposes.

So there might be a winner this year, but I doubt it. Maybe next year. ALA has a year-long data quality project underway, and GBIF Executive Secretary Joe Miller (in litt.) says that GBIF is now paying closer attention to data quality. The 2021 Darwin Core Million prize could be yours...

Friday, December 20, 2019

GBIF metagenomics and metacrap

Yes, this is a clickbait headline, and yes, it may seem like shooting fish in a barrel to complain about crappy data in GBIF, but my point here is to raise concerns about the impact of metagenomic data on GBIF, and how difficult it may be to track down the causes of errors.

I stumbled across this example while looking for specimen records for the genus Rafflesia, which are parasitic plants famous for the spectacular size of their flowers (up to 1m across).



The GBIF map for Rafflesia shows a few outliers. Unfortunately GBIF doesn't make it easy to drill down (why oh why can't we just click on the map and see the corresponding occurrences?) so I opened the map in iSpecies and clicked on each outlier in turn. The one in Vanuatu (438164267 from the Paris museum P00577336) is identified to genus level only and has the note:
Parasite terrestre, grande fleur orange au ras du sol. Incomplète suite à prédation. Très forte odeur désagréable. Récolté par Sylvain Hugel (photo) (alcool seul)
which Google translates as:
Terrestrial parasite, large orange flower at ground level. Incomplete due to predation. Very strong unpleasant odor. Collected by Sylvain Hugel (photo) (alcohol only)
Sounds a bit like Rafflesia but there's no photo or other information available online. Likewise there's no additional data for the record from Brazil (1090499968). There is a record from Madagascar that is accompanied by a photo, but it looks nothing like Rafflesia (1261055923):


That leaves two records, 2018528337 (Rafflesia cantleyi) from off the coast of Peru, and 2014813273 (Rafflesia) off the coast of Australia. Both of these records are metagenomic. For example, occurrence 2018528337 is part of a dataset Amplicon sequencing of Tara Oceans DNA samples corresponding to size fractions for protists that, on the face of it, would be an unlikely source of occurrences of forest dwelling plants.

What we get in the GBIF occurrence record is a link to the pipeline used to generate the data (Pipeline version 4.1 - 17-Jan-2018), the sample (ERS491947), and an analysis (MGYA00167469) that summarises all the taxonomic data from the ocean water sample.



What we don't get in GBIF is an obvious way to try and figure out why GBIF thinks that large flowers live in the ocean. I followed the links from MGYA00167469 and downloaded a bunch of files, some in familiar formats (FASTA), others in formats I'd not seen before (e.g., HDF5). From the mapseq file we have the following line:

ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1 GFBU01000011.4303.6106 76 0.9634146094322205 79 3 0 6 88 1722 1804 +  sk__Eukaryota;k__Viridiplantae;p__Streptophyta;c__;o__Malpighiales;f__Rafflesiaceae;g__Rafflesia;s__Rafflesia_cantleyi 

This tells us that sequence ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1 matches GenBank sequence GFBU01000011, which is "Rafflesia cantleyi RC_11 transcribed RNA sequence" from a paper on flower development in Rafflesia cantleyi doi:10.1371/journal.pone.0167958. So, now we see why we think we have giant flowers off the coast of Peru.
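The taxonomy string at the end of that line is straightforward to unpack programmatically; a sketch (the column layout is inferred from this single example line, so treat it as an assumption):

```python
# Split the mapseq result line into whitespace-separated fields, then
# break the semicolon-delimited "rank__name" taxonomy into a dict.
line = ("ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1 "
        "GFBU01000011.4303.6106 76 0.9634146094322205 79 3 0 6 88 1722 1804 + "
        "sk__Eukaryota;k__Viridiplantae;p__Streptophyta;c__;o__Malpighiales;"
        "f__Rafflesiaceae;g__Rafflesia;s__Rafflesia_cantleyi")
fields = line.split()
query_id, hit_id, identity = fields[0], fields[1], float(fields[3])

taxonomy = {}
for item in fields[-1].split(";"):
    rank, _, name = item.partition("__")
    if name:  # skip empty ranks such as "c__"
        taxonomy[rank] = name

print(hit_id, round(identity, 2), taxonomy["s"])
```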

The rest of the line has information on the match: the oceanic sequence has a 0.96 identity with the plant sequence, has 79 matches, 3 mismatches, and no gaps, which suggests that this is a short sequence. Going digging in the FASTA file I found the raw sequence, and it is indeed very short:

>ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1
GTCTAAGTGTCGTGAGAAGTTCGTTGAACCTGATCATTTAGAGGAAGTAGAAGTCGTAAC
AAGGTTTCCGTAGGTGAACCTGCGGAAGG


This short string is the evidence for Rafflesia in the ocean. Out of curiosity I ran this sequence through BLAST:

Query  1     GTCTAAGTGTCGTGAGAAGTTCGTTGAACCTGATCATTTAGAGGAAGTAGAAGTCGTAAC  60
             |||||||||||||| |||||||||||||||| ||||||||||||||| ||||||||||||
Sbjct  1683  GTCTAAGTGTCGTGGGAAGTTCGTTGAACCTTATCATTTAGAGGAAGGAGAAGTCGTAAC  1742

Query  61    AAGGTTTCCGTAGGTGAACCTGCGGAAGG  89
             |||||||||||||||||||||||||||||
Sbjct  1743  AAGGTTTCCGTAGGTGAACCTGCGGAAGG  1771

The top hit is Bathycoccus prasinos, a picoplanktonic alga with a world-wide distribution. This seems like a more plausible identification for this sequence (all the top 100 hits are very similar to each other, many are labelled as Bathycoccus).



So, there's something clearly amiss with the analysis of this dataset. Someone who knows more about metagenomics than I do will be better placed to explain why this pipeline got it so wrong, and how common this issue is.

Given the scale and automation of metagenomics, there will always be errors - that is inevitable. What we need is ways to catch those errors, especially ones that are going to "pollute" existing distribution data with spurious records (flowers in the ocean). And in a sense, GBIF excels at this in that it exposes data to a wider audience. If you work on marine microbiology you might not notice that your sequences apparently include forest plants, but if you work on forest plants you will almost certainly notice sequences occurring in the ocean.

A key feature of GBIF that makes it so useful is that, unlike many data repositories, it does not treat data as a "black box". GBIF is not like a library catalogue which merely tells you that they have books and where to find them, instead it is like Google Books, which can tell you which books contain a given phrase you are looking for. By opening up each dataset and indexing the contents, GBIF enables us to do data analysis (in much the same way that GenBank isn't just a catalogue of sequences, it enables you to search the sequences themselves).

This is a feature we risk losing if we treat metagenomics data as a black box. The Tara Oceans data that GBIF receives is simply a list of taxa at a locality, it's a checklist. We have to take it on trust that the taxonomic assignments are accurate, and it is not a trivial task to diagnose errors. Compare this to having the photo that accompanied the record from Madagascar, which helps us determine that the identification is wrong. Going forward, it would be helpful if we had metagenomic sequences available as part of the data download from GBIF. It's also worth considering whether GBIF should start doing its own analysis of sequence data, or asking its contributors to check that their taxonomic assignments are correct (e.g., running BLAST on the data). Otherwise GBIF users may end up having to filter their data for a growing number of completely spurious records.

Update

Looks like occurrence 1261055923 is Langsdorffia:


Thursday, November 15, 2018

Geocoding genomic databases using GBIF

I've put a short note up on bioRxiv about ways to geocode nucleotide sequences in databases such as GenBank. The preprint is "Geocoding genomic databases using GBIF" https://doi.org/10.1101/469650.

It briefly discusses using GBIF as a gazetteer (see https://lyrical-money.glitch.me for a demo) to geocode sequences, as well as other approaches such as specimen matching (see also Nicky Nicolson's cool work "Specimens as Research Objects: Reconciliation across Distributed Repositories to Enable Metadata Propagation" https://doi.org/10.6084/m9.figshare.7327325.v1).

Hope to revisit this topic at some point, for now this preprint is a bit of a placeholder to remind me of what needs to be done.

Wednesday, October 24, 2018

GBIF Ebbe Nielsen Challenge update

Quick note to express my delight and surprise that my entry for the 2018 GBIF Ebbe Nielsen Challenge came in joint first! My entry was Ozymandias - a biodiversity knowledge graph, which built upon data from sources such as ALA, AFD, BioStor, CrossRef, ORCID, Wikispecies, and BLR.

I'm still tweaking Ozymandias, for example adding data on GBIF specimens (and maybe sequences from GenBank and BOLD) so that I can explore questions such as what is the lag time between specimen collection and description of a species. The bigger question I'm interested in is the extent to which knowledge graphs (aka RDF) can be used to explore biodiversity data.

For details on the other entries visit the list of winners at GBIF. The other first place winners Lien Reyserhove, Damiano Oldoni and Peter Desmet have generously donated half their prize to NumFOCUS which supports open source data science software:

This is a great way of acknowledging the debt many of us owe to developers of open source software that underpins the work of many researchers.

I hope GBIF and the wider GBIF community found this year's Challenge to be worthwhile. I'm a big fan of anything which increases GBIF's engagement with developers and data analysts, and if the challenge runs again next year I encourage anyone with an interest in biodiversity informatics to consider taking part.

Monday, August 20, 2018

GBIF Challenge Entry: Ozymandias

I've submitted an entry for the 2018 GBIF Ebbe Nielsen Challenge. It's a couple of weeks before the deadline but I will be away then so have decided to submit early.

My entry is Ozymandias - a biodiversity knowledge graph. The name is a play on "Oz" being a nickname for Australia (much of the data for the entry comes from Australia) and on Ozymandias, a poem about hubris; attempting to link biodiversity data requires a certain degree of hubris.

The submission process for the challenge is unfortunately rather opaque compared to previous years when entries were visible to all, so participants could see what other people were submitting, and also knew the identity of the judges, etc. In the spirit of openness here is my video summarising my entry:

Ozymandias - GBIF Challenge Entry from Roderic Page on Vimeo.

There is also a background document here: https://docs.google.com/presentation/d/1UglxaL-yjXsvgwn06AdBbnq-HaT7mO4H5WXejzsb9MY/edit?usp=sharing.

I suspect this entry is not at all what the challenge is looking for, but I've used the challenge as a deadline so that I get something out the door rather than endlessly tweaking a project that only I can see. There will, of course, be endless tweaking as I explore further ways to link data, but at least this way there is something people can look at. Now, I need to spend some time writing up the project, which will require yet more self discipline to avoid the endless tweaking.

Thursday, July 05, 2018

GBIF at 1 billion - what's next?

How to cite: Page, R. (2018). GBIF at 1 billion - what's next? https://doi.org/10.59350/d8dwz-3v524

GBIF has reached 1 billion occurrences which is, of course, something to celebrate:

An achievement on this scale represents a lot of work by many people over many years, years spent developing simple standards for sharing data, agreeing that sharing is a good thing in the first place, tools to enable sharing, and a place to aggregate all that shared data (GBIF).

So, I asked a question:

My point is not to do this:

Rather it is to encourage a discussion about what happens when we have large amounts of biodiversity data. Is it the case that as we add data we simply enable more of the same kind of science, only better (e.g., more data for species distribution modelling), or do we reach a point where new things become possible?


To give a concrete example, consider iNaturalist. This started out as a Masters project to collect photos of organisms on Flickr. As you add more images you get better coverage of biodiversity, but you still have essentially a bunch of pictures. But once you have LOTS of pictures, and those are labelled with species names, you reach the point where it is possible to do something much more exciting - automatic species identification. To illustrate, I recently took the photos below:


Note the reddish tubular growths on the leaves. I asked iNaturalist to identify these photos and within a few seconds it came back with Eriophyes tiliae, the Red Nail Gall Mite. This feels like magic. It doesn't rely on complicated analysis of the image (as many earlier efforts at automated identification have done) it simply "knows" that images that look like this are typically of the galls of this mite because it has seen many such images before. (Another example of the impact of big data is Google Translate, initially based on parsing lots of examples of the same text in multiple languages.)

The "1 billion" number is not, by itself, meaningful. Rather, while we're popping the champagne and celebrating a welcome, if somewhat arbitrary, milestone, I'm hoping that someone, somewhere is thinking about whether biodiversity data on this scale enables something new.

Do I have answers? Not really, but here's one fairly small-scale example. One of the big challenges facing GBIF is getting georeferenced data. We spend a lot of time using a variety of tools and databases to convert text descriptions of collection localities into latitude and longitude. Many of these descriptions include phrases such as "5 mi NW of", and so we've developed parsers to attempt to make sense of them. All of these phrases and the corresponding latitude and longitude coordinates have ended up in GBIF. Now, this raises the possibility that, after a point, pretty much any locality phrase will be in GBIF, so a way to georeference a locality is simply to search GBIF for that locality and use the associated latitude and longitude. GBIF itself becomes the single best tool to georeference specimen data. To explore this idea I've built a simple tool on Glitch https://lyrical-money.glitch.me that takes a locality description and geocodes it using GBIF.
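The idea of GBIF as a gazetteer can be sketched against the public GBIF occurrence search API; the parameter choices here (locality, hasCoordinate) are an assumption to verify against the API documentation:

```python
from urllib.parse import urlencode

# Search indexed occurrence records for a matching verbatim locality
# string and reuse the coordinates of the hits.
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"

def geocode_query(locality):
    """Build a GBIF occurrence search URL for a locality string."""
    params = {"locality": locality, "hasCoordinate": "true", "limit": 5}
    return GBIF_SEARCH + "?" + urlencode(params)

url = geocode_query("5 mi NW of Broome")

# Actually fetching needs a network connection:
# import json, urllib.request
# results = json.load(urllib.request.urlopen(url))
# for occ in results["results"]:
#     print(occ.get("decimalLatitude"), occ.get("decimalLongitude"))
```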


You paste in a locality string and it attempts to find it on a map based on data in GBIF. This could be automated, so you could imagine georeferencing whole collections as part of the process of uploading the data to GBIF. Yes, the devil is in the details, and we'd need ways to flag errors or doubtful records, but the scale of GBIF starts to open up possibilities like this.

So, my question is, "what's next?".

Monday, June 04, 2018

Towards a biodiversity token: Bitcoin, FinTech, and a radical suggestion for the GBIF Challenge

First off, let me say that what follows is a lot of arm waving to try and obscure how little I understand what I'm talking about. I'm going to sketch out what I think is a "radical" idea for a GBIF Challenge entry.

TL;DR GBIF should issue its own cryptocurrency and use that to fund the development of the GBIF network by charging for downloading cleaned, processed data (original provider data remains free). People can buy subscriptions to get access to data, and/or purchase GBIF currency as a contribution or investment. Proceeds from the purchase of cleaned data are divided between GBIF (to fund the portal), the data providers (to reward them for making data available), and the GBIF nodes in countries included in the geographic coverage of the data (to help them build their biodiversity infrastructure). The challenge entry would involve modelling this idea and conducting simulations to test its efficacy.

The motivation for this idea comes from several sources:

1. GBIF is (under-)funded by direct contributions from governments, hence each year it essentially "begs" for money. Several rich countries (such as the United Kingdom) struggle to pay the fairly paltry sums involved. Part of the problem is that they are taking something of demonstrable value (money) and giving it to an organisation (GBIF) which has no demonstrable financial value. Hence the argument for funding GBIF is basically "it's the right thing to do". This is not really a tenable or sustainable model.

2. Many web sites provide information for "free" in that the visitor doesn't pay any money. Instead the visitor views ads and, whether they are aware of it or not, hands over large amounts of data about themselves and their behaviour (think the recent scandal involving Facebook).

3. Some people are rebelling against the "free with ads" model by seeking other ways to fund the web. For example, the Brave web browser enables you to buy BATs (Basic Attention Tokens, based on Ethereum). You can choose to send BATs to web sites that you visit (and hence find valuable). Those sites don't need to harvest your data or bombard you with ads to receive an income.

4. Cryptocurrency is being widely explored as a way to raise funding for new ventures. Many of these are tech-based, but there are some interesting developments in conservation and climate change, such as Veridium which offsets carbon emissions. There are links between efforts like Veridium and carbon offset programmes such as the Rimba Raya Biodiversity Reserve, so you can go from cryptocurrency to trees.

5. The rather ugly, somewhat patronising furore that erupted when Rwanda decided that the best way to increase its foreign currency earnings (as a step towards ultimately freeing itself from dependency on development aid) was to sign a sponsorship deal with Arsenal football club.

Now, imagine a situation where GBIF has a cryptocurrency token (e.g., the "GBIF coin"). Anyone, whether a country, an organisation, or an individual can buy GBIF coins. If you want to download GBIF data, you will need to pay in GBIF coins, either per-download or via a monthly subscription. The proceeds from each download are split in a way that supports the GBIF network as a whole. For example, imagine GBIF itself gets 30% (like Apple's App Store). The remaining 70% gets split between (a) the data providers and (b) the GBIF nodes in countries included in the data download. For example, almost all the data on a country such as Rwanda does not come from Rwanda itself, but from other countries. You want to reward anyone who makes data available, but you also want to support the development of a biodiversity data infrastructure in Rwanda (or any other country), so part of the proceeds go to the GBIF node in Rwanda.

Now, an immediate issue (apart from the merits or otherwise of blockchains and cryptocurrency) is that I'm advocating charging for access to data, which seems antithetical to open access. To be clear, I think open access is crucial. I'm suggesting that we distinguish between two classes of data. The first is the data as it is provided to GBIF. That is almost always open data under a CC0 license, and that remains free. But if you want it for free it is served as it is received. In other words, for free access to data GBIF is essentially a dumb repository (like, say, Dryad). The data is there, you can search the metadata for each dataset, so essentially you get something like the current dataset search.

The other thing GBIF does is that it processes the data, cleaning it, reconciling names and locations, and indexing it, so that if you want to search for a given species, GBIF summarises the data across all the datasets and (often) presents you with a better result than if you'd downloaded all the original data and simply merged it together yourself. This is a valuable service, and it's one of the reasons why GBIF costs money to run. So imagine that we do something like this:

  1. It is free to browse GBIF as a person and explore the data
  2. It is free to download the raw data provided by any data publisher.
  3. It costs to download cleaned data that corresponds to a specific query, e.g. all records for a particular taxon, geographic area, etc.
  4. Payment for access to cleaned data is via the GBIF coin.
  5. The cost is small, on the scale of buying a music track or subscribing to Spotify.
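As a toy model of the proposed split: the 30% GBIF share comes from the App Store analogy above, while the even division of the remainder between providers and national nodes is my own assumption for illustration, since the post doesn't specify that ratio.

```python
def split_proceeds(amount, providers, countries, gbif_share=0.30):
    """Split one download payment between GBIF, data providers, and
    the GBIF nodes of countries covered by the download.

    The 30% GBIF share follows the App Store analogy; the 50/50
    split of the remainder between providers and nodes is an
    assumption made purely for illustration.
    """
    gbif = amount * gbif_share
    rest = amount - gbif
    provider_pool, node_pool = rest / 2, rest / 2
    return {
        "gbif": gbif,
        "providers": {p: provider_pool / len(providers) for p in providers},
        "nodes": {c: node_pool / len(countries) for c in countries},
    }
```

For a hypothetical 100-coin download of Rwandan records held by a single foreign museum, GBIF would receive 30 coins, the museum 35, and the Rwandan node 35.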

Now, I don't expect GBIF to embrace this idea anytime soon. By nature it's a conservative, risk-averse organisation. But I think something like this idea deserves serious attention, ideally from people with much better understanding of the issues than my own "I saw this on Twitter therefore it must be cool" level. One way to move forward would be to model how such a system would work, based for example on data on web site visits and data downloads on the current GBIF portal. I suspect models could be built to give some idea of whether such an approach would be financially viable. It occurs to me that something like this would make a great GBIF Challenge entry, particularly as it gives a license for thinking the unthinkable with no risk to GBIF itself.

Wednesday, May 09, 2018

2018 GBIF Ebbe Nielsen Challenge now open

Last year I finished my four-year stint as Chair of the GBIF Science Committee. During that time, partly as a result of my urging, GBIF launched an annual "GBIF Ebbe Nielsen Challenge", and I'm pleased that this year GBIF is continuing to run the challenge. In 2015 and 2016 the challenge received some great entries.

Last year's challenge (GBIF Challenge 2017: Liberating species records from open data repositories for scientific discovery and reuse) didn't attract quite the same degree of attention, and GBIF quietly didn't make an award. I think part of the problem was that there's a fine balance between having a wide open challenge which attracts all sorts of interesting entries, some a little off the wall (my favourite was GBIF data converted to 3D plastic prints for physical data visualisation), versus a specific topic which might yield one or more tools that could, say, be integrated into the GBIF portal. But if you make it too narrow then you run the risk of getting fewer entries, which is what happened in 2017. Ironically, since the 2017 challenge I've come across work that would have made a great entry, such as a thesis by Ivelize Rocha Bernardo, Promoting Interoperability of Biodiversity Spreadsheets via Purpose Recognition; see also Bernardo, I. R., Borges, M., Baranauskas, M. C. C., & Santanchè, A. (2015). Interpretation of Construction Patterns for Biodiversity Spreadsheets. Lecture Notes in Business Information Processing, 397–414. doi:10.1007/978-3-319-22348-3_22.

This year the topic is pretty open:

The 2018 Challenge will award €34,000 for advancements in open science that feature tools and techniques that improve the access, utility or quality of GBIF-mediated data. Under this open-ended call, challenge submissions may build on existing tools and features, such as the GBIF API, Integrated Publishing Toolkit, data validator, relative species occurrence tool, among others—or develop new applications, methods, workflows or analyses.

Lots of scope, and since I'm no longer part of the GBIF Science Committee it's tempting to think about taking part. The judging criteria are pretty tough and results-oriented:

Winning entries will demonstrably extend and increase the usefulness, openness and visibility of GBIF-mediated data for identified stakeholder groups. Each submission is expected to demonstrate advantages for at least three of the following groups: researchers, policymakers, educators, students and citizen scientists.

So, maybe less scope for off-the-wall stuff, but an incentive to clearly articulate why a submission matters.

The actual submission process is, sadly, rather more opaque than in previous years where it was run in the open on Devpost where you can still see previous submissions (e.g., those for 2015). Devpost has lots of great features but isn't cheap, so the decision is understandable. Maybe some participants will keep the rest of the community informed via, say, Twitter, or perhaps people will keep things close to their chest. In any event, I hope the 2018 challenge inspires people to think about doing something both cool and useful with biodiversity data. Oh, and did I mention that a total of €34,000 in prizes is up for grabs? Deadline for submission is 5 September 2018.

Wednesday, January 24, 2018

Guest post: The Not problem

The following is a guest post by Bob Mesibov.

Nico Franz and Beckett Sterner created a stir last year with a preprint in bioRxiv about expert validation (or the lack of it) in the "backbone" classifications used by aggregators. The final version of the paper was published this month in the OUP journal Database (doi:10.1093/database/bax100).

To see what effect "backbone" taxonomies are having on aggregated occurrence records, I've recently been auditing datasets from GBIF and the Atlas of Living Australia. The results are remarkable, and I'll be submitting a write-up of the audits for formal publication shortly. Here I'd like to share the fascinating case of the genus Not Chan, 2016.

I found this genus in GBIF. A Darwin Core record uploaded by the New Zealand Arthropod Collection (NZAC02015964) had the string "not identified on slide" in the scientificName field, and no other taxonomic information.

GBIF processed this record and matched it to the genus Not Chan, 2016, which is noted as "doubtful" and "incertae sedis".

There are 949 other records of this genus around the world, carefully mapped by GBIF. The occurrences come from NZAC and nine other datasets. The full scientific names and their numbers of GBIF records are:

Number  Name
2       Not argostemma
14      not Buellia
1       not found, check spelling
1       Not given (see specimen note) bucculenta
1       Not given (see specimen note) ortoni
1       Not given (see specimen note) ptychophora
1       Not given (see specimen note) subpalliata
1       not identified on slide
1       not indentified
1       Not known not known
1       Not known sp.
1       not Lecania
4       Not listed
873     Not naturalised in SA sp.
18      Not payena
5       not Punctelia
18      not used
6       Not used capricornia Pleijel & Rouse, 2000

GBIF cites this article on barnacles as the source of the genus, although the name should really be Not Chan et al., 2016. A careful reading of this article left me baffled, since the authors nowhere use "not" as a scientific name.

Next I checked the Catalogue of Life. Did CoL list this genus, and did CoL attribute it to Chan? No, but "Not assigned" appears 479 times among the names of suprageneric taxa, and the December 2017 CoL checklist includes the infraspecies "Not diogenes rectmanus Lanchester,1902" as a synonym.

The Encyclopedia of Life also has "Not" pages, but these have in turn been aggregated on the "EOL pages that don't represent real taxa" page, and under the listing for the "Not assigned36" page someone has written:

This page contains a bunch of nodes from the EOL staff Scratchpad. NB someone should go in and clean up that classification.

"Someone should go in and clean up that classification" is also the GBIF approach to its "backbone" taxonomy, although they think of that as "we would like the biodiversity informatics community and expert taxonomists to point out where we've messed up". Franz and Sterner (2018) have also called for collaboration, but in the direction of allowing for multiple taxonomic schemes and differing identifications in aggregated biodiversity data. Technically, that would be tricky. Maybe the challenge of setting up taxonomic concept graphs will attract brilliant developers to GBIF and other aggregators.

Meanwhile, Not Chan, 2016 will endure and aggregated biodiversity records will retain their vast assortment of invalid data items, character encoding failures, incorrect formatting, duplications and truncated data items. In a post last November on the GitHub CoL+ pages I wrote:

Being old and cynical, I can speculate that in the time spent arguing the "politics" of aggregation in recent years, a competent digital librarian or data scientist would have fixed all the CoL issues and would be halfway through GBIF's. But neither of those aggregators employ digital librarians or data scientists, and I'm guessing that CoL+ won't employ one, either.

Friday, October 06, 2017

Notes on finding georeferenced sequences in GenBank

Notes on how many georeferenced DNA sequences there are in GenBank, and how many could potentially be georeferenced.

BCT	Bacterial sequences
PRI	Primate sequences
ROD	Rodent sequences
MAM	Other mammalian sequences
VRT	Other vertebrate sequences
INV	Invertebrate sequences
PLN	Plant and Fungal sequences
VRL	Viral sequences
PHG	Phage sequences
RNA	Structural RNA sequences
SYN	Synthetic and chimeric sequences
UNA	Unannotated sequences

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=nucleotide nucleotides
&term=ddbj embl genbank with limits[filt]
NOT transcriptome[All Fields] ignore transcriptome data
NOT mRNA[filt] ignore mRNA data
NOT TSA[All Fields] ignore TSA
NOT scaffold[All Fields] ignore scaffold
AND src lat lon[prop] include records that have source feature "lat_lon"
AND 2010/01/01:2010/12/31[pdat] from this date range
AND gbdiv_pri[PROP] restrict search to PRI division (primates)
AND srcdb_genbank[PROP] Need this if we query by division, see NBK49540
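A query like the one above can be assembled programmatically. The sketch below builds the esearch URL for one GenBank division and one publication year, using the field qualifiers from the notes above (retmode=json is a standard E-utilities option); actually fetching and parsing the count from the response is left out.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def georef_count_url(division, year):
    """Build an esearch URL counting nucleotide sequences with a
    lat_lon source feature for one GenBank division and year."""
    term = " ".join([
        "ddbj embl genbank with limits[filt]",
        "NOT transcriptome[All Fields]",   # ignore transcriptome data
        "NOT mRNA[filt]",                  # ignore mRNA data
        "NOT TSA[All Fields]",             # ignore TSA
        "NOT scaffold[All Fields]",        # ignore scaffolds
        "AND src lat lon[prop]",           # must have a lat_lon source feature
        f"AND {year}/01/01:{year}/12/31[pdat]",
        f"AND gbdiv_{division.lower()}[PROP]",
        "AND srcdb_genbank[PROP]",         # needed when querying by division
    ])
    return EUTILS + "?" + urlencode({"db": "nucleotide",
                                     "term": term,
                                     "retmode": "json"})
```

Looping `georef_count_url(div, year)` over the divisions and years would reproduce the counts tabulated below.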

Numbers of nucleotide sequences that have latitude and longitudes in GenBank for each year.

Date        PRI     ROD     MAM     VRT     INV     PLN
2010/01/01412725529551926927174
2011/01/013711204816017657784947968
2012/01/01658034214216968406027314
2013/01/01297349761107647041123435
2014/01/011529044761145986807614018
2015/01/0117452719831784336353835501
2016/01/0158261512631489875789322813
2017/01/0193817581017107127506628180

Numbers of nucleotide sequences that don't have latitude and longitudes in GenBank for each year but do have the country field and hence could be georeferenced.

Date        PRI     ROD     MAM     VRT     INV     PLN
2010/01/01666026545534326666257756692
2011/01/01399832666210337177401598664
2012/01/015377559072835533286945103379
2013/01/011092848058013663736971995817
2014/01/019727349267515991377816135372
2015/01/0189226774139646057885867167337
2016/01/0164303384108606223895711145111
2017/01/0111474352049124115991219109747

Tuesday, October 03, 2017

TDWG 2017: thoughts on day 1

Some random notes on the first day of TDWG 2017. First off, great organisation with the first usable conference calendar app that I've seen (https://tdwg2017.sched.com).

I gave the day's keynote address in the morning (slides below).

It was something of a stream of consciousness brain dump, and tried to cover a lot of (maybe too much) stuff. Among the topics I covered were Holly Bik's appeal for better links between genomic and taxonomic data, my iSpecies tool, some snarky comments on the Semantic Web (and an assertion that the reason that GenBank succeeded was due more to network effects than journals requiring authors to submit sequences there), a brief discussion of Wikidata (including using d3sparql to display classifications, see here), and the use of Hexastore to query data from BBC Wildlife. I also talked about Ted Nelson, Xanadu, using hypothes.is to annotate scientific papers (see Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday), social factors in building knowledge graphs (touching on ORCID and some of the work by Nico Franz discussed here), and ended with some cautionary comments on the potential misuse of metrics based on knowledge graphs (using "league tables" of cited specimens, see GBIF specimens in BioStor: who are the top ten museums with citable specimens?).

TDWG is a great opportunity to find out what is going on in biodiversity informatics, and also to get a sense of where the problems are. For example, sitting through the Financial Models for Sustaining Biodiversity Informatics Products session you couldn't help being struck by (a) the number of different projects all essentially managing specimen data, and (b) the struggle they all face to obtain funding. If this was a commercial market there would be some pretty drastic consolidation happening. It also highlights the difficulty of providing services to a community that doesn't have much money.

I was also struck by Andrew Bentley's talk Interoperability, Attribution, and Value in the Web of Natural History Museum Data. In a series of slides Andrew outlined what he felt collections needed from aggregators, researchers, and publishers, e.g.:

Chatting to Andrew at the evening event at the Canadian Museum of Nature, I think there's a lot of potential for developing tools to provide collections with data on the use and impact of their collections. Text mining the biodiversity literature on a massive scale to extract (a) mentions of collections (e.g., their institutional acronyms) and (b) citations of specimens could generate metrics that would be helpful to collections. There's a great opportunity here for BHL to generate immediate value for natural history collections (many of which are also contributors to BHL).

Also had a chance to talk to Jorrit Poelen who works on Global Biotic Interactions (GloBI). He made some interesting comparisons between Hexastores (which I'd touched on in my keynote) and Linked Data Fragments.
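For readers unfamiliar with the hexastore idea mentioned in the keynote: index each triple under all six subject/predicate/object orderings, so that any query pattern becomes a direct index walk. A toy Python version (my own sketch, not GloBI's or the keynote's code):

```python
from itertools import permutations

class Hexastore:
    """Toy triple store: each (s, p, o) is indexed under all six orderings."""

    def __init__(self):
        self.indexes = {perm: {} for perm in permutations("spo")}

    def add(self, s, p, o):
        triple = {"s": s, "p": p, "o": o}
        for perm, index in self.indexes.items():
            a, b, c = (triple[k] for k in perm)
            index.setdefault(a, {}).setdefault(b, set()).add(c)

    def query(self, s=None, p=None, o=None):
        """Return matching triples; any combination of terms may be bound."""
        pattern = {"s": s, "p": p, "o": o}
        bound = [k for k in "spo" if pattern[k] is not None]
        free = [k for k in "spo" if pattern[k] is None]
        # pick the index whose key order puts the bound terms first
        k1, k2, k3 = tuple(bound + free)
        index = self.indexes[(k1, k2, k3)]
        results = []
        first = [pattern[k1]] if pattern[k1] is not None else list(index)
        for a in first:
            if a not in index:
                continue
            second = [pattern[k2]] if pattern[k2] is not None else list(index[a])
            for b in second:
                if b not in index[a]:
                    continue
                third = [pattern[k3]] if pattern[k3] is not None else index[a][b]
                for c in third:
                    if c in index[a][b]:
                        results.append({k1: a, k2: b, k3: c})
        return results
```

With triples like ("Lion", "eats", "Zebra") loaded, a "what eats Zebra?" query is a single dictionary walk, which is the appeal of the scheme for interaction data like GloBI's.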

The final session I attended was Towards robust interoperability in multi-omic approaches to biodiversity monitoring. The overwhelming impression was that there is a huge amount of genomic data, much of which does not easily fit into the classic, Linnean view of the world that characterises, say, GBIF. For most of the sequences we don't know what they are, and that might not be the most interesting question anyway (more interesting might be "what do they do?"). The extent to which these data can be shoehorned into GBIF is not clear to me, although doing so may result in some healthy rethinking of the scope of GBIF itself.

Monday, September 18, 2017

Guest post: Our taxonomy is not your taxonomy

The following is a guest post by Bob Mesibov.

Do you know the party game "Telephone", also known as "Chinese Whispers"? The first player whispers a message in the ear of the next player, who passes the message in the same way to a third player, and so on. When the last player has heard the whispered message, the starting and finishing versions of the message are spoken out loud. The two versions are rarely the same. Information is usually lost, added or modified as the message is passed from player to player, and the changes are often pretty funny.

I recently compared ca 100 000 beetle records as they appear in the Museums Victoria (NMV) database and in DarwinCore downloads from the Atlas of Living Australia (ALA) and the Global Biodiversity Information Facility (GBIF). NMV has its records aggregated by ALA, and ALA passes its records to GBIF. The "Telephone" effect in the NMV to ALA to GBIF comparison was large and not particularly funny.

Many of the data changes occur in beetle names. ALA checks the NMV-supplied names against a look-up table called the National Species List, which in this case derives from the Australian Faunal Directory (AFD). If no match is found, ALA generalises the record to the next higher supplied taxon, which it also checks against the AFD. ALA also replaces supplied names if they are synonyms of an accepted name in the AFD.

GBIF does the same in turn with the names it gets from ALA. I'm not 100% sure what GBIF uses as a beetle look-up table or tables, but in many other cases their GBIF Backbone Taxonomy mirrors the Catalogue of Life.
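The matching-and-generalising behaviour described above can be caricatured in a few lines. The lookup table and synonym map here are plain Python structures standing in for the National Species List / AFD; real name matchers are of course fuzzier than exact string lookup:

```python
def match_name(supplied_names, lookup, synonyms):
    """Match a record's supplied names against a lookup table.

    `supplied_names` runs from most to least specific (species, genus,
    family, ...). Each name is first replaced by its accepted name if it
    is a known synonym; if no match is found at one rank, the record is
    generalised to the next higher supplied taxon, mimicking the
    aggregator behaviour described in the post.
    """
    for name in supplied_names:
        accepted = synonyms.get(name, name)  # replace synonyms first
        if accepted in lookup:
            return accepted
    return "incertae sedis"  # nothing matched at any rank
```

Run against a junk string like "not identified on slide", every rank fails and the record falls through to the unmatched bucket, which is roughly how "Not Chan, 2016"-style artefacts arise.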

To give you some idea of the magnitude of the changes, of ca 85000 NMV records supplied with a genus+species combination, about one in five finished up in GBIF with a different combination. The "taxonRank" changes are summarised in the overview below, and note that replacement ALA and GBIF taxon names at the same rank are often different:

Generalised

Of the species that escaped generalisation to a higher taxon, there are 42 names with genus triples: three different genus names for the same taxon in NMV, ALA and GBIF.

Just one example: a paratype of the staphylinid Schaufussia mona Wilson, 1926 is held in NMV. The record is listed under Rytus howittii (King, 1866) in the ALA Darwin Core download, because AFD lists Schaufussia mona as a junior subjective synonym of Tyrus howitti King, 1866, and Tyrus howittii in AFD is in turn listed as a synonym of Rytus howittii (King, 1866). The record appears in GBIF under Tyraphus howitti (King, 1865), with Rytus howittii (King, 1866) listed as a synonym. In AFD, Rytus howittii is in the tribe Tyrini, while Tyraphus howitti is a different species in the tribe Pselaphini.
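Following a chain of synonyms like this is simple to sketch, though a real resolver also has to guard against cycles in the lookup table. The function below is illustrative (not ALA's or GBIF's actual code), and the test names are simplified versions of the ones above, with authorities dropped:

```python
def resolve_synonym(name, synonym_of, max_hops=10):
    """Follow a chain of synonym mappings to an accepted name,
    guarding against cycles or runaway chains in the lookup table."""
    seen = {name}
    while name in synonym_of:
        name = synonym_of[name]
        if name in seen or len(seen) > max_hops:
            raise ValueError(f"synonym cycle involving {name!r}")
        seen.add(name)
    return name
```

Two aggregators holding slightly different synonym tables will resolve the same supplied name to different accepted names, which is exactly the "Telephone" effect in the paratype example.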

ALA gives "typeStatus" as "paratype" for this record, but the specimen is not a paratype of Rytus howittii. In the GBIF download, the "typeStatus" field is blank for all records. I understand this may change in future. If it does, I hope the specimen doesn't become a paratype of Tyraphus howitti through copying from ALA.

There are lots of "Telephone" changes in non-taxonomic fields as well, including some geographical howlers. ALA says that a Kakadu National Park record is from Zambia and another Northern Territory record is from Mozambique, because ALA trusts the incorrect longitude provided by NMV more than it does the NMV-supplied locality text. GBIF blanks this locality text field, leaving the GBIF user with two African records for Australian specimens and no internal contradictions.

ALA trusts latitude/longitude to the extent of changing the "stateProvince" field for localities near Australian State borders, if a low-precision latitude/longitude places the occurrence a short distance away in an adjoining State.
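An alternative to silently trusting one field over the other is to flag the contradiction and leave the record alone. A sketch, with the point-in-country lookup stubbed out as a function argument (a real implementation would reverse-geocode the coordinates against country polygons); the example coordinates in the test are illustrative, not the actual NMV values:

```python
def flag_contradictions(record, country_of):
    """Flag records whose coordinate-derived country disagrees with the
    supplied country field, rather than 'fixing' either value.

    `country_of` maps (lat, lon) to a country name; here it is a plain
    callable standing in for a point-in-polygon lookup.
    """
    issues = []
    derived = country_of(record["decimalLatitude"],
                         record["decimalLongitude"])
    if derived != record.get("country"):
        issues.append(f"coordinates fall in {derived}, "
                      f"but record says {record.get('country')}")
    return issues
```

Under this approach the Kakadu record would arrive at the aggregator carrying an explicit "coordinates fall in Zambia, but record says Australia" flag instead of a confident but wrong African dot on the map.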

Manglings are particularly numerous in the "recordedBy" field, where name strings are reformatted, not always successfully. Complex NMV strings suffer worst, e.g. "C Oke; Charles John Gabriel" in NMV becomes "Oke, C.|null" in ALA, and "Ms Deb Malseed - Winda-Mara Aboriginal Corporation WMAC; Ms Simone Sailor - Winda-Mara Aboriginal Corporation WMAC" is reformatted in ALA as "null|null|null|null".

Most of the "Telephone" effect in the NMV-ALA-GBIF comparison appears in the NMV-ALA stage. I contacted ALA by email and posted some of the issues on the ALA GitHub site; I haven't had a response and the issues are still open. I also contacted Tim Robertson at GBIF, who tells me that GBIF is working on the ALA-GBIF stage.

Can you get data as originally supplied by NMV to ALA, through ALA? Well, that's easy enough record-by-record on the ALA website, but not so easy (or not possible) for a multi-record download. Same with GBIF, but in this case the "original" data are the ALA versions.

Friday, June 30, 2017

Response to To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data

Nico Franz and Beckett W. Sterner recently published a preprint entitled "To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data" on bioRxiv http://dx.doi.org/10.1101/157214

Below is the abstract:

Growing concerns about the quality of aggregated biodiversity data are lowering trust in large-scale data networks. Aggregators frequently respond to quality concerns by recommending that biologists work with original data providers to correct errors "at the source". We show that this strategy falls systematically short of a full diagnosis of the underlying causes of distrust. In particular, trust in an aggregator is not just a feature of the data signal quality provided by the aggregator, but also a consequence of the social design of the aggregation process and the resulting power balance between data contributors and aggregators. The latter have created an accountability gap by downplaying the authorship and significance of the taxonomic hierarchies, frequently called "backbones", they generate, and which are in effect novel classification theories that operate at the core of data-structuring process. The Darwin Core standard for sharing occurrence records plays an underappreciated role in maintaining the accountability gap, because this standard lacks the syntactic structure needed to preserve the taxonomic coherence of data packages submitted for aggregation, leading to inferences that no individual source would support. Since high-quality data packages can mirror competing and conflicting classifications, i.e., unsettled systematic research, this plurality must be accommodated in the design of biodiversity data integration. Looking forward, a key directive is to develop new technical pathways and social incentives for experts to contribute directly to the validation of taxonomically coherent data packages as part of a greater, trustworthy aggregation process.

Below I respond to some specific points that annoyed me about this article; at the end I try to sketch out a more constructive response. Let me stress that although I am the current Chair of the GBIF Science Committee, the views expressed here are entirely my own.

Trust and social relations

Trust is a complex and context-sensitive concept...First, trust is a dependence relation between a person or organization and another person or organization. The first agent depends on the second one to do something important for it. An individual molecular phylogeneticist, for example, may rely on GenBank (Clark et al. 2016) to maintain an up-to-date collection of DNA sequences, because developing such a resource on her own would be cost prohibitive and redundant. Second, a relation of dependence is elevated to being one of trust when the first agent cannot control or validate the second agent's actions. This might be because the first agent lacks the knowledge or skills to perform the relevant task, or because it would be too costly to check.

Trust is indeed complex. I found this part of the article to be fascinating, but incomplete. The social network GBIF operates in is much larger than simply taxonomic experts and GBIF; there are relationships with data providers, other initiatives, a broad user community, government agencies that approve its continued funding, and so on. Some of the decisions GBIF makes need to be seen in this broader context.

For example, the article challenges GBIF for responding to errors in the data by saying that these should be "corrected at source". This is a political statement, given that data providers are anxious not to cede complete control of their data to aggregators. Hence the model that GBIF users see errors, those errors get passed back to the source (the mechanisms for this are mostly non-existent), the source fixes them, then the aggregator re-harvests. This model makes assumptions about whether sources are either willing or able to fix these errors that I think are not really true. But the point is this is less about not taking responsibility, and more about avoiding treading on toes by taking too much responsibility. Personally I think GBIF should take responsibility for fixing a lot of these errors, because it is GBIF whose reputation suffers (as demonstrated by Franz and Sterner's article).

Scalability

A third step is to refrain from defending backbones as the only pragmatic option for aggregators (Franz 2016). The default argument points to the vast scale of global aggregation while suggesting that only backbones can operate at that scale now. The argument appears valid on the surface, i.e., the scale is immense and resources are limited. Yet using scale as an obstacle it is only effective if experts were immediately (and unreasonably) demanding a fully functional, all data-encompassing alternative. If on the other hand experts are looking for token actions towards changing the social model, then an aggregator's pursuit of smaller-scale solutions is more important than succeeding with the 'moonshot'.

Scalability is everything. GBIF is heading towards a billion occurrence records and several million taxa (particularly as more and more taxa from DNA barcoding are added). I'm not saying that tractability trounces trust, but it is a major consideration. Anybody advocating a change has got to think about how these changes will work at scale.

I'm conscious that this argument could easily be used to swat away any suggestion ("nice idea, but won't scale") and hence be a reason to avoid change. I myself often wish GBIF would do things differently, and run into this problem. One way around it is to make use of the fact that GBIF has some really good APIs, so if you want GBIF to do something different you can build a proof of concept to show what could be done. If that is sufficiently compelling, then the case for trying to scale it up is going to be much easier to make.

Multiple classifications

As a social model, the notion of backbones (Bisby 2000) was misguided from the beginning. They disenfranchise systematists who are by necessity consensus-breakers, and distort the coherence of biodiversity data packages that reflect regionally endorsed taxonomic views. Henceforth, backbone-based designs should be regarded as an impediment to trustworthy aggregation, to be replaced as quickly and comprehensively as possible. We realize that just saying this will not make backbones disappear. However, accepting this conclusion counts as a step towards regaining accountability.

This strikes me as hyperbole. "They disenfranchise systematists who are by necessity consensus-breakers". Really? Having backbones in no way prevents people doing systematic research, challenging existing classifications, or developing new ones (which, if they are any good, will become the new consensus).

We suggest that aggregators must either author these classification theories in the same ways that experts author systematic monographs, or stop generating and imposing them onto incoming data sources. The former strategy is likely more viable in the short term, but the latter is the best long-term model for accrediting individual expert contributions. Instead of creating hierarchies they would rather not 'own' anyway, aggregators would merely provide services and incentives for ingesting, citing, and aligning expert-sourced taxonomies (Franz et al. 2016a).

Backbones are authored in the sense that they are the product of people and code. GBIF's is pretty transparent (code and some data on github, complete with a list of problems). Playing Devil's advocate, maybe the problem here is the notion of authorship. If you read a paper with 100's of authors, why does that give you any greater sense of accountability? Is each author going to accept responsibility for (or be able to talk cogently about) every aspect of that paper? If aggregators such as GBIF and GenBank didn't provide a single, simple way to taxonomically browse the data I'd expect it would be the first thing users would complain about. There are multiple communities GBIF must support, including users who care not at all about the details of classification and phylogeny.

Having said that, obviously these backbone classifications are often problematic and typically lag behind current phylogenetic research. And I accept that they can impose a certain view on how you can query data. GenBank for a long time did not recognise the Ecdysozoa (nematodes plus arthropods) despite the evidence for that group being almost entirely molecular. Some of my research has been inspired by the problem of customising a backbone classification to better match more modern views (doi:10.1186/1471-2105-6-208).

If handling multiple classifications is an obstacle to people using or contributing data to GBIF, then that is clearly something that deserves attention. I'm a little sceptical, in that I think this is similar to the issue of being able to look at multiple versions of a document or GenBank sequence. Everyone says it's important to have, yet I suspect very few people ever use that functionality. But a way forward might be to construct a meaningful example (in other words a live demo, not a diagram with a few plant varieties).

Ways forward

We view this diagnosis as a call to action for both the systematics and the aggregator communities to reengage with each other. For instance, the leadership constellation and informatics research agenda of entities such as GBIF or Biodiversity Information Standards (TDWG 2017) should strongly coincide with the mission to promote early-stage systematist careers. That this is not the case now is unfortunate for aggregators, who are thereby losing credibility. It is also a failure of the systematics community to advocate effectively for its role in the biodiversity informatics domain. Shifting the power balance back to experts is therefore a shared interest.

Having vented, let me step back a little and try to extract what I think the key issue is here. Issues such as error correction, backbones, and multiple classifications are important, but I suspect the real issue is the relationship between experts such as taxonomists and systematists, and large-scale aggregators (note that GBIF serves a community that is bigger than just these researchers). Franz and Sterner write:

...aggregators also systematically compromise established conventions of sharing and recognizing taxonomic work. Taxonomic experts play a critical role in licensing the formation of high-quality biodiversity data packages. Systems of accountability that undermine or downplay this role are bound to lower both expert participation and trust in the aggregation process.

I think this is perhaps the key point. Currently aggregation tends to aggregate data and not provenance. Pretty much every taxonomic name has at one point or another been published by somebody. For various reasons (including the crappy way most nomenclature databases cite the scientific literature), by the time these names are assembled into a classification by GBIF they have virtually no connection to the primary literature, which means that both the research that led to each name being minted and the people who contributed it are lost. Arguably GBIF is missing an opportunity to make taxonomic and phylogenetic research more visible and discoverable (I'd argue this is a better approach than Quixotic efforts to get all biologists to always cite the primary taxonomic literature).

Franz and Sterner's article is a well-argued and sophisticated assessment of a relationship that isn't working the way it could. But to talk in terms of "power balance" strikes me as miscasting the debate. Would it not be better to think about aligning goals (assuming that is possible)? What do experts want to achieve? What do they need to achieve those goals? Is it things such as access to specimens, data, literature, sequences? Visibility for their research? Demonstrable impact? Credit? What are the impediments? What, if anything, can GBIF and other aggregators do to help? In what way can facilitating the work of experts help GBIF?

In my own "early-stage systematist career" I had a conversation with Mark Hafner about the Louisiana State University Museum providing tissue samples for molecular sequencing, essentially a "project in a box". Although Mark was complaining about the lack of credit for this (a familiar theme), the thing which struck me was how wonderful it would be to have such a service: here's everything you need to do your work, go do some science. What if GBIF could do the same? Are you interested in this taxonomic group? Well, here's the complete sum of what we know so far: specimens, literature, DNA sequences, taxonomic names, the works. Wouldn't that be useful?

Franz and Sterner call for "both the systematics and the aggregator communities to reengage with each other". I would echo this. I think that the sometimes dysfunctional relationship between experts and aggregators is partly due to the failure to build a community of researchers around GBIF and its activities. The focus of GBIF's relationship with the scientific community has been to have a committee of advisers, which is a rather traditional and limited approach ("you're a scientist, tell us what scientists want"). It might be better served if it provided a forum for researchers to interact with GBIF, data providers, and each other.

I started this blog (iPhylo) years ago to vent my frustrations about TreeBASE. At the time I was fond of a quote from a philosopher of science that I was reading, to the effect that we only criticise those things that we care about. I take Franz and Sterner's article to indicate that they care about GBIF quite a bit ;). I'm looking forward to more critical discussion about how we can reconcile the needs of experts and aggregators as we seek to make global biodiversity data both open and useful.