Monday, April 20, 2020

Making sense of how Wikidata models taxonomy

Given my renewed enthusiasm for Wikidata, I'm trying to get my head around the way that Wikidata models biological taxonomy. As a first pass, here's a diagram of the properties linked to a taxonomic name. The model is fairly comprehensive, it includes relationships between names (e.g, basionym, protonym, replacement), between taxa (e.g., parent taxon), and links to the literature. It's also a complex model to query, given that a lot of information is expressed using qualifiers. Hence there's a bit of head scratching while I figure out the relationship between properties, statements, etc.

Links to the literature is one of my interests, can in cases where Wikidata has this information you can start to enhance the way we display publications, e.g.

The Wikidata model is very like that used in Darwin Core, where everything is a taxon and every taxon has a name, which means that relationships that are notionally between names and not taxa (e.g., basionym) are all treated as relationships between taxa.

One big challenge is how to interpret Wikidata as a classification, given that we expect classifications to be trees. The taxonomic classification in Wikidata is clearly not a tree, for example:

What I think is happening here is that different people are adding different parent taxa, depending on which classification they follow. Some classifications (e.g., that used by GBIF) are "shallow" with only a few levels (e.g., kingdom, phylum, class, order, family, genus), other classifications are deep (e.g., NCBI). So the idea of simply being able to do a SPARQL query and get a tree (e.g. Displaying taxonomic classifications from Wikidata using d3js and SPARQL) runs into problems. But this could also be a strength, particularly if we had a reference or source for each parent child pair. That way we could (a) store multiple classifications in Wikidata, and (b) have queries that retreive classifications according to a particular source (e.g., GBIF).

So, lots of potential, but lots I've still to learn.

Friday, April 17, 2020

A planetary computer for Earth

Came across Microsoft's announcement of a "A planetary computer for a sustainable future through the power of AI", complete with a glossy video featuring Lucas Joppa @lucasjoppa (see also @Microsoft_Green and #AIforEarth).

On the one hand it's great to see super smart people with lots of resources tackling important questions, but it's hard not to escape the feeling that this is the classic technology company approach of framing difficult problems in ways that match the solutions they have to offer. Is the reason that biodiversity is declining simply because we have lacked computational resources, that our AI isn't good enough? And while forests that have been stripped of both their mega fauna and previous human inhabitants make for photogenic backdrops, biodiversity can be a lot messier (and dangerous). Still, it will be interesting to see how this plays out, and what sort of problems the planetary computer is used to tackle.

Monday, April 13, 2020

Wikidata and the bibliography of life in the time of coronavirus

I haven't posted on iPhylo for a while, and since my last post back in January things have obviously changed quite a bit. In late January and early February I was teaching a course on biodiversity informatics, and students discovered the John Hopkins coronavirus dashboard, which seemed like a cool way to display information on a situation that was happening on the other side of the world. All fairly abstract.

Today the dashboard looks rather different, and things are no longer quite so abstract (and, of course, never were for the population of Wuhan).

At the same time as the pandemic is affecting so many lives (and taking those of people who had a big impact on my childhood), there is the extraordinary case of open access champion Jon Tennant (@protohedgehog). On April 8th I received an item from his email newsletter entitled Converting adversity into productivity, detailing how he'd managed to get through a traumatic period prior to corona virus, and how productive he had managed to be (his email lists a whole slew of articles he'd written). The next day, this:

The day before, this happened:

Times like this tend to focus the mind, and for anyone with research skills the question arises "what should I be doing?". Some people are addressing issues directly or indirectly relate to the pandemic. It feels like every second post on Medium features someone playing data scientist with coronavirus data. Others are taking existing tools and projects and looking for ways to make them relevant to the problem, such as Plazi and Pensoft seeking to improve access to the biology of corona virus hosts, as part of their broader mission to make biodiversity information more accessible.

Another approach, in some ways what Jon Tennant did, is to use the time to focus on what you think matters and work on that. Of course, this assumes that you are fortunate enough to have the time and resources to do that. I have tenure and my children are grown up, life would be very different without a salary or with small children or other dependents.

One of the things I am increasingly focussing on is the idea of Wikidata as the "bibliography of life". Specifically, I want to get as much taxonomic and related literature into Wikidata, and want to link as much of that to freely-available versions of that literature (e.g., on Internet Archive), I want that literature embedded in the citation graph, linked to authors, and linked to the taxa treated in those papers. A lot of literature is already going into Wikidata via bots that consume the stream of papers with CrossRef DOIs and upload their details to Wikidata, but there is a huge corpus of literature that this approach overlooks. Not only do we have Digital libraries like the Biodiversity Heritage Library and JSTOR, but there is a long tail of small publishers making taxonomic literature available online, and I want this to all be equally discoverable.

One aspect of this project is to populate Wikidata with this missing literature. Over the years as part of projects such as BioNames and BioStor I have accumulated hundreds of thousands of bibliographic references. These aren't much use sitting on my hard drive. Adding them to Wikidata makes them more accessible, and also enables others to make them much richer. For example, the irrepressible @SiobhanLeachman regularly converts author strings to author things:

Adding things to Wikidata is fun, but it can be a struggle to get a sense of what is in Wikidata and how it is interconnected. So I've started to build a simple app that helps show me people, publications, journals, and taxa in a fairly conventional way, all powered by Wikidata. The app is live at It is not going to win any prizes for performance or design, but I find it useful.

Partly I'm trying to make the original articles more accessible, e.g.:

I'm keen to link taxonomists to their publications and ultimately the taxa they work on:

And we can link taxa and publications visually:

The community-based, somewhat chaotic consensus-driven approach of Wikidata can be frustrating ("well, if you'd asked ME, I wouldn't have done it that way"), but I think it's time to accept that this is simply the nature of the beast, and marvel at the notion that we have a globally accessible and editable knowledge graph. We can stay in our domain-specific silos, where we can control things but remain bereft of both users and contributors. However if we are willing to let go of that control, and accept that things won't always be done the way we think would be optimal, there is a lot of freedom to be gained by deferring to Wikidata's community decisions and simply getting on with building the bibliography of life. Maybe that is something worthwhile to do in this time of coronavirus.

Monday, March 23, 2020

Darwin Core Million promo: best and worst

Bob mesibovThe following is a guest post by Bob Mesibov.
There's still time (to 31 March) to enter a dataset in the 2020 Darwin Core Million, and by way of encouragement I'll celebrate here the best and worst Darwin Core datasets I've seen.
The two best are real stand-outs because both are collections of IPT resources rather than one-off wonders.

The first is published by the Peabody Museum of Natural History at Yale University. Their IPT website has 10 occurrence datasets totalling ca 1.6M records updated daily, and I've only found minor data issues in the Peabody offerings. A recent sample audit of the 151,138 records with 70 populated Darwin Core fields in the botany dataset (as of 2020-03-18) showed refreshingly clean data:
  • entries correctly assigned to DwC fields
  • no missing-but-expected entry gaps
  • consistent, widely accepted vocabularies and formatting in DwC fields
  • no duplicate records
  • no character encoding errors
  • no gremlin characters
  • no excess whitespace or fancy alternatives to simple ASCII characters
The dataset isn't perfect and occurrenceRemarks entries are truncated at 254 characters, but other errors are scarce and easily fixed, such as
  • 14 records with plant taxa mis-classified as animals
  • 4 records with dateIdentified earlier than eventDate
  • minor pseudo-duplication in several fields, e.g. "Anna Murray Vail; Elizabeth G. Britton" and "Anne Murray Vail; Elizabeth G. Britton" in recordedBy
  • minor content errors in some entries, e.g. "tissue frozen; tissue frozen" and "|" (with no other characters in the entry).
I doubt if it would take more than an hour to fix all the Peabody Museum issues besides the truncation one, which for an IPT dataset with 10.5M data items is outstanding. There are even fields in which the Museum has gone beyond what most data users would expect. Entries in vernacularName, for example, are semicolon-separated hierarchies of common names: "dwarf snapdragon; angiosperms; tracheophytes; plants" for Chaenorhinum minus.

The second IPT resource worth commending comes from GBIF Portugal and consists of 108 checklist, occurrence record and sampling event datasets. As with the Peabody resource, the datasets are consistently clean with only minor (and scattered) structural, format or content issues.

The problems appearing most often in these datasets are "double-encoding" errors with Portugese words and no-break spaces in place of plain spaces, and for both of these we can probably blame the use of Windows programs (like Excel) at the contributing institutions. An example of double-encoding: the Portugese "prôximo" is first encoded in UTF-8 as a 2-byte character, then read by a Windows program as two separate bytes, then converted back to UTF-8, resulting in the gibberish "prôximo". A large proportion of the no-break spaces in the Portugese datasets unfortunately occur in taxon name strings, which don't parse correctly and which GBIF won't taxon-match.

And the worst dataset? I've seen some pretty dreadful examples from around the world, but the UK's Natural History Museum sits at the top of my list of delinquent providers. The NHM offers several million records and a disappointingly high proportion of these have very serious data quality problems. These include invalid and inappropriate entries, disagreements between fields and missing-but-expected blanks.

Ironically, the NHM's data portal allows the visitor to select and examine/download records with any one of a number of GBIF issues, like "taxon_match_none". Further, for each record the data portal reports "GBIF quality indicators", as shown in this screenshot:

Clicking on that indicator box gives the portal visitor a list of the things that GBIF found wrong with the record (a list that overlaps incompletely with the list I can find with a data audit). I'm sure the NHM sees this facility differently, but to me it nicely demonstrates that NHM has prioritised Web development over data management. The message I get is
"We know there's a lot wrong with our data, but we're not going to fix anything. Instead, we're going to hand our mess as-is to any data users out there, with cleverly designed pointers to our many failures. Suck it up, people."
In isolation NHM might be seen as doing what it can with the resources it has. In a broader context the publication of multitudes of defective records by NHM is scandalous. Institutions with smaller budgets and fewer staff do a lot better with their data — see above.


If your institution is closed and you have spare work-from-home time, consider doing some data cleaning. For those not afraid of the command line, I've archived the websites A Data Cleaner's Cookbook (version 2) and its companion blog BASHing data (first 100 posts) in Zenodo with local links between the two, so that the two resources can be downloaded and used offline in any Web browser.

Tuesday, March 03, 2020

The 2020 Darwin Core Million

Bob mesibovThe following is a guest post by Bob Mesibov.

You're feeling pretty good about your institution's collections data. After carefully tucking all the data items into their correct Darwin Core fields, you uploaded the occurrence records to GBIF, the Atlas of Living Australia (ALA) or another aggregator, and you got back a great report:

  • all your scientific names were in the aggregator's taxonomic backbone
  • all your coordinates were in the countries you said they were
  • all your dates were OK (and in ISO 8601 format!)
  • all your recorders and identifiers were properly named
  • no key data items were missing

OK, ready for the next challenge for your data? Ready for the 2020 Darwin Core Million?

How it works

From the dataset you uploaded to the aggregator, select about one million data items. That could be, say, 50000 records in 20 populated Darwin Core fields, or 20000 records in 50 populated Darwin Core fields, or something in between. Send me the data for auditing before 31 March 2020 as a zipped plain-text file by email to, together with a DOI or other identifier for their online, aggregated presence.

I'll audit datasets in the order I receive them. If I can't any find data quality problems in your dataset, I'll pay your institution AUD$150 and declare your institution the winner of the 2020 Darwin Core Million here on iPhylo. (One winner only; datasets received after the first problem-free dataset won't be checked.)

If I find data quality problems, I'll let you know by email. If you want to learn what the problems are, I'll send you a report detailing what should be fixed and you'll pay me AUD$150. At 0.3-0.75c/record, that's a bargain compared to commercial data-checking rates. And it would be really good to hear, later on, that those problems had indeed been fixed and corrected data had been uploaded to the aggregator.

What I look for

For a list of data quality problems, see this page in my Data Cleaner's Cookbook. The key problems are:

  • duplicate records
  • invalid data items
  • data items in the wrong fields
  • data items inappropriate for their field
  • truncated data items
  • records with items in one field disagreeing with items in another
  • character encoding errors
  • wildly erroneous dates or coordinates
  • incorrect or inconsistent formatting of dates, names and other data
  • items

If you think some of this is just nit-picking, you're probably thinking of your data items as things for humans to read and interpret. But these are digital data items intended for parsing and managing by computers. "Western Hill" might not be the same as "Western Hill" in processing, for example, because the second item might have a no-break space between the words instead of a plain space. Another example: humans see these 22 variations on collector names as "the same", but computers don't.

You might also be thinking that data quality is all about data correctness. Is Western Hill really at those coordinates? Is the specimen ID correct? Is the barely legible collector name on the specimen label correctly interpreted? But it's possible to have entirely correct digital data that can't be processed by an application, or moved between applications, because the data suffer from one or more of the problems listed above.

I think my money is safe

The problems I look for are all easily found and fixed. However, as mentioned in a previous iPhylo post, the quality of the many institutional datasets that I've sample-audited ranges from mostly OK to pretty awful. I've also audited more than 100 datasets (many with multiple data tables) for Pensoft Publishers, and the occurrence records among them were never error-free. Some of those errors had vanished when the records had been uploaded to GBIF, because GBIF simply deleted the offending data items during processing (GBIF, bless 'em, also publish the original data items).

Neither institutions nor aggregators seem to treat occurrence records with the same regard for detail that you find in real scientific data, the kind that appear in tables in scientific journal articles. A comparison with enterprise data is even more discouraging. I'm not aware of any large museum or herbarium with a Curator of Data on the payroll, probably because no institution's income depends on the quality of the institution's data, and because collection records don't get audited the way company records do, for tax, insurance and good-governance purposes.

So there might be a winner this year, but I doubt it. Maybe next year. ALA has a year-long data quality project underway, and GBIF Executive Secretary Joe Miller (in litt.) says that GBIF is now paying closer attention to data quality. The 2021 Darwin Core Million prize could be yours...

Wednesday, January 08, 2020

ORCID serves linked data via content negotiation - who knew?

Just a note that ORCID serves data using terms from, and has done for a while (since April 2018), but somehow I missed this.

You can get linked data in JSON-LD using content negotiation. If we send a request to with "Accept: application/ld+json" we get back something like this:

{ "@context": "", "@type": "Person", "@id": "", "mainEntityOfPage": "", "givenName": "Mark", "familyName": "Hughes", "affiliation": { "@type": "Organization", "name": "Royal Botanic Garden Edinburgh", "identifier": { "@type": "PropertyValue", "propertyID": "RINGGOLD", "value": "41803" } }, "@reverse": {}, "url": [ "", "" ], "identifier": [ { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "845425" }, { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "826408" } ] }

This is the profile for Mark Hughes (0000-0002-2168-0514). Up until now I've been generating my own linked data version of ORCID records that look very similar to this, but going forward this will simplify life.

Note that I've truncated the above example, it's actually this:

{ "@context": "", "@type": "Person", "@id": "", "mainEntityOfPage": "", "givenName": "Mark", "familyName": "Hughes", "affiliation": { "@type": "Organization", "name": "Royal Botanic Garden Edinburgh", "identifier": { "@type": "PropertyValue", "propertyID": "RINGGOLD", "value": "41803" } }, "@reverse": { "creator": [ { "@type": "CreativeWork", "@id": "", "name": "BEGONIA MAGUNIANA (BEGONIACEAE, BEGONIA SECT. OLIGANDRAE), A NEW SPECIES FROM NEW GUINEA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428619000283" } }, { "@type": "CreativeWork", "@id": "", "name": "A revision of Begonia sect. Petermannia on Sumatra, Indonesia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.407.1.11" } }, { "@type": "CreativeWork", "@id": "", "name": "Two new species of Begonia (Begoniaceae) from Borneo", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.407.1.4" } }, { "@type": "CreativeWork", "@id": "", "name": "AN UPDATED CHECKLIST AND A NEW SPECIES OF BEGONIA (B. RHEOPHYTICA) FROM MYANMAR", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428619000052" } }, { "@type": "CreativeWork", "@id": "", "name": "Chloroplast and nuclear DNA exchanges among Begonia sect. Baryandra species (Begoniaceae) from Palawan Island, Philippines, and descriptions of five new species.", "identifier": [ { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1371/journal.pone.0194877" }, { "@type": "PropertyValue", "propertyID": "pmc", "value": "PMC5931476" }, { "@type": "PropertyValue", "propertyID": "pmid", "value": "29718922" } ], "sameAs": [ "", "" ] }, { "@type": "CreativeWork", "@id": "", "name": "TWO NEW SPECIES OF BEGONIA FROM SUMATRA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428618000136" } }, { "@type": "CreativeWork", "@id": "", "name": "A REVISION OF BEGONIA SECT. SYMBEGONIA ON NEW GUINEA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s096042861800001x" } }, { "@type": "CreativeWork", "@id": "", "name": "A revision and one new species of Begonia L. (Begoniaceae, Cucurbitales) in Northeast India", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2018.396" } }, { "@type": "CreativeWork", "@id": "", "name": "Pliocene intercontinental dispersal from Africa to Southeast Asia highlighted by the new species Begonia afromigrata (Begoniaceae)", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1002/tax.606013" } }, { "@type": "CreativeWork", "@id": "", "name": "Taxonomic notes on the Philippine endemic Begonia colorata (Begoniaceae, section Petermannia)", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.381.1.16" } }, { "@type": "CreativeWork", "@id": "", "name": "Three new species of Begonia sect. Baryandra from Panay Island, Philippines.", "identifier": [ { "@type": "PropertyValue", "propertyID": "pmid", "value": "28664395" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/s40529-017-0182-x" }, { "@type": "PropertyValue", "propertyID": "pmc", "value": "PMC5491425" } ], "sameAs": [ "", "" ] }, { "@type": "CreativeWork", "@id": "", "name": "A new species of Begonia section Parvibegonia (Begoniaceae) from Thailand and Myanmar", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3767/000651917x695083" } }, { "@type": "CreativeWork", "@id": "", "name": "TAXONOMY OF BEGONIA ALBOMACULATA AND DESCRIPTION OF TWO NEW SPECIES ENDEMIC TO PERU", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428617000075" } }, { "@type": "CreativeWork", "@id": "", "name": "FOUR NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM THAILAND", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428617000051" } }, { "@type": "CreativeWork", "@id": "", "name": "A new species and a new record in Begonia sect. Platycentrum (Begoniaceae) from Thailand", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3850/s2382581216000077" } }, { "@type": "CreativeWork", "@id": "", "name": "Begonia yapenensis (sect. Symbegonia, Begoniaceae), a new species from Papua, Indonesia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2015.119" } }, { "@type": "CreativeWork", "@id": "", "name": "A new section (Begonia sect. Oligandrae sect. nov.) and a new species (Begonia pentandra sp. nov.) in Begoniaceae from New Guinea", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.197.1.4" } }, { "@type": "CreativeWork", "@id": "", "name": "Further discoveries in the ever-expanding genus Begonia (Begoniaceae): fifteen new species from Sumatra", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2015.167" } }, { "@type": "CreativeWork", "@id": "", "name": "Three New Species of Begonia Endemic to the Puerto Princesa Subterranean River National Park, Palawan", "identifier": [ { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201507-201507290029-201507290029-c1-14" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/s40529-015-0099-1" } ], "sameAs": "" }, { "@type": "CreativeWork", "@id": "", "name": "Memecylon pseudomegacarpum M.Hughes (Melastomataceae), a new species of tree from Peninsular Malaysia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2013.56" } }, { "@type": "CreativeWork", "@id": "", "name": "Recircumscription of Begonia sect. Baryandra (Begoniaceae): evidence from molecular data", "identifier": [ { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201309-201401170003-201401170003-70-74" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/1999-3110-54-38" } ], "sameAs": "" }, { "@type": "CreativeWork", "@id": "", "name": "A new species and new combinations of Memecylon in Thailand and Peninsular Malaysia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.66.1.2" } }, { "@type": "CreativeWork", "@id": "", "name": "A NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM PENINSULAR THAILAND", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428612000078" } }, { "@type": "CreativeWork", "@id": "", "name": "Pollen morphology of Begonia L. (Begoniaceae) in Nepal", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3329/bjpt.v19i2.13134" } }, { "@type": "CreativeWork", "@id": "", "name": "West to east dispersal and subsequent rapid diversification of the mega-diverse genus Begonia (Begoniaceae) in the Malesian archipelago", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1111/j.1365-2699.2011.02596.x" } }, { "@type": "CreativeWork", "@id": "", "name": "NINE NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SOUTH AND WEST SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428611000072" } }, { "@type": "CreativeWork", "name": "Begonia blancii (sect. Diploclinium, Begoniaceae), A New Species Endemic to the Philippine Island of Palawan", "identifier": { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201104-201106150032-201106150032-203-209" }, "sameAs": "" }, { "@type": "CreativeWork", "@id": "", "name": "BEGONIA SECTION PETERMANNIA (BEGONIACEAE) ON PALAWAN (PHILIPPINES), INCLUDING TWO NEW SPECIES", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005307" } }, { "@type": "CreativeWork", "@id": "", "name": "TWO NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SOUTH SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005484" } }, { "@type": "CreativeWork", "@id": "", "name": "TWO NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM CENTRAL SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005320" } }, { "@type": "CreativeWork", "@id": "", "name": "BEGONIA VARIPELTATA(BEGONIACEAE): A NEW PELTATE SPECIES FROM SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s096042860800509x" } }, { "@type": "CreativeWork", "@id": "", "name": "BEGONIA CLADOTRICHA (BEGONIACEAE): A NEW SPECIES FROM LAOS", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428607000777" } }, { "@type": "CreativeWork", "@id": "", "name": "FOUR NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SULAWESI", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428606000588" } }, { "@type": "CreativeWork", "@id": "", "name": "A Phylogeny of Begonia Using Nuclear Ribosomal Sequence Data and Morphological Characters", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1600/0363644054782297" } }, { "@type": "CreativeWork", "@id": "", "name": "Population genetic structure in the endemic Begonia of the Socotra archipelago", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1016/s0006-3207(02)00375-0" } }, { "@type": "CreativeWork", "@id": "", "name": "A NEW ENDEMIC SPECIES OF BEGONIA (BEGONIACEAE) FROM THE SOCOTRA ARCHIPELAGO", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428602000082" } }, { "@type": "CreativeWork", "@id": "", "name": "Isolation of polymorphic microsatellite markers for Begonia sutherlandii Hook. f.", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1046/j.1471-8286.2002.00201.x" } }, { "@type": "CreativeWork", "@id": "", "name": "Polymorphic microsatellite markers for the Socotran endemic herb Begonia socotrana", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1046/j.1471-8286.2002.00185.x" } }, { "@type": "CreativeWork", "@id": "", "name": "Distribution Patterns of Begonia species in the Nepal Himalaya", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3126/botor.v7i0.4386" } } ] }, "url": [ "", "" ], "identifier": [ { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "845425" }, { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "826408" } ] }

This gives us a list of Mark's publications from ORCID. If there aren't any publications listed, the @reverse property is empty. Note that @reverse is a JSON-LD trick that enables the JSON-LD document to include not only things linked from Mark's ORCID id (e.g., his name and affiliation) but also things his ORCID id is linked to (e.g., that he is the author of works such as

I will still be generating my own linked data from ORCID for now as I rely on knowing the order of authorship for some of my work (e.g., "Reconciling author names in taxonomic and publication databases", and I want to be able to further process ORCID data (e.g., looking for missing DOIs), but the fact that ORCID are making JSON-LD available is going to simplify a lot of data integration tasks in the future.

Friday, December 20, 2019

GBIF metagenomics and metacrap

Yes, this is a clickbait headline, and yes, it may seem like shooting fish in a barrel to complain about crappy data in GBIF, but my point here is raise concerns about the impact of metagenomic data on GBIF, and how difficult it may be to track down the causes of errors.

I stumbled across this example while looking for specimen records for the genus Rafflesia, which are parasitic plants famous for the spectacular size of their flowers (up to 1m across).

The GBIF map for Rafflesia shows a few outliers. Unfortunately GBIF doesn't make it easy to drill down (why oh why can't we just click on the map and see the corresponding occurrences?) so I opened the map in iSpecies and clicked on each outlier in turn. The one in Vanuatu (438164267 from the Paris museum P00577336) is identified to genus level only and has the note:
Parasite terrestre, grande fleur orange au ras du sol. Incomplète suite à prédation. Très forte odeur désagréable. Récolté par Sylvain Hugel (photo) (alcool seul)
which Google translates as:
Terrestrial parasite, large orange flower at ground level. Incomplete due to predation. Very strong unpleasant odor. Collected by Sylvain Hugel (photo) (alcohol only)
Sounds a bit like Rafflesia but there's no photo or other information available online. Likewise there's no additional data for the record from Brazil (1090499968). There is a record from Madagascar that is accompanied by a photo, but it that looks nothing like Rafflesia (1261055923):

That leaves two records, 2018528337 (Rafflesia cantleyi) from off the coast of Peru, and 2014813273 (Rafflesia) off the coast of Australia. Both of these records are metagenomic. For example, occurrence 2018528337 is part of a dataset Amplicon sequencing of Tara Oceans DNA samples corresponding to size fractions for protists that, on the face of it, would be an unlikely source of occurrences of forest dwelling plants.

What we get in the GBIF occurrence record is a link to the pipeline used to generate the data (Pipeline version 4.1 - 17-Jan-2018), the sample (ERS491947), and an analysis (MGYA00167469) that summarises all the taxonomic data from the ocean water sample.

What we don't get in GBIF is an obvious way to try and figure out why GBIF thinks that large flowers live in the ocean. I followed the links from MGYA00167469 and downloaded a bunch of files, some in familiar formats (FASTA), others in formats I'd not seen before (e.g., HDF5). From the mapseq file we have the following line:

ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1 GFBU01000011.4303.6106 76 0.9634146094322205 79 3 0 6 88 1722 1804 +  sk__Eukaryota;k__Viridiplantae;p__Streptophyta;c__;o__Malpighiales;f__Rafflesiaceae;g__Rafflesia;s__Rafflesia_cantleyi 

This tells us that sequence ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1 matches GenBank sequence GFBU01000011, which is "Rafflesia cantleyi RC_11 transcribed RNA sequence" from a paper on flower development in Rafflesia cantleyi doi:10.1371/journal.pone.0167958. So, now we see why we think we have giant flowers off the coast of Peru.

The rest of the line has information on the match: the oceanic sequence has a 0.96 identity with the plant sequence, has 79 matches, 3 mismatches, and no gaps, which suggests that this is a short sequence. Going digging in the FASTA file I found the raw sequence, and it is indeed very short:


This short string is the evidence for Rafflesia in the ocean. Out of curiosity I ran this sequence through BLAST:

             |||||||||||||| |||||||||||||||| ||||||||||||||| ||||||||||||


The top hit is Bathycoccus prasinos, a picoplanktonic alga with a world-wide distribution. This seems like a more plausible identification for this sequence (all the top 100 hits are very similar to each other, many are labelled as Bathycoccus).

So, there's something clearly amiss with the analysis of this dataset. Someone who knows more about metagenomics than I do will be better placed to explain why this pipeline got it so wrong, and how common this issue is.

Given the scale and automation of metagenomics, there will always be errors - that is inevitable. What we need is ways to catch those errors, especially ones that are going to "pollute" existing distribution data with spurious records (flowers in the ocean). And in a sense, GBIF excels at this in that it exposes data to a wider audience. If you work on marine microbiology you might not notice that your sequences apparently include forest plants, but if you work on forest plants you will almost certainly notice sequences occurring in the ocean.

A key feature of GBIF that makes it so useful is that, unlike many data repositories, it does not treat data as a "black box". GBIF is not like a library catalogue which merely tells you that they have books and where to find them, instead it is like Google Books, which can tell you which books contain a given phrase you are looking for. By opening up each dataset and indexing the contents, GBIF enables us to do data analysis (in much the same way that GenBank isn't just a catalogue of sequences, it enables you to search the sequences themselves).

This is a feature we risk losing if we treat metagenomics data as a black box. The Tara Oceans data that GBIF receives is simply a list of taxa at a locality, it's a checklist. We have to take it on trust that the taxonomic assignments are accurate, and it is not a trivial task to diagnose errors. Compare this to having the photo that accompanied the record from Madagascar, which helps us determine that the identification is wrong. Going forward, it would be helpful if we had metagenomic sequences available as part of the data download from GBIF. It's also worth considering whether GBIF should start doing its own analysis of sequence data, or asking its contributors to check that their taxonomic assignments are correct (e.g., running BLAST on the data). Otherwise GBIF users may end up having to filter their data for a growing number of completely spurious records.


Looks like (occurrence 1261055923) is Langsdorffia: