Wednesday, January 08, 2020

ORCID serves schema.org linked data via content negotiation - who knew?

Just a note that ORCID serves data using terms from schema.org, and has done for a while (since April 2018), but somehow I missed this.

You can get linked data in JSON-LD using content negotiation. If we send a request to https://orcid.org/0000-0002-2168-0514 with "Accept: application/ld+json" we get back something like this:

{ "@context": "http://schema.org", "@type": "Person", "@id": "https://orcid.org/0000-0002-2168-0514", "mainEntityOfPage": "https://orcid.org/0000-0002-2168-0514", "givenName": "Mark", "familyName": "Hughes", "affiliation": { "@type": "Organization", "name": "Royal Botanic Garden Edinburgh", "identifier": { "@type": "PropertyValue", "propertyID": "RINGGOLD", "value": "41803" } }, "@reverse": {}, "url": [ "https://www.rbge.org.uk/about-us/organisational-structure/staff/tropical-diversity/dr-mark-hughes/", "https://www.mendeley.com/profiles/mark-hughes/" ], "identifier": [ { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "845425" }, { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "826408" } ] }

This is the profile for Mark Hughes (0000-0002-2168-0514). Up until now I've been generating my own linked data version of ORCID records that look very similar to this, but going forward this will simplify life.
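As a rough sketch, the content negotiation request described above could be made from Python's standard library like this. The function names (`build_request`, `fetch_profile`) and the timeout are my own choices for illustration; only the URL pattern and the Accept header come from the example above.

```python
import json
import urllib.request

ORCID_BASE = "https://orcid.org/"


def build_request(orcid_id):
    """Build a request that asks ORCID for JSON-LD via the Accept header."""
    return urllib.request.Request(
        ORCID_BASE + orcid_id,
        headers={"Accept": "application/ld+json"},
    )


def fetch_profile(orcid_id, timeout=30):
    """Fetch and parse the JSON-LD profile (this call needs network access)."""
    with urllib.request.urlopen(build_request(orcid_id), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_profile("0000-0002-2168-0514")` should return a dictionary along the lines of the JSON shown above, assuming ORCID still responds to this header in the same way.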

Note that I've truncated the above example. The full response is:

{ "@context": "http://schema.org", "@type": "Person", "@id": "https://orcid.org/0000-0002-2168-0514", "mainEntityOfPage": "https://orcid.org/0000-0002-2168-0514", "givenName": "Mark", "familyName": "Hughes", "affiliation": { "@type": "Organization", "name": "Royal Botanic Garden Edinburgh", "identifier": { "@type": "PropertyValue", "propertyID": "RINGGOLD", "value": "41803" } }, "@reverse": { "creator": [ { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428619000283", "name": "BEGONIA MAGUNIANA (BEGONIACEAE, BEGONIA SECT. OLIGANDRAE), A NEW SPECIES FROM NEW GUINEA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428619000283" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.11646/phytotaxa.407.1.11", "name": "A revision of Begonia sect. Petermannia on Sumatra, Indonesia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.407.1.11" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.11646/phytotaxa.407.1.4", "name": "Two new species of Begonia (Begoniaceae) from Borneo", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.407.1.4" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428619000052", "name": "AN UPDATED CHECKLIST AND A NEW SPECIES OF BEGONIA (B. RHEOPHYTICA) FROM MYANMAR", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428619000052" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1371/journal.pone.0194877", "name": "Chloroplast and nuclear DNA exchanges among Begonia sect. 
Baryandra species (Begoniaceae) from Palawan Island, Philippines, and descriptions of five new species.", "identifier": [ { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1371/journal.pone.0194877" }, { "@type": "PropertyValue", "propertyID": "pmc", "value": "PMC5931476" }, { "@type": "PropertyValue", "propertyID": "pmid", "value": "29718922" } ], "sameAs": [ "https://europepmc.org/articles/PMC5931476", "https://www.ncbi.nlm.nih.gov/pubmed/29718922" ] }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428618000136", "name": "TWO NEW SPECIES OF BEGONIA FROM SUMATRA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428618000136" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s096042861800001x", "name": "A REVISION OF BEGONIA SECT. SYMBEGONIA ON NEW GUINEA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s096042861800001x" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.5852/ejt.2018.396", "name": "A revision and one new species of Begonia L. (Begoniaceae, Cucurbitales) in Northeast India", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2018.396" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1002/tax.606013", "name": "Pliocene intercontinental dispersal from Africa to Southeast Asia highlighted by the new species Begonia afromigrata (Begoniaceae)", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1002/tax.606013" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.11646/phytotaxa.381.1.16", "name": "Taxonomic notes on the Philippine endemic Begonia colorata (Begoniaceae, section Petermannia)", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.381.1.16" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1186/s40529-017-0182-x", "name": "Three new species of Begonia sect. 
Baryandra from Panay Island, Philippines.", "identifier": [ { "@type": "PropertyValue", "propertyID": "pmid", "value": "28664395" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/s40529-017-0182-x" }, { "@type": "PropertyValue", "propertyID": "pmc", "value": "PMC5491425" } ], "sameAs": [ "https://www.ncbi.nlm.nih.gov/pubmed/28664395", "https://europepmc.org/articles/PMC5491425" ] }, { "@type": "CreativeWork", "@id": "https://doi.org/10.3767/000651917x695083", "name": "A new species of Begonia section Parvibegonia (Begoniaceae) from Thailand and Myanmar", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3767/000651917x695083" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428617000075", "name": "TAXONOMY OF BEGONIA ALBOMACULATA AND DESCRIPTION OF TWO NEW SPECIES ENDEMIC TO PERU", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428617000075" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428617000051", "name": "FOUR NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM THAILAND", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428617000051" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.3850/s2382581216000077", "name": "A new species and a new record in Begonia sect. Platycentrum (Begoniaceae) from Thailand", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3850/s2382581216000077" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.5852/ejt.2015.119", "name": "Begonia yapenensis (sect. Symbegonia, Begoniaceae), a new species from Papua, Indonesia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2015.119" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.11646/phytotaxa.197.1.4", "name": "A new section (Begonia sect. Oligandrae sect. nov.) and a new species (Begonia pentandra sp. nov.) 
in Begoniaceae from New Guinea", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.197.1.4" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.5852/ejt.2015.167", "name": "Further discoveries in the ever-expanding genus Begonia (Begoniaceae): fifteen new species from Sumatra", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2015.167" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1186/s40529-015-0099-1", "name": "Three New Species of Begonia Endemic to the Puerto Princesa Subterranean River National Park, Palawan", "identifier": [ { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201507-201507290029-201507290029-c1-14" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/s40529-015-0099-1" } ], "sameAs": "http://www.airitilibrary.com/Publication/alDetailedMesh?DocID=1817406X-201507-201507290029-201507290029-c1-14" }, { "@type": "CreativeWork", "@id": "https://doi.org/10.5852/ejt.2013.56", "name": "Memecylon pseudomegacarpum M.Hughes (Melastomataceae), a new species of tree from Peninsular Malaysia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2013.56" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1186/1999-3110-54-38", "name": "Recircumscription of Begonia sect. 
Baryandra (Begoniaceae): evidence from molecular data", "identifier": [ { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201309-201401170003-201401170003-70-74" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/1999-3110-54-38" } ], "sameAs": "http://www.airitilibrary.com/Publication/alDetailedMesh?DocID=1817406X-201309-201401170003-201401170003-70-74" }, { "@type": "CreativeWork", "@id": "https://doi.org/10.11646/phytotaxa.66.1.2", "name": "A new species and new combinations of Memecylon in Thailand and Peninsular Malaysia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.66.1.2" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428612000078", "name": "A NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM PENINSULAR THAILAND", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428612000078" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.3329/bjpt.v19i2.13134", "name": "Pollen morphology of Begonia L. (Begoniaceae) in Nepal", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3329/bjpt.v19i2.13134" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1111/j.1365-2699.2011.02596.x", "name": "West to east dispersal and subsequent rapid diversification of the mega-diverse genus Begonia (Begoniaceae) in the Malesian archipelago", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1111/j.1365-2699.2011.02596.x" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428611000072", "name": "NINE NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SOUTH AND WEST SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428611000072" } }, { "@type": "CreativeWork", "name": "Begonia blancii (sect. 
Diploclinium, Begoniaceae), A New Species Endemic to the Philippine Island of Palawan", "identifier": { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201104-201106150032-201106150032-203-209" }, "sameAs": "http://www.airitilibrary.com/Publication/alDetailedMesh?DocID=1817406X-201104-201106150032-201106150032-203-209" }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428609005307", "name": "BEGONIA SECTION PETERMANNIA (BEGONIACEAE) ON PALAWAN (PHILIPPINES), INCLUDING TWO NEW SPECIES", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005307" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428609005484", "name": "TWO NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SOUTH SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005484" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428609005320", "name": "TWO NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM CENTRAL SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005320" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s096042860800509x", "name": "BEGONIA VARIPELTATA(BEGONIACEAE): A NEW PELTATE SPECIES FROM SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s096042860800509x" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428607000777", "name": "BEGONIA CLADOTRICHA (BEGONIACEAE): A NEW SPECIES FROM LAOS", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428607000777" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428606000588", "name": "FOUR NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SULAWESI", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428606000588" } }, { "@type": "CreativeWork", "@id": 
"https://doi.org/10.1600/0363644054782297", "name": "A Phylogeny of Begonia Using Nuclear Ribosomal Sequence Data and Morphological Characters", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1600/0363644054782297" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1016/s0006-3207(02)00375-0", "name": "Population genetic structure in the endemic Begonia of the Socotra archipelago", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1016/s0006-3207(02)00375-0" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1017/s0960428602000082", "name": "A NEW ENDEMIC SPECIES OF BEGONIA (BEGONIACEAE) FROM THE SOCOTRA ARCHIPELAGO", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428602000082" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1046/j.1471-8286.2002.00201.x", "name": "Isolation of polymorphic microsatellite markers for Begonia sutherlandii Hook. f.", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1046/j.1471-8286.2002.00201.x" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.1046/j.1471-8286.2002.00185.x", "name": "Polymorphic microsatellite markers for the Socotran endemic herb Begonia socotrana", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1046/j.1471-8286.2002.00185.x" } }, { "@type": "CreativeWork", "@id": "https://doi.org/10.3126/botor.v7i0.4386", "name": "Distribution Patterns of Begonia species in the Nepal Himalaya", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3126/botor.v7i0.4386" } } ] }, "url": [ "https://www.rbge.org.uk/about-us/organisational-structure/staff/tropical-diversity/dr-mark-hughes/", "https://www.mendeley.com/profiles/mark-hughes/" ], "identifier": [ { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "845425" }, { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "826408" } ] }

This gives us a list of Mark's publications from ORCID. If there aren't any publications listed, the @reverse property is empty. Note that @reverse is a JSON-LD trick that lets the document include not only statements where Mark's ORCID id is the subject (e.g., his name and affiliation) but also statements where it is the object (e.g., that the work https://doi.org/10.1017/s0960428619000283 has him as its creator).
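To make the shape of @reverse concrete, here is a minimal sketch that pulls (DOI, title) pairs out of a profile. The helper names are hypothetical, and the sample is a trimmed copy of the response above. The only assumption beyond that response is that "creator" may be a single object rather than a list, which is how "identifier" behaves in the same response.

```python
def as_list(value):
    """JSON-LD may give a single object or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def works_from_profile(profile):
    """Return (doi, title) pairs for works that name this person as creator."""
    works = []
    for work in as_list(profile.get("@reverse", {}).get("creator")):
        doi = None
        for ident in as_list(work.get("identifier")):
            if ident.get("propertyID") == "doi":
                doi = ident.get("value")
        works.append((doi, work.get("name")))
    return works


# Trimmed sample taken from the ORCID response shown above.
SAMPLE = {
    "@id": "https://orcid.org/0000-0002-2168-0514",
    "@reverse": {
        "creator": [
            {
                "@type": "CreativeWork",
                "@id": "https://doi.org/10.11646/phytotaxa.407.1.4",
                "name": "Two new species of Begonia (Begoniaceae) from Borneo",
                "identifier": {
                    "@type": "PropertyValue",
                    "propertyID": "doi",
                    "value": "10.11646/phytotaxa.407.1.4",
                },
            },
            {
                "@type": "CreativeWork",
                "@id": "https://doi.org/10.1186/1999-3110-54-38",
                "name": "Recircumscription of Begonia sect. Baryandra (Begoniaceae): evidence from molecular data",
                "identifier": [
                    {"@type": "PropertyValue", "propertyID": "other-id",
                     "value": "al:1817406x-201309-201401170003-201401170003-70-74"},
                    {"@type": "PropertyValue", "propertyID": "doi",
                     "value": "10.1186/1999-3110-54-38"},
                ],
            },
        ]
    },
}

for doi, title in works_from_profile(SAMPLE):
    print(doi, "|", title)
```

A work with no DOI (there is one in the full response that only has an "other-id") comes back with `None` in the first slot, which is one quick way to spot records that need a DOI added.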

I will still be generating my own linked data from ORCID for now as I rely on knowing the order of authorship for some of my work (e.g., "Reconciling author names in taxonomic and publication databases" https://doi.org/10.1101/870170), and I want to be able to further process ORCID data (e.g., looking for missing DOIs), but the fact that ORCID are making JSON-LD available is going to simplify a lot of data integration tasks in the future.

Friday, December 20, 2019

GBIF metagenomics and metacrap

Yes, this is a clickbait headline, and yes, it may seem like shooting fish in a barrel to complain about crappy data in GBIF, but my point here is to raise concerns about the impact of metagenomic data on GBIF, and how difficult it may be to track down the causes of errors.

I stumbled across this example while looking for specimen records for the genus Rafflesia, which are parasitic plants famous for the spectacular size of their flowers (up to 1m across).



The GBIF map for Rafflesia shows a few outliers. Unfortunately GBIF doesn't make it easy to drill down (why oh why can't we just click on the map and see the corresponding occurrences?) so I opened the map in iSpecies and clicked on each outlier in turn. The one in Vanuatu (438164267 from the Paris museum P00577336) is identified to genus level only and has the note:
Parasite terrestre, grande fleur orange au ras du sol. Incomplète suite à prédation. Très forte odeur désagréable. Récolté par Sylvain Hugel (photo) (alcool seul)
which Google translates as:
Terrestrial parasite, large orange flower at ground level. Incomplete due to predation. Very strong unpleasant odor. Collected by Sylvain Hugel (photo) (alcohol only)
Sounds a bit like Rafflesia but there's no photo or other information available online. Likewise there's no additional data for the record from Brazil (1090499968). There is a record from Madagascar that is accompanied by a photo, but it looks nothing like Rafflesia (1261055923):


That leaves two records, 2018528337 (Rafflesia cantleyi) from off the coast of Peru, and 2014813273 (Rafflesia) off the coast of Australia. Both of these records are metagenomic. For example, occurrence 2018528337 is part of the dataset "Amplicon sequencing of Tara Oceans DNA samples corresponding to size fractions for protists" which, on the face of it, would be an unlikely source of occurrences of forest-dwelling plants.

What we get in the GBIF occurrence record is a link to the pipeline used to generate the data (Pipeline version 4.1 - 17-Jan-2018), the sample (ERS491947), and an analysis (MGYA00167469) that summarises all the taxonomic data from the ocean water sample.



What we don't get in GBIF is an obvious way to try and figure out why GBIF thinks that large flowers live in the ocean. I followed the links from MGYA00167469 and downloaded a bunch of files, some in familiar formats (FASTA), others in formats I'd not seen before (e.g., HDF5). From the mapseq file we have the following line:

ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1 GFBU01000011.4303.6106 76 0.9634146094322205 79 3 0 6 88 1722 1804 +  sk__Eukaryota;k__Viridiplantae;p__Streptophyta;c__;o__Malpighiales;f__Rafflesiaceae;g__Rafflesia;s__Rafflesia_cantleyi 

This tells us that sequence ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1 matches GenBank sequence GFBU01000011, which is "Rafflesia cantleyi RC_11 transcribed RNA sequence" from a paper on flower development in Rafflesia cantleyi doi:10.1371/journal.pone.0167958. So, now we see why we think we have giant flowers off the coast of Peru.
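The mapseq line quoted above can be pulled apart with a few lines of Python. This is only a sketch: the column labels in `FIELDS` are my reading of the line (query, hit, and the match statistics), not names taken from the pipeline's documentation, so treat them as assumptions.

```python
MAPSEQ_LINE = (
    "ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1 "
    "GFBU01000011.4303.6106 76 0.9634146094322205 79 3 0 6 88 1722 1804 + "
    "sk__Eukaryota;k__Viridiplantae;p__Streptophyta;c__;o__Malpighiales;"
    "f__Rafflesiaceae;g__Rafflesia;s__Rafflesia_cantleyi"
)

# Assumed column labels, in the order the values appear on the line.
FIELDS = [
    "query", "hit", "bitscore", "identity", "matches", "mismatches", "gaps",
    "query_start", "query_end", "hit_start", "hit_end", "strand", "taxonomy",
]
INT_FIELDS = [
    "bitscore", "matches", "mismatches", "gaps",
    "query_start", "query_end", "hit_start", "hit_end",
]


def parse_mapseq_line(line):
    """Split one whitespace-separated mapseq line into a labelled record."""
    record = dict(zip(FIELDS, line.split()))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["identity"] = float(record["identity"])
    # Turn "g__Rafflesia;s__Rafflesia_cantleyi" into {"g": "Rafflesia", ...}.
    record["taxonomy"] = dict(
        part.split("__", 1) for part in record["taxonomy"].split(";")
    )
    return record


hit = parse_mapseq_line(MAPSEQ_LINE)
print(hit["hit"].split(".")[0], hit["taxonomy"]["s"], hit["identity"])
```

With these labels, the identity value on the line agrees with matches / (matches + mismatches + gaps) = 79 / 82, which is at least consistent with this reading of the columns.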

The rest of the line has information on the match: the oceanic sequence has a 0.96 identity with the plant sequence, has 79 matches, 3 mismatches, and no gaps, which suggests that this is a short sequence. Going digging in the FASTA file I found the raw sequence, and it is indeed very short:

>ERR562574.2494029-BISMUTH-0000:2:112:3465:9749-2/89-1
GTCTAAGTGTCGTGAGAAGTTCGTTGAACCTGATCATTTAGAGGAAGTAGAAGTCGTAAC
AAGGTTTCCGTAGGTGAACCTGCGGAAGG


This short string is the evidence for Rafflesia in the ocean. Out of curiosity I ran this sequence through BLAST:

Query  1     GTCTAAGTGTCGTGAGAAGTTCGTTGAACCTGATCATTTAGAGGAAGTAGAAGTCGTAAC  60
             |||||||||||||| |||||||||||||||| ||||||||||||||| ||||||||||||
Sbjct  1683  GTCTAAGTGTCGTGGGAAGTTCGTTGAACCTTATCATTTAGAGGAAGGAGAAGTCGTAAC  1742

Query  61    AAGGTTTCCGTAGGTGAACCTGCGGAAGG  89
             |||||||||||||||||||||||||||||
Sbjct  1743  AAGGTTTCCGTAGGTGAACCTGCGGAAGG  1771

The top hit is Bathycoccus prasinos, a picoplanktonic alga with a world-wide distribution. This seems like a more plausible identification for this sequence (all the top 100 hits are very similar to each other, many are labelled as Bathycoccus).
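For a sense of how close the BLAST match is, the alignment above can be scored directly. This sketch just compares the two aligned strings position by position (there are no gaps in this alignment, so a simple comparison is enough); the function name is hypothetical.

```python
QUERY = (
    "GTCTAAGTGTCGTGAGAAGTTCGTTGAACCTGATCATTTAGAGGAAGTAGAAGTCGTAAC"
    "AAGGTTTCCGTAGGTGAACCTGCGGAAGG"
)
SUBJECT = (
    "GTCTAAGTGTCGTGGGAAGTTCGTTGAACCTTATCATTTAGAGGAAGGAGAAGTCGTAAC"
    "AAGGTTTCCGTAGGTGAACCTGCGGAAGG"
)


def mismatch_positions(a, b):
    """Return 1-based positions where two equal-length aligned strings differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


mismatches = mismatch_positions(QUERY, SUBJECT)
identity = (len(QUERY) - len(mismatches)) / len(QUERY)
print(len(QUERY), mismatches, round(identity, 3))
```

The 89-base read differs from the subject sequence at three positions, so the read is about as similar to this hit as mapseq reported it to be to the Rafflesia sequence. With a fragment this short, an identity score on its own does little to separate a plausible source from an implausible one.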



So, there's something clearly amiss with the analysis of this dataset. Someone who knows more about metagenomics than I do will be better placed to explain why this pipeline got it so wrong, and how common this issue is.

Given the scale and automation of metagenomics, there will always be errors - that is inevitable. What we need are ways to catch those errors, especially ones that are going to "pollute" existing distribution data with spurious records (flowers in the ocean). And in a sense, GBIF excels at this in that it exposes data to a wider audience. If you work on marine microbiology you might not notice that your sequences apparently include forest plants, but if you work on forest plants you will almost certainly notice sequences occurring in the ocean.

A key feature of GBIF that makes it so useful is that, unlike many data repositories, it does not treat data as a "black box". GBIF is not like a library catalogue, which merely tells you that they have books and where to find them. Instead it is like Google Books, which can tell you which books contain a given phrase you are looking for. By opening up each dataset and indexing the contents, GBIF enables us to do data analysis (in much the same way that GenBank isn't just a catalogue of sequences, it enables you to search the sequences themselves).

This is a feature we risk losing if we treat metagenomics data as a black box. The Tara Oceans data that GBIF receives is simply a list of taxa at a locality: it's a checklist. We have to take it on trust that the taxonomic assignments are accurate, and it is not a trivial task to diagnose errors. Compare this to having the photo that accompanied the record from Madagascar, which helps us determine that the identification is wrong. Going forward, it would be helpful if we had metagenomic sequences available as part of the data download from GBIF. It's also worth considering whether GBIF should start doing its own analysis of sequence data, or asking its contributors to check that their taxonomic assignments are correct (e.g., running BLAST on the data). Otherwise GBIF users may end up having to filter their data for a growing number of completely spurious records.

Update

Looks like (occurrence 1261055923) is Langsdorffia:


Friday, December 13, 2019

The Semantic Web revisited: thoughts on SWAT4HCLS


This week I attended the SWAT4(HC)LS (Semantic Web Applications and Tools for Healthcare and Life Sciences) meeting in Edinburgh. Although a relatively small meeting, SWAT4(HC)LS attracts some big names in the field and featured keynotes by Denny Vrandečić (founder of Wikidata), Dov Greenbaum, Birgitta König-Ries, and Helen Parkinson.
For me this was a chance to get a sense of the state of the Semantic Web, and also to present a talk on biodiversity knowledge graphs. Given that this is a computer science meeting, you need to get a paper submitted and accepted in order to give a talk, so I hastily wrote up some notes on matching author names in taxonomic and bibliographic databases (there's a version of this on bioRxiv):
Page, R. D. M. (2019). Reconciling author names in taxonomic and publication databases. doi:10.1101/870170
Google the "Semantic Web" and pretty soon you discover that many people think it is dead (see Whatever Happened to the Semantic Web?). But it is still here, maybe partly because there is some ambiguity about just what it is. The 2003 paper "Which semantic web?" by Catherine C. Marshall and Frank M. Shipman (doi:10.1145/900051.900063) sketches three different Semantic Webs:


  1. a universal library, to be readily accessed and used by humans in a variety of information use contexts.
  2. the backdrop for the work of computational agents completing sophisticated activities on behalf of their human counterparts
  3. a method for federating particular knowledge bases and databases to perform
(1) is essentially what Google gives us, the ability to use a web browser to find stuff on the web, augmented by structured markup to help us do that (the "Library of Alexandria"). (2) is the idea of global ontologies, agents, and reasoning (the Knowledge Navigator), and (3) focusses on cross linking data in different databases (the "Federated Knowledge Base").

My own focus is very much in area (3): I want to link disconnected datasets together. Many of the presentations at SWAT4(HC)LS were more in area (2) and focussed on ontologies, especially medical. This is a world of big - not always open - ontologies, and lots of discussions about how to model data. In other words, what many people think of as the Semantic Web.

One of the nice things about the conference was the way people with posters got to give a lightning talk about their poster (I've seen this at VIZBI as well). I think this is a great idea and would love to see this at biodiversity conferences. The posters that I got the most out of were from the researchers at the DBCLS in Japan, such as TogoStanza (visualisations of SPARQL results), SPARQList (Markdown notebook for SPARQL), and Umaka Viewer (visualise classes in a SPARQL endpoint).

For fun I tried Umaka Viewer on my Ozymandias knowledge graph. You can see the results here.
It took about 30 minutes to generate the data for this visualisation, but it was fun to poke around at the internals of a knowledge graph that I had created. I discovered classes I'd forgotten I'd used!


As someone who spends a lot of time messing about with ways to collect, clean, and visualise data, it's no surprise that posters and presentations on tools for doing this are what I found most useful. The thing I find most appealing about the Semantic Web is the notion of having simple APIs that can query knowledge encoded in both web pages and databases (see also work by Franck Michel and colleagues on SPARQL Micro-Services, e.g. SPARQL Micro-Services Demo Page).

Tuesday, November 05, 2019

Thoughts on Biodiversity Next

It’s been a while since I’ve posted on iPhylo. Since returning from a fun and productive time in Australia there have been a bunch of professional and personal things that have needed attending to. In amongst all this I attended Biodiversity Next in Leiden, a large (by biodiversity informatics standards) conference with the tag line "Building a global infrastructure for biodiversity data. Together." In this post I try and bring together a few random thoughts on the conference (the Twitter hashtag #biodiversitynext gives a much broader sense of the conference).

Spectacle


The venue for the keynotes was delightful, and guest speakers were ushered on stage with a rock-star soundtrack (which, frankly, grated a bit). Some of the keynotes were essentially TED talks, such as Theo Jansen on his wonderful Strandbeest and Jalila Essaidi on bulletproof skin and other biotechnology. Interesting, polished, hopeful.



Some keynotes were pitches, such as Paul Hebert’s BIOSCAN where we divide the planet into a grid (of squares, really?) and sequence barcodes for everything within each grid cell. The theme was moving from “artisanal to industrial” scale. BIOSCAN has a rival, the Earth BioGenome Project (EBP) (see https://doi.org/10.1073/pnas.1720115115), which aims to sequence the whole genome of every eukaryote in 10 years (at a cost of $US 4.7 billion). BIOSCAN is rather cheaper, although Hebert sees it as the precursor to a larger initiative. But what makes BIOSCAN more appealing to me is that it includes an explicit geographical and ecological context. BIOSCAN is interested in what species occur where, and who they are interacting with (the “symbiome”). But not everyone is convinced that mega-genomics projects are a good idea, see for example Proposals to “sequence the DNA of all life on Earth” suffer from the same issues as “naming all the species” by @JeffOllerton.

Other keynotes that resonated with me were Maxwell Gomera’s where he points out that for many people biodiversity is a risk (including an anecdote about people in Namibia seeing biodiversity as attracting the unwanted attention of outside interests, and hence something to be actively minimised), and Jorge Soberon on just how much of biodiversity informatics is data driven and theory-free. The presentation by Ana María Hernández Salgar on IPBES was perhaps the least exciting keynote, arguably because she’s tackling a probably intractable problem. We have some spectacular technology for documenting and understanding biodiversity, but no obvious way to change or significantly influence human impacts on that biodiversity.

Optics


The conference managed to score a pretty spectacular own goal by having an all-white, all-male panel (“manel”) for one session (moderated by Ely Wallis @elyw).



There was a pointed response to this later in the conference (again moderated by Ely).



Personally I felt that neither panel contributed much beyond platitudes. I don’t think panel discussions like these do much to explore ideas; they are much more about appearances and positions (which makes the manel even more unfortunate).

There were other comments that were tone deaf. One senior figure argued that “money wasn’t a problem”, the implication being that there’s lots of it around, we just have to figure out how to access it. Yet, one of the sessions I attended featured a young researcher from Brazil who had to crowdfund his attendance at the conference. Money (or rather its uneven distribution) is very much a problem.

Infrastructure


The conference had its own app, and it worked well. It certainly made it easier to plan the day, which sadly was mostly realising that the two topics that you were interested in hearing about were on at the same time. Big conferences have this fundamental problem that there are too many people and too many talks for people to see everything. This makes the event more a statement about the community being large enough to stage such an event, than actually being a place to learn what is going on. But I guess the combination of breaks between sessions, social events, and the pre-conference workshops means there are times and spaces where things can actually get done.

Substance


There was a lot going on at the conference, so I am going to pick out just a few highlights for me. These are obviously very biased, and I missed a lot of the talks.

Cordra

The thing I was most interested to learn about was the technology underpinning DISSCO’s approach to putting specimen records online. Alex Hardisty (@AlexHardisty) gave a nice demo of DISSCO’s approach, which uses Cordra. From Cordra’s website:
Cordra is a highly configurable open source software offered to software developers for managing digital objects with resolvable identifiers at scale.
Cordra is from the Corporation for National Research Initiatives (CNRI), the people behind the Handle system which underpins DOIs. It's a NoSQL data store that can generate and manage persistent identifiers (e.g., Handles). I’ve not been following DISSCO closely, but this approach makes a lot of sense, and it will be interesting to see how it develops. Alex demoed a “digital specimen repository”; for example, the record for specimen BMNH:2006.12.6.40-41 is here: http://nsidr.org/#objects/20.5000.1025/486a7e883f14f88bba37. Early days, but digital identifiers for specimens are going to be crucial to efforts to interlink biodiversity data.

Knowledge graphs

I did my best to spread the knowledge graph meme, and Wikidata is attracting growing interest. Unfortunately I couldn’t see Franck Michel’s (@franck_michel2) talk on Bioschemas, but the idea of having light-weight markup for life science data is very attractive. It seems that long-standing dreams of linking things together are starting to slowly take shape.


Traits

This is an area that I have not thought much about. The Encyclopaedia of Life tried to carve out a niche in this area (TraitBank) but their latest iteration abandons the JSON-LD they developed in version 2.0, which seems a strategic blunder given the growth of interest in knowledge graphs, Bioschemas, and Wikidata. It seems that people working on traits are in a sort of pre-GBIF phase looking for ways to integrate diverse data into one or more places where people can play with it. There’s a lot of excitement here, but lots of data wrangling issues to deal with.

Credit and identity

The hashtag #citetheDOI became something of a rallying cry for those interested in GBIF data. Citing data downloaded from GBIF enables GBIF to pass information on usage along to data providers. Yet another example of the most compelling use case for identifiers not being scientific but cultural.

“Get yourself an ORCID” was another rallying cry. The challenge here is that the most obvious beneficiary of you getting an ORCID is not (yet) you, which makes the sales pitch a bit tricky.


People

It may be partly an age thing, but an increasingly important aspect of conferences like this is the chance to catch up with people you know, as well as develop new contacts and (hopefully) have your preconceptions challenged by people smarter than yourself. I spent quite a bit of time with the BHL crowd, which meant teasing them about their obsession with old books, which did not end well:

It was also fun to see Roger Hyam (@RogerHyam) in action again. Roger has a knack for cutting through the noise to make tools that are useful. He gave a nice demo of using the International Image Interoperability Framework (IIIF) to display herbarium images. Under the hood IIIF is JSON-LD and models everything as an annotation, so I think this framework is going to see a lot more use across a range of biodiversity projects. It certainly inspired me to add IIIF to my newly relaunched BioStor.


Agency


It's more a kind of pragmatic Archimedean sense that you might be able to move some subset of the world connected to any system on which you have root access or any project for which you're building a key component—from the leverage point of the command line. (The Emergence of Digital Humanities by Steven E. Jones, emphasis added)
One final thought which struck me is the notion of "agency", in the sense of a person being able to do things. For me one of the joys of biodiversity informatics is that I can make stuff that seems useful (if only to me). If, say, BHL ignores articles, well, you can grab their data and build something that finds those articles. If the data is available, and you can code, then there are few limits to what you can do. Even if you can't code, limits to what people can do are being removed. You have citizen scientists like @SiobhanLeachman (who presented at Biodiversity Next) revelling in the wealth of tools such as Wikipedia, Wikidata, etc. that enable her to add to biodiversity knowledge.

Do that at scale, as demonstrated by Carrie Seltzer's keynote on iNaturalist, and you can get millions of data points added by a passionate, empowered community.

Yet, I would find myself talking to biodiversity professionals working at some of the world's leading museums and herbaria, and they had far less agency than someone like Siobhan. They have no influence over the databases and software they use, even trivial changes aren't made because... reasons. Seemingly obvious suggestions of things that could be done, or offers of additional data are met with responses along the lines of "even if you gave us that data, we couldn't do anything with it because there's not a field in our database."

As a somewhat cranky, independent-minded academic, I greatly value the freedom to create things, and I'm extremely lucky that I can do that. But it is interesting to see that people fascinated by science but who are not employed as scientists often have more agency than the professional scientists. And maybe that's why I'm resistant to large conferences such as Biodiversity Next (and predecessors such as e-Biosphere 09). They represent the increasing professionalisation of the field, and with that often comes decreasing agency. When I grow up, I want to be a citizen scientist.

Wednesday, August 21, 2019

Ozymandias in Canberra

On Tuesday I was in Canberra to visit the Australian National Insect Collection at CSIRO and give a talk on knowledge graphs. David Yeates, who was a post doc at the AMNH at the same time I was (more years ago than I care to remember), played host and provided lots of stimulating conversation on the state of taxonomy and systematics, the Atlas of Living Australia (ALA), the bias against small organisms (see the wonderful essay A Dream of Invertebrate Utopia), and the joys of code compliance and modern publishing (see "Are taxonomic publications involving nomenclatural acts on Early View Code compliant?" https://doi.org/10.1111/aen.12372).


My talk discussed the Ozymandias knowledge graph, and also showcased the demos Nicole Kearney and I had put together to show the ways we think the ALA could be enhanced using knowledge graphs. One of these (linking names to the literature) has already been discussed here (see Messages from Melbourne: Towards linking all the things). The second demo (Hero images) gives examples of taxa for which ALA has no images, despite such images being available in the published literature via the Biodiversity Literature Repository. For example, the weevil genus Trigonopterus Fauvel, 1862 is richly illustrated in "Revision of the Australian species of the weevil genus Trigonopterus Fauvel" https://doi.org/10.3897/zookeys.556.6126. With a SPARQL query we can link these images to the associated taxa and provide a richer user experience.


The third demo makes use of Wikidata queries to display information on authors of taxonomic work on Australian species. This is very much a work in progress, but could be extended into a directory of Australian taxonomists.

One application of such a directory could be to determine to what extent Australian taxonomy depends on international researchers. Initial results (https://w.wiki/6PX) show that researchers from multiple countries have contributed to knowledge about Australian animal taxonomy and systematics.

There is still a frighteningly large amount of data cleaning and linking to do, but I think we've only scratched the surface of how knowledge graphs can be used to enrich biodiversity databases.

Monday, July 15, 2019

Notes on collections, knowledge graphs, and Semantic Web browsers

While working with linked data and ways to explore and visualise information, I keep coming back to the Haystack project, which is now over a decade old. Among the tools developed was the Haystack application, which enabled a user to explore all sorts of structured data. Below is a screen shot of Haystack showing a sequence for Homo sapiens cyclin T1 (CCNT1), transcript variant a, mRNA. Note the use of an LSID to identify the sequence (LSIDs were actively being used to identify bioinformatics resources) urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:nm_001240.



For some background on the Haystack project see How to Make a Semantic Web Browser DOI:10.1145/988672.988707 (PDF) and Haystack: A Customizable General-Purpose Information Management Tool for End Users of Semistructured Data PDF.
One reason I keep coming back to the Haystack project is the notion of having a personal space for exploring linked data. One of the challenges of having a large knowledge graph is that it becomes hard to have "local" queries. That is, queries which are restricted to a subset of things that you care about.

For example, while playing around with Ozymandias I keep coming across interesting species, such as Milyeringa justitia (see FIGURE 5 in A new species of the blind cave gudgeon Milyeringa (Pisces: Gobioidei, Eleotridae) from Barrow Island, Western Australia, with a redescription of M. veritas Whitley).


If I want to explore this taxon in more detail I'd like to have the original description, any relevant DNA sequences (e.g., MG543430), any papers publishing those sequences (e.g., Multiple molecular markers reinforce the systematic framework of unique Australian cave fishes (Milyeringa : Gobioidei)), and phylogenetic analyses such as the paper The First Record of a Trans-Oceanic Sister-Group Relationship between Obligate Vertebrate Troglobites which establishes a link between Milyeringa and a genus of cave fish endemic to Madagascar (Typhleotris).

What I'd like to be able to do is collect all these sources (ideally by simply bookmarking the links), saving them as a "collection", then at some point exploring what the knowledge graph can tell me. The importance of having a collection is so that I can tell the knowledge graph that I just want to explore a subset of information. Without a collection it can be tricky to limit the scope of queries. For example, given a global knowledge graph such as Wikidata, how would you query just species found in Australia? You would typically rely on the species having either a property ("found in Australia"), or perhaps an identifier that is only used for Australian species. Neither of these is particularly satisfactory, especially if there isn't a property that fortuitously matches the scope of your inquiry.
Hence, I'm interested in having collections: lists of entities that I want to know more about. I need ways to create these collections, ways to describe them, and ways to explore them. In some ways the collections feature of EOL was close to what I'm after. In the previous version of EOL you could "collect" taxa that you were interested in (for example, species that were blue) (see I think I now "get" the Encylopedia of Life). Sadly, collections (along with JSON-LD export and stable image URLs) have vanished from the new EOL (which seems to be in a death spiral driven by some really unfortunate decisions). And collections need to be able to contain any entity, not just taxa.

One way to represent collections in the linked data world is using RSS feeds, or their schema.org descendant, the DataFeed (see also Google's Data Feed Validation Tool). So, we could collect a series of things we are interested in, create the corresponding DataFeed, import that into our Knowledge Graph and that would give us a way to scope our queries (using membership of the DataFeed to select the species, papers, sequences, etc. that we are interested in). As an aside, there's also some overlap with another MIT project of old, David Huynh's Parallax project which explored querying on a set of objects, rather than one object at a time. This is the functionality that a collection gives you (if you have a query language like SPARQL which can work on sets of things).
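To make the DataFeed idea a little more concrete, here is a minimal Python sketch of what a collection might look like as JSON-LD. The function name, the collection name, and the example.org taxon identifier are made up for illustration; the NCBI link for MG543430 is simply one plausible way to identify that sequence. The only schema.org terms used are DataFeed, dataFeedElement, DataFeedItem, and item.

```python
import json


def make_data_feed(name, item_ids):
    """Wrap a list of entity identifiers in a schema.org DataFeed (JSON-LD)."""
    return {
        "@context": "http://schema.org",
        "@type": "DataFeed",
        "name": name,
        # Each member of the collection becomes a DataFeedItem pointing at the entity.
        "dataFeedElement": [
            {"@type": "DataFeedItem", "item": {"@id": item_id}}
            for item_id in item_ids
        ],
    }


# Hypothetical collection: identifiers are illustrative, not a fixed convention.
feed = make_data_feed(
    "Blind cave fishes",
    [
        "https://www.ncbi.nlm.nih.gov/nuccore/MG543430",
        "https://example.org/taxon/milyeringa-justitia",
    ],
)

print(json.dumps(feed, indent=2))
```

Because the members are just identifiers, the same structure can hold taxa, papers, sequences, or anything else that has a URI.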

Returning to Haystack, I'm intrigued by the idea of building a personal linked data browser. In other words, a browser that stores data that is relevant to projects you are working on (e.g., blind fish) as collections (data feeds), but can query a global knowledge graph to augment that information. SPARQL supports federated queries, so this is eminently doable. The local browser would have its own triple store, which could be implemented using Linked Data Fragments.
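As a rough sketch of what such a federated query could look like, the Python below builds a SPARQL query that takes the members of a local collection and asks Wikidata for their English labels via a SERVICE clause. It assumes (and this is purely an assumption) that the collection is stored locally as a schema.org DataFeed and that its items are identified by Wikidata entity URIs; the function name and the feed URI are hypothetical.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def federated_collection_query(feed_iri):
    """Build a SPARQL query scoped to one local collection, augmented from Wikidata."""
    return f"""PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?item ?label WHERE {{
  # Local triple store: which things are in my collection?
  <{feed_iri}> schema:dataFeedElement ?element .
  ?element schema:item ?item .

  # Remote knowledge graph: what does Wikidata call them?
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?item rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }}
}}"""


# Hypothetical identifier for a locally stored collection.
query = federated_collection_query("https://example.org/feeds/blind-fish")
print(query)
```

The point of the design is that the collection does the scoping: the remote graph is only ever asked about things that are already in the feed.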

For now this is just a jumble of poorly articulated ideas, but I think much of the power of linking data together will be lost until we have simple tools that enable us to explore the data in ways that are relevant to what we actually want to know. Haystack gives us one model of what such a tool could look like.

Friday, June 21, 2019

Messages from Melbourne: Towards linking all the things

I'm doing some work with Nicole Kearney (@nicolekearney) at the Melbourne Museum on the general theme of "linking all the things". It's the end of the first full week we've had, so here's a quick update of what we've been up to.

Brainstorming

The things we want to do are being captured as a project on GitHub. This is where we come up with ideas, comment on them, then try to figure out which ones can be done. So far there are three things we've made a serious start on.

Unpaywall

Unpaywall is a project by Impactstory. It is sort of a Sci-Hub without the legal issues (for the record, I think Alexandra Elbakyan's work on Sci-Hub is nothing short of heroic). Unpaywall scans open access archives for legal, freely available versions of articles and makes them easy to find. If you have Firefox or Chrome you can get a plugin that lights up if the paywall article you're looking at has a free version somewhere else.
Nicole has long wanted the BHL to provide data to Unpaywall, because BHL has open access versions of many papers relevant to taxonomy and biodiversity more broadly defined. After a bit of digging we figured out that Unpaywall didn't have access to BHL's data, so we've set about fixing that. We've got the data harvested, but we're still waiting for Unpaywall to process that data. So, for now, we're still waiting for the little green light to appear on pages such as this one: https://doi.org/10.1080/00222932208632640.


Adding taxonomic literature to Atlas of Living Australia

Part of "linking all the things" is making the taxonomic literature a first class citizen of biodiversity databases. It is frankly embarrassing to see how much better the scientific literature is handled by projects such as Wikipedia than scientific databases such as GBIF and the ALA. We've decided to try and do something about this by showing how easily the literature could be embedded into the existing ALA web site. Nicole crafted a mockup of the ALA names tab, and I wrote some code to make it "live". For example, if you click on this link you will see a list of publications for Pauropsalta herveyensis Owen & Moulds, 2016. Note that we have DOIs and links to BHL wherever possible (and we use Unpaywall's API to flag whether an article with a DOI is freely available). We want this literature (the primary evidence for what we know about a species) to be visible and accessible. The demo is powered by my Ozymandias project, but we hope to work out a mechanism for delivering the mapping between taxa and literature to ALA (and, indeed, anyone else) as a dataset.
Because Ozymandias only has data for animals, we've had to exclude plants from this demo. I'm frantically trying to figure out how to work with data in Australia's plant name databases to resolve this. I'm discovering that never mind having more than one name for the same species, taxonomists also delight in having many different ways of representing taxonomic information in their databases. So, plants will be a challenge.
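For anyone wanting to try the Unpaywall check mentioned above, here is a minimal Python sketch (not the code behind the demo). It builds the request URL for Unpaywall's v2 API, which expects a DOI and an email address, and reads the is_oa and best_oa_location fields from the JSON that comes back. The helper names, the email address, and the sample records are all invented for illustration, and no network call is made here.

```python
from urllib.parse import quote

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the Unpaywall v2 lookup URL for a DOI (the API asks for an email)."""
    return f"{UNPAYWALL_API}{quote(doi, safe='/')}?email={quote(email)}"


def free_version(record):
    """Return the URL of the best open access copy, or None if there isn't one."""
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    return location.get("url")


url = unpaywall_url("10.1080/00222932208632640", "you@example.org")
print(url)

# Made-up records standing in for parsed API responses.
open_record = {"is_oa": True, "best_oa_location": {"url": "https://example.org/free.pdf"}}
closed_record = {"is_oa": False, "best_oa_location": None}

print(free_version(open_record))
print(free_version(closed_record))
```

In a real page you would fetch the URL, parse the JSON, and use the result of free_version to decide whether to show an "open access" flag next to the citation.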


Mapping taxonomists to ORCID and Wikidata

One reason for adding literature to taxonomic databases is to make the work of taxonomists more visible. One way to do this is to move beyond using only "dumb strings" as people names and link taxonomists to their ORCIDs and to entries in Wikidata (this is something I touched on in Ozymandias, and David Shorthouse is doing on an epic scale in Bloodhound). We're playing with the idea of being able to generate a list of active taxonomists in Australia, linked to their identifiers and publications, solely based on querying Wikidata. The first step is to try and automate the initial mapping between taxonomists and Wikidata as much as possible; we've only just started looking at this.
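One simple piece of that mapping is going from an ORCID to a Wikidata item. Wikidata stores ORCID iDs under property P496, so a lookup can be sketched as below. This is only an illustration of the idea, not the approach we have settled on: the function name is invented, the format check is a plain pattern match (it does not verify the ORCID checksum), and the query relies on the prefixes that the Wikidata Query Service predefines.

```python
import re

# Four groups of four characters; the final character may be X.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def wikidata_query_for_orcid(orcid):
    """Build a SPARQL query that finds the Wikidata item with this ORCID iD (P496)."""
    if not ORCID_PATTERN.match(orcid):
        raise ValueError(f"not a well-formed ORCID iD: {orcid}")
    return (
        "SELECT ?person ?personLabel WHERE {\n"
        f'  ?person wdt:P496 "{orcid}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


orcid_query = wikidata_query_for_orcid("0000-0002-2168-0514")
print(orcid_query)
```

Names without an ORCID are the hard part, of course; a "dumb string" such as an abbreviated author name is rejected by the check above and needs some other form of matching.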

Summary

It is early days, and we're still identifying things we could work on. As always, there are so many things which could be done; we're hoping we can make progress on at least some of these in the next few weeks.