Showing posts with label Pensoft. Show all posts
Showing posts with label Pensoft. Show all posts

Tuesday, June 18, 2024

Nanopubs, a way to create even more silos

How to cite: Page, R. (2024). Nanopubs, a way to create even more silos https://doi.org/10.59350/6nj85-7te92

Pensoft have recently introduced “nanopubs”, small structured publications that can be thought of as containing the minimum possible statement that could be published.

Nanopublications are the smallest units of publishable information: a scientifically meaningful assertion about anything that can be uniquely identified and attributed to its author and serve to communicate a single statement, its original source (provenance) and citation record (publication info). Nanopublications are fully expressed in a way that is both human-readable and machine-interpretable. For more, see https://nanopub.net, Pensoft blog, this video and on our website. Nanopublications

Nanopubs are promoted as FAIR, that is findable, accessible, interoperabile, and reusable. I like the idea of nanopubs, but the examples I have seen so far are problematic. As an aside, there are reasons not to be optimistic about nanopubs (or text-mining in general), see The Business of Extracting Knowledge from Academic Publications.

I’m going to focus on one nanopub RAXCvEZfCc, which comes from the paper Towards computable taxonomic knowledge: Leveraging nanopublications for sharing new synonyms in the Madagascan genus Helictopleurus (Coleoptera, Scarabaeinae). This nanopub says that Helictopleurus dorbignyi Montreuil, 2005 is a subjective synonym of Helictopleurus halffteri Balthasar, 1964.

In other words,

This seems a fairly simple thing to say, indeed we could say it with a single triple, but the corresponding nanopub requires 33 RDF triples to say this.

<https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasAssertion> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasProvenance> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.nanopub.org/nschema#hasPublicationInfo> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.nanopub.org/nschema#Nanopublication> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#Head> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://w3id.org/biolink/vocab/OrganismTaxonToOrganismTaxonAssociation> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <http://www.w3.org/2000/01/rdf-schema#comment> "Subjective synonymy based on morphological comparison of the type specimens of the two species names" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/object> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#objtaxon> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/predicate> <http://purl.obolibrary.org/obo/NOMEN_0000285> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/biolink/vocab/subject> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#subjtaxon> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#objtaxon> <https://w3id.org/kpxl/biodiv/terms/hasTaxonName> <https://www.checklistbank.org/dataset/9880/taxon/3K9T4> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#subjtaxon> <https://w3id.org/kpxl/biodiv/terms/hasTaxonName> <https://www.checklistbank.org/dataset/9880/taxon/3K9ST> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://rs.tdwg.org/dwc/terms/basisOfRecord> <http://rs.tdwg.org/dwc/terms/PreservedSpecimen> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://www.w3.org/ns/prov#wasAttributedTo> <https://orcid.org/0000-0002-1938-6105> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#assertion> <http://www.w3.org/ns/prov#wasDerivedFrom> <https://arpha.pensoft.net/preview.php?document_id=22521> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#provenance> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasAlgorithm> "RSA" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasPublicKey> "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCnFtZQdjMpPH4duOBwDybRdPo93QCanFGN8cnpyHqZRQ+FINXypUYCNRSx3VBaWZoLVB/CYCoMY0or/oxBQwl5N7Y/8Ebj+G9ZSNsSkM9uo2DL91f26Y1y2UDE7bnajG909kXQnJS1G59cqIaKyLInjMFD5vWnptysj/ljBv3NTwIDAQAB" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasSignature> "YzTUmwGRmqHiJVyU1A6rPI1bHbAJPS+Zw6hnDPWzZ9a/7TP+yM/HAf5E9BTS3HNKaCgLAHSnsRg5Q0lPauYQyJd9tbLzR6VU/WJv399Z7/qrn4EhgCULkIhrCAkuWzRtSyHMEbuzyu51ZSQCCPgMZ3HwpVtRa+gVDgqu3nsi5x4=" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#sig> <http://purl.org/nanopub/x/hasSignatureTarget> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/created> "2023-12-24T06:24:14.480Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/creator> <https://orcid.org/0000-0002-1938-6105> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/dc/terms/license> <https://creativecommons.org/licenses/by/4.0/> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/hasNanopubType> <http://purl.obolibrary.org/obo/NOMEN_0000017> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/hasNanopubType> <https://w3id.org/kpxl/biodiv/terms/BiodivNanopub> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://purl.org/nanopub/x/introduces> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#association> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://w3id.org/kpxl/biodiv/terms/BiodivNanopub> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <http://www.w3.org/2000/01/rdf-schema#label> "Helictopleurus dorbignyi Montreuil, 2005 (species) - ICZN subjective synonym - Helictopleurus halffteri Balthasar, 1964 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromProvenanceTemplate> <http://purl.org/np/RAYfEAP8KAu9qhBkCtyq_hshOvTAJOcdfIvGhiGwUqB-M> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAA2MfqdBCzmz9yVWjKLXNbyfBNcwsMmOqcNUxkk1maIM> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAR40PzxS9rmUC2lH2ct7IlYhyEib-3GXY5DkuR8wgHRw> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromPubinfoTemplate> <http://purl.org/np/RAh1gm83JiG5M6kDxXhaYT1l49nCzyrckMvTzcPn-iv90> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig> <https://w3id.org/np/o/ntemplate/wasCreatedFromTemplate> <http://purl.org/np/RAf9CyiP5zzCWN-J0Ts5k7IrZY52CagaIwM-zRSBmhrC8> <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://www.checklistbank.org/dataset/9880/taxon/3K9ST> <https://w3id.org/np/o/ntemplate/hasLabelFromApi> "Helictopleurus dorbignyi Montreuil, 2005 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> . <https://www.checklistbank.org/dataset/9880/taxon/3K9T4> <https://w3id.org/np/o/ntemplate/hasLabelFromApi> "Helictopleurus halffteri Balthasar, 1964 (species)" <https://w3id.org/np/RAXCvEZfCcjYuH5DWOIujBehGQt61y_nRHWssw9u6aYig#pubinfo> .

In part this is because it includes cryptographic signing, presumably to ensure that the statement is what you think it is. There is also a plethora of information about how the nanopublication was derived. Presumably, this is to satisfy reproducibility concerns. But none of this matters if you are producing data that people can’t easily use.

The core statement looks like this:

This graph is saying that there is a triple

By itself this isn’t terribly useful because neither of the two taxa are “things” that have identifiers, they are blank nodes. So, what is the statement about? If we follow the biodiv:hasTaxonName links, we see that there are names associated with these taxa (Helictopleurus dorbignyi, and Helictopleurus halffteri), and these are linked to records in a database in ChecklistBank. This seems complicated, but I assume it is equivalent to saying “in this publication we regard taxa with the names Helictopleurus dorbignyi, and Helictopleurus halffteri to be the same thing”.

Interoperablity

I feel that I have been banging this drum for years now, but you cannot have interoperability unless you use the same identifiers for the same things. That means persistent identifiers, identifiers that you have some confidence will be around in ten, 20, or 50 years (at least).

Leaving aside whatever the persistence of the nanopubs themselves, I find it alarming that the link to the source of the statement that these two names are synonyms is not the DOI for the paper 10.3897/BDJ.12.e120304, but a link to the publishing platform ARPHA: https://arpha.pensoft.net/preview.php?document_id=22521. This link takes me to a login page, not the actual publication, so I can’t retrieve the source of the statement made in the nanopublication using the nanopublication itself.

The taxon names have as their identifiers https://www.checklistbank.org/dataset/9880/taxon/3K9T4 and https://www.checklistbank.org/dataset/9880/taxon/3K9ST. These identifiers are also local to a particular dataset. Why not use identifiers such as the Catalogue of Life entries for these names (i.e., e.g. https://www.catalogueoflife.org/data/taxon/3K9T4, which supports RDF via embedded JSON-LD) or even LSIDs? We have urn:lsid:organismnames.com:name:2521540 for Helictopleurus halffteri and urn:lsid:organismnames.com:name:1770738 for Helictopleurus dorbignyi.

Interestingly, the one well-known external identifier linked to is the ORCID for the author of the nanopub, 0000-0002-1938-6105). I can’t help think that this suggests that authorship of the nanopublication is more important than the fact it publishes.

One can imagine that nanopublications will be registered with authors’ ORCID profiles, which helps flesh out their online CV. This is nice, but where is the equivalent for linking the publication to the nanopub via its DOI, or the taxon names to the nanopub? How do we know whether these nanopubs contradict other nanopubs, or support them, or add new information? For example, there seems to be no way to go from the DOI for the paper to the nanopub.

Vocabulary

Another aspect of interoperability is using the same terms to describe relationships. I’m struck by how many different vocabularies the nanopub requires. Some of these are specific to the administrivia of the nanopub, but others are biological.

For example, http://purl.obolibrary.org/obo/NOMEN_0000285 is used to define the relation between. I confess it’s unclear to me why NOMEN_0000285 isn’t used to directly link the two ChecklistBank records, rather than the indirection via #subjtaxon and #objtaxon, given that is a relationship between names (isn’t it?).

Other ontologies include Biolink-Model and biodiv which I can’t seem to find a description of (the URL resolves to queries on the nanodash site). It amazes me how readily people create new ontologies, especially as in the wider world there is a trend towards one vocabuary to rule them all (schema.org).

Summary

I find it disheartening that the bulk of the information in a nanopub is administrivia about that nanopub. I understand the desire to establish provenance and to cryptographically sign the information, but all this is of limited use if the actual scientific information is poorly expressed.

If nanopubs are to be useful I think they need to:

  • Use persistent identifiers for every entity being referred to, ideally using existing, well-known identifiers. If you are referring to a publication that has a DOI, use that DOI. If you are referring to a taxon or a taxon name, use an appropriate identifier (e.g., an LSID for the name, a URL to a classification).

  • Use simple, existing vocabularies wherever possible. Can you model the data using schema.org (and extensions such as Bioschemas). If not, are you sure you can’t?

Unless more care is taken, nanopubs will go the way of much of the RDF world, creating new, even more verbose, even more arcane silos of data. This is partly a consequence of the primary incentive, which is to publish minimal units of information. Given that we now have persistent identifiers for people (ORCIDs) and those identifiers are linked to an infrastructure that can automatically register publications linked to ORCIDs, can we expect to see a flood of nanopubs? What vaue will these have if we can’t make ready use of the “facts” they assert? How will people build tools on top of nanopubs if the only thing that reliably links to the external world is the ORCID of the person who created it.

Written with StackEdit.

Thursday, June 25, 2015

Biodiversity Data Journal data lost on the way to GBIF and EOL

Two ongoing challenges in biodiversity informatics are getting data into a form that is usable, and linking that data across different projects platforms. A recent and interesting approach to this problem are "data journals" as exemplified by the Biodiversity Data Journal. I've been exploring some data from this journal that has been aggregated by GBIf and EOL, and have come across a few issues. In this post I'll firstly outline the standard format for moving data between biodiversity projects, the Darwin Core Archive, then illustrate some of the pitfalls.

Darwin Core Archive

Firstly a quick digression on the Darwin Core Archive format, which has a few gotchas for newcomers to the format (such as myself). The Darwin Core Archive supports a "star schema" like this.

Dwca

At the centre of the star is a table containing data either about taxa or occurrences. We can have additional tables with other sorts of data, and we also have a meta.xml file which tells us what all the data columns are and how the different tables are related to the core table.

For example, if we have taxa as our core, then we can have a table like this were each taxon has a unique taxon_id:

taxon_idtaxon stuff
1stuff
2stuff
3stuff

Now, imagine that we have a reference for each of these taxa (say it's the paper that originally described these species). Then we could add a unique identifier for that reference reference_id to the taxon table:

taxon_idreference_idtaxon stuff
1astuff
2astuff
3astuff

Now, if we were building a relational database we could have a separate table for the references, and link the two table using the reference_id as a primary key for the references and as a foreign key in the taxon table, like this:

reference_idreference stuff
areference

This means that we need only have the reference stored once, which means there's no redundancy. If we need to update the reference data, we only need to do it once.

However, this is not how Darwin Core Archive works. Because it's a star schema, we need to have a references table like this:

reference_idtaxon_idreference stuff
a1reference
a2reference
a3reference

Note that we have added the taxon_id to link the reference to each taxon, and that the same reference occurs three times (once for each taxon it refers to), hence we have redundancy. Note also that if we don't include the taxon_id key then there's no way for a Darwin Core Archive reader to link the reference to the corresponding taxa (we'll come back to this below).

I've said that the reference are in their own table. In fact, we can have everything in one big table, and use the meta.xml table to tell a Darwin Core Archive reader to process that same table but extract different data each time (the Mammal Species of the World checklist http://doi.org/10.15468/csfquc is an example of this). Hence, we could extract taxon_id and taxon stuff for the taxa, then reference_id, reference stuff for the references.

taxon_idreference_idtaxon stuffreference stuff
1astuffreference
2astuffreference
3astuffreference

The other thing to remember is that the meta.xml file is responsible for describing the data. It does this in two ways (1) it defines the type of data a given table contains (e.g., taxa, occurrence, image, etc.), and (2) it defines what each column in the data represents, using a controlled vocabulary.

The type of data each table contains is defined by a URI, and the list of these "registered extensions" is available from GBIF. The two "core" extensions are for taxa and occurrences, the two things GBIF primarily deals with, while the other extensions enable richer data to be added. Of course, a Darwin Core Archive consumer that doesn't understand these extensions can simply ignore them. Rather unfortunately, some extensions, such as the EOL media and references extensions overlap with the GBIF multimedia and references extensions. Hence, if you have, say images or bibliographic data, you have two extensions to choose from. If you choose EOL's then EOL will import your data, but GBIF won't. Furthermore, the extensions vary in richness. If you have bibliographic data then GBIF's vocabulary for references looks sparse and lacking many of the fields one might expect, whereas EOL's is quite rich.

Problems with Biodiversity Data Journal and GBIF

With that background, let's take a look at what happens to Biodiversity Data Journal (BDJ) data once it enters GBIF. For example, the species Eupolybothrus cavernicolus, described using "transcriptomic, DNA barcoding and micro-CT imaging data" (http://dx.doi.org/10.3897/BDJ.1.e1013). Data from this paper is in GBIF as both an occurrence dataset (http://doi.org/10.15468/zpz4ls) and checklist dataset (http://doi.org/10.15468/rpavbl).

Images

The checklist dataset includes both media and references. The images don't appear in GBIF, but are visible in EOL (e.g., http://eol.org/data_objects/26558840 shown below:

36163 orig Because the type for the media is set to a type (http://eol.org/schema/media/Document) that only EOL recognises, GBIF doesn't harvest the images, and hence misses out on all this extra multimedia goodness.

References

The references in the BDJ dataset don't appear in either GBIF or EOL (see http://eol.org/pages/38177334/literature). Presumably they don't appear in GBIF because BDJ uses EOL's extension, but why don't they appear in EOL? Looking at the raw data, the references.csv file in the Darwin Core lacks the coreid field needed to link the references to the corresponding taxon (the fiels is defined in the meta.xml file, but there is no corresponding column in the references.csv file. Looking at other BDJ Darwin Core Archives this seems to be a common problem.

Map

Strangely the BDJ paper shows a map with a point locality, but the same data in GBIF does not (see http://doi.org/10.15468/zpz4ls). Mapbdj

A look at the occurrences.csv shows that the file has verbatim latitude and longitude but not decimal versions of the coordinates, which is what GBIF uses to locate records on the map. So the BDJ data set isn't contributing any geographical data. Clearly a lot of BDJ data is georeferenced (see map), but not this example.

Taxa

The centipede Eupolybothrus cavernicolus is not in GBIF's backbone classification. This is a common issue, especially with newly described taxa. GBIF does not have access to recent nomenclatural data, and so even though the BDJ data comes with a ZooBank LSID urn:lsid:zoobank.org:act:6F9A6F3C-687A-436A-9497-70596584678C for the name Eupolybothrus cavernicolus, GBIF itself doesn't know about and so if you do a default search on the name Eupolybothrus cavernicolus you get only the genus.

Summary

Here are the issues I uncovered after a little bit of messing about:

  1. BDJ Darwin Core Archives don't support extensions recognised by GBIF.
  2. BDJ references lack the coreid for the taxa/occurrences and hence are not ingested by Darwin Core readers.
  3. BDJ does not seem to parse and interpret verbatim coordinates when generating Darwin Core Archives.
  4. GBIF doesn't support the extensions output by BDJ.
  5. GBIF's references extension is woefully inadequate for handling bibliographic metadata.
  6. GBIF's list of taxonomic names is woefully out of date.

What both puzzles and frustrates me is that a much trumpeted collaboration between these projects has significant problems which seem to have gone undetected. It seems as if it is enough to have a pipeline between a data journal and a project, without actually testing whether that pipeline loses or misrepresents the data. In some cases, very little of the data in a BDJ archive actually makes it into GBIF, which is wasteful and rather defeats the point of having a data journal to database pipeline in the first place.