Monday, July 27, 2015

The Biodiversity Data Journal is not machine readable

In my previous post I discussed the potential for the Biodiversity Data Journal (BDJ) to be a venue for nano (or near-nano) publications. In this post I want to draw attention to what I think is a serious stumbling block: the lack of machine readable statements in the journal.

Given that the journal is probably the most progressive in the field (indeed, I suspect that there are few journals in any field as advanced in publishing technology as BDJ) this may seem an odd claim to make. The journal provides XML of its text, and typically provides data in Darwin Core Archive format, which is harvested by GBIF. The article XML is marked up to flag taxonomic names, localities, etc. Surely this is the very definition of machine readable?

The problem becomes apparent when you ask "what claims or assertions are the papers making?", and "how are those assertions reflected in the article XML and/or the Darwin Core Archive?".

For example, consider the following paper:

Gil-Santana, H., & Forero, D. (2015, June 16). Aristathlus imperatorius Bergroth, a newly recognized synonym of Reduvius iopterus Perty, with the new combination Aristathlus iopterus (Perty, 1834) (Hemiptera: Reduviidae: Harpactorinae). BDJ. Pensoft Publishers.

The title gives the key findings of this paper: Aristathlus imperatorius = Reduvius iopterus, and Reduvius iopterus = Aristathlus iopterus. Yet these statements are nowhere to be found in the Darwin Core Archive for the paper, which simply lists the name Aristathlus iopterus. The XML markup flags terms as names, but says nothing about the relationships between the names.
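To make this concrete, here is a minimal sketch of how the relationships in the title could be written as taxon rows. It uses the Darwin Core terms scientificName, taxonomicStatus, acceptedNameUsage and originalNameUsage; the row layout, the choice of the plain status "synonym", and the helper names `to_tsv` and `synonyms_of` are my own assumptions for illustration, not anything BDJ or Pensoft actually produces.

```python
import csv
import io

# Names are taken from the paper's title; everything else is illustrative.
ACCEPTED = "Aristathlus iopterus (Perty, 1834)"
FIELDS = ["scientificName", "taxonomicStatus", "acceptedNameUsage", "originalNameUsage"]

rows = [
    # The new combination, pointing back to the name it was originally published under.
    {"scientificName": ACCEPTED, "taxonomicStatus": "accepted",
     "acceptedNameUsage": ACCEPTED, "originalNameUsage": "Reduvius iopterus Perty"},
    # The original combination, now treated as a synonym of the new combination.
    {"scientificName": "Reduvius iopterus Perty", "taxonomicStatus": "synonym",
     "acceptedNameUsage": ACCEPTED, "originalNameUsage": "Reduvius iopterus Perty"},
    # The newly recognized synonym; its original name is not stated in the title, so it is left blank.
    {"scientificName": "Aristathlus imperatorius Bergroth", "taxonomicStatus": "synonym",
     "acceptedNameUsage": ACCEPTED, "originalNameUsage": ""},
]


def to_tsv(records):
    """Serialise the records as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def synonyms_of(records, accepted):
    """List the names recorded as synonyms of the given accepted name."""
    return [r["scientificName"] for r in records
            if r["taxonomicStatus"] == "synonym" and r["acceptedNameUsage"] == accepted]


print(to_tsv(rows))
print(synonyms_of(rows, ACCEPTED))
```

With rows like these, a downstream database could recover both synonymies from the data file alone, without having to read the title.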

Here is another example:

Starkevich, P., & Podenas, S. (2014, December 30). New synonym of Tipula (Vestiplex) wahlgrenana Alexander, 1968 (Diptera: Tipulidae). BDJ. Pensoft Publishers.

Indeed, I've yet to find an example of a paper in BDJ where a synonymy asserted in the text is reflected in the Darwin Core Archive!

The issue here is that neither the XML markup nor the associated data files capture the semantics of the paper, in the sense of what the paper is actually saying. The XML and DwCA files capture (some of) the names and localities mentioned, but not the (arguably) most crucial new pieces of information.
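The gap is easy to state as a check. The sketch below compares the synonymies asserted in a paper against the taxon rows of its archive and reports the ones that are not recorded. The function name `missing_synonymies` is hypothetical, the single archive row is a stand-in based on the Gil-Santana & Forero example rather than the actual file, and exact string matching is a deliberate simplification (real names would need normalising first).

```python
def missing_synonymies(asserted, taxon_rows):
    """Return the asserted (synonym, accepted name) pairs that no taxon row records."""
    recorded = {
        (row.get("scientificName", ""), row.get("acceptedNameUsage", ""))
        for row in taxon_rows
    }
    return [pair for pair in asserted if pair not in recorded]


# Assertions a human reader takes from the title of the paper.
asserted = [
    ("Aristathlus imperatorius", "Aristathlus iopterus"),
    ("Reduvius iopterus", "Aristathlus iopterus"),
]

# Stand-in for an archive that lists only the one name, with no relationships.
archive_rows = [{"scientificName": "Aristathlus iopterus"}]

missing = missing_synonymies(asserted, archive_rows)
for synonym, accepted in missing:
    print(f"not in archive: {synonym} = {accepted}")
```

For an archive that lists only the accepted name, both assertions come back as missing, which is exactly the disconnect described above.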

There is a disconnect between what the papers are saying (which a human reader can easily parse) and what the machine-readable files are saying, and this is worrying. Surely we should be ensuring that the Darwin Core Archives and/or XML markup are capturing the key facts and/or assertions made by the paper? Otherwise databases downstream will remain none the wiser about the new information the journal is publishing.