Saturday, October 24, 2020

Visualising article coverage in the Biodiversity Heritage Library

It's funny how some images stick in the mind. A few years ago Chris Freeland (@chrisfreeland), then working for Biodiversity Heritage Library (BHL), created a visualisation of BHL content relevant to the African continent. It's a nice example of small multiples.


For more than a decade (gulp) I've been extracting articles from the BHL and storing them in BioStor. My approach is to locate articles based on metadata (e.g., information on title, volume, and pagination) and store the corresponding set of BHL pages in BioStor. BHL in turn regularly harvests this information and displays these articles as "parts" on their web site. Most of this process is semi-automated, but still requires a lot of manual work. One thing I've struggled with is getting a clear sense of how much progress has been made, and how much remains to be done. This has become more pressing given work I'm doing with Nicole Kearney (@nicolekearney) on Australian content. Nicole has a team of volunteers feeding me lists of article metadata for journals relevant to Australian biodiversity, and it would be nice to see where the remaining gaps are.
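The metadata-based matching step can be sketched in a few lines. The page tuples and numbering below are made up for illustration — real BHL items are messier (roman numerals, unnumbered plates, repeated page numbers) — but the core idea is just finding the run of scanned pages spanned by an article's start and end pages:

```python
# Hypothetical sketch: given the ordered pages of a scanned BHL item and an
# article's printed start/end pages, return the matching BHL page identifiers.
# The data shapes here are invented, not the actual BHL API response format.

def pages_for_article(item_pages, start_page, end_page):
    """item_pages is an ordered list of (bhl_page_id, printed_page_number)."""
    matched = []
    in_article = False
    for page_id, number in item_pages:
        if number == start_page:
            in_article = True
        if in_article:
            matched.append(page_id)
        if in_article and number == end_page:
            break
    return matched

item = [(1001, 370), (1002, 371), (1003, 372), (1004, 373),
        (1005, 374), (1006, 375), (1007, 376)]
print(pages_for_article(item, 372, 375))  # → [1003, 1004, 1005, 1006]
```

In practice much of the manual work is resolving the cases where this naive matching fails.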

So, motivated by this, and also a recent comment by Nicky Nicolson (@nickynicolson) about the lack of such visualisations, I've put together a crude tool to try and capture the current state of coverage. You can see the results here: https://rdmpage.github.io/bhl-article-coverage/.

As an example, here is v.25:pt.2 (1988) of Memoirs of the Queensland Museum. Each contiguous block of colour highlights an article in this volume:


This item-level view is constructed for each scanned item (typically a volume or an issue). I then generate a PNG bitmap thumbnail of this display for each volume, and display them together in a page for the corresponding journal (e.g., Memoirs of the Queensland Museum):

So at a glance we can see the coverage for a journal. Gray represents pages that have not been assigned to an article, so if you want to add articles to BHL those are the volumes to focus on.
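Roughly how one of these coverage strips could be generated (the palette, cell sizes, and SVG output are my choices for this sketch, not what BioStor actually does):

```python
# Sketch: draw a coverage strip where each page gets a small coloured cell,
# contiguous pages of the same article share a colour, and unassigned pages
# are gray. Output is a simple SVG string.

PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e"]

def coverage_svg(page_articles, cell=8):
    """page_articles: one entry per page, an article index or None if unassigned."""
    rects = []
    for i, article in enumerate(page_articles):
        colour = "#cccccc" if article is None else PALETTE[article % len(PALETTE)]
        rects.append(f'<rect x="{i * cell}" y="0" width="{cell}" height="{cell}" fill="{colour}"/>')
    width = len(page_articles) * cell
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{cell}">'
            + "".join(rects) + "</svg>")

print(coverage_svg([None, 0, 0, 1, 1, 1, None]))
```

A strip like this can then be rasterised to a PNG thumbnail for the journal-level page.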

There's an argument to be made that it is crazy to spend a decade extracting some 226,000 articles semi-automatically. Ideally we could use a tool such as machine learning to identify articles in BHL. It would be a huge time saver if we could simply run a BHL item through a tool that could (a) extract article-level metadata and (b) associate that with the corresponding set of scanned pages. Among the possible approaches would be to develop a classifier that would assign each page in a scanned volume to a category such as "article start", "article end", "article", "plate", etc. In effect, we want a tool that could segment scans into articles (and hence reproduce the coverage diagrams shown above) simply based on attributes of the page images. This doesn't solve the entire problem (we still need to extract metadata, e.g., article titles), but it would be a start. However, it poses the classic dilemma: do I keep doing this manually, or do I stop adding articles and take the time to learn a new technology, in the hope that eventually I will end up adding more articles than if I'd persisted with the manual approach?
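Whatever the classifier looks like, turning its per-page labels back into articles is the easy part. A sketch, using my own shorthand labels rather than any established scheme:

```python
# Sketch: given per-page labels from a hypothetical classifier, group pages
# into article ranges. Labels "start"/"inside"/"end"/"other" are invented.

def segment(labels):
    articles, start = [], None
    for i, label in enumerate(labels):
        if label == "start":
            start = i
        elif label == "end" and start is not None:
            articles.append((start, i))
            start = None
    return articles

print(segment(["other", "start", "inside", "end", "start", "end"]))  # → [(1, 3), (4, 5)]
```

The hard part, of course, is getting the labels right in the first place.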


Wednesday, October 21, 2020

GBIF Challenge success

Somewhat stunned by the fact that my DNA barcode browser I described earlier was one of the (minor) prizewinners in this year's GBIF Ebbe Nielsen Challenge. For details on the winner and other place getters see ShinyBIOMOD wins 2020 GBIF Ebbe Nielsen Challenge. Obviously I'm biased, but it's nice to see the challenge inspiring creativity in biodiversity informatics. Congratulations to everyone who took part.

My entry is live at https://dna-barcode-browser.herokuapp.com. I had a difficult time keeping it online over the summer due to meow attacks, but managed to sort that out. Turns out the cloud provider I used to host Elasticsearch switched from securing the server by default to making it completely open to anyone, and I'd missed that change.

Given that the project was a success, I'm tempted to revisit it and explore further ways to combine phylogenetic trees in a biodiversity browser.


Wednesday, September 23, 2020

Using the hypothes.is API to annotate PDFs

With somewhat depressing regularity I keep cycling back to things I was working on earlier but never quite get to work the way I wanted. The last couple of days it's the turn of hypothes.is.

One of the things I'd like to have is a database of all taxonomic names such that if you clicked on a name you would get not only the bibliographic record for the publication where that name first appeared (which is what I've been building for animals in BioNames) but also you could see the actual publication with the name highlighted in the text. This assumes that the publication has been digitised (say, as a PDF) and is accessible, but let's assume that this is the case. Now, we could do this manually, but we have tools to find taxonomic names in text. And in my use case I often know which page the name is on, and what the name is, so all I really want is to be able to highlight it programmatically (because I have millions of names to deal with).

So, time to revisit the hypothes.is API. One of the neat "tricks" hypothes.is have managed is the ability to annotate, say, a web page for an article and have that annotation automagically appear on the PDF version of the same article. As described in How Hypothesis interacts with document metadata this is in part because hypothes.is extracts metadata from the article's web page, such as DOI and link to the PDF, and stores that with the annotation (I say "in part" because the other part of the trick is to be able to locate annotations in different versions of the same text). If you annotate a PDF, hypothes.is stores the URL of the PDF and also a "fingerprint" of the PDF (see PDF Fingerprinting for details). This means that you can also add an annotation to a PDF offline (for example, on a file you have downloaded onto your computer) and - if hypothes.is has already encountered this PDF - that annotation will appear in the PDF online.

What I want to do is have a PDF, highlight the scientific name, upload that annotation to hypothes.is so that the annotation is visible online when anyone opens the PDF (and ideally when they look at the web version of the same article). I want to do this programmatically. Long story short, this seems doable. Here is an example annotation that I created and sent to hypothes.is via their API:

{
    "uri": "http://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf",
    "document": {
        "highwire": {
            "doi": [
                "10.1590/1678-476620151053372375"
            ]
        },
        "dc": {
            "identifier": [
                "doi:10.1590/1678-476620151053372375"
            ]
        },
        "link": [
            {
                "href": "urn:x-pdf:6124e7bdb33241429158b11a1b2c4ba5"
            }
        ]
    },
    "tags": [
        "api"
    ],
    "target": [
        {
            "source": "http://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf",
            "selector": [
                {
                    "type": "TextQuoteSelector",
                    "exact": "Alpaida venger sp. nov.",
                    "prefix": "imens preserved in 75% ethanol. ",
                    "suffix": " (Figs 1-9) Type-material. Holot"
                },
                {
                    "type": "TextPositionSelector",
                    "start": 4834,
                    "end": 4857
                }
            ]
        }
    ],
    "user": "acct:xxx@hypothes.is",
    "permissions": {
        "read": [
            "group:__world__"
        ],
        "update": [
            "acct:xxx@hypothes.is"
        ],
        "delete": [
            "acct:xxx@hypothes.is"
        ],
        "admin": [
            "acct:xxx@hypothes.is"
        ]
    }
}

In this example the article and the PDF are linked by including the DOI and PDF fingerprint in the same annotation (thinking about this I should probably also have included the PDF URL in document.highwire.pdf_url[]). I extracted the PDF fingerprint using mutool and added that as the urn:x-pdf identifier.

The actual annotation itself is described twice, once using character position (start and end of the text string relative to the cleaned text extracted from the PDF) and once by including short fragments of text before and after the bit I want to highlight (Alpaida venger sp. nov.). In my limited experience so far this combination seems to provide enough information for hypothes.is to also locate the annotation in the HTML version of the article (if one exists).
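For the record, here is roughly how an annotation like this can be built and posted with just the Python standard library. The endpoint and Bearer-token header follow the hypothes.is API documentation; the token and the field values are placeholders, and I've trimmed the permissions block to the bare minimum:

```python
# Sketch: build an annotation payload like the one above and POST it to the
# hypothes.is API. Token, URLs, DOI and selector text are all placeholders.

import json
import urllib.request

API = "https://api.hypothes.is/api/annotations"

def build_annotation(pdf_url, fingerprint, doi, exact, prefix, suffix, user):
    return {
        "uri": pdf_url,
        "document": {
            "highwire": {"doi": [doi]},
            "dc": {"identifier": [f"doi:{doi}"]},
            "link": [{"href": f"urn:x-pdf:{fingerprint}"}],
        },
        "target": [{
            "source": pdf_url,
            "selector": [
                {"type": "TextQuoteSelector", "exact": exact,
                 "prefix": prefix, "suffix": suffix},
            ],
        }],
        "user": user,
        "permissions": {"read": ["group:__world__"]},
    }

def post_annotation(annotation, token):
    request = urllib.request.Request(
        API,
        data=json.dumps(annotation).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

You need a developer API token from your hypothes.is account settings to actually post.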

You can see the result for yourself using the hypothes.is proxy (https://via.hypothes.is). Here is the annotation on the PDF (https://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf)


and here is the annotation on HTML (https://doi.org/10.1590/1678-476620151053372375)



If you download the PDF onto your computer and open the file in Chrome you can also see the annotation in the PDF (to do this you will need to install the hypothes.is extension for Chrome and click the hypothes.is symbol on your Chrome's toolbar).

In summary, we have a pretty straightforward way to automatically annotate papers offline using just the PDF.

Friday, September 11, 2020

Darwin Core Million reminder, and thoughts on bad data

The following is a guest post by Bob Mesibov.

No winner yet in the second Darwin Core Million for 2020, but there are another two and a half weeks to go (to 30 September). For details of the contest see this iPhylo blog post. And please don’t submit a million RECORDS, just (roughly) a million DATA ITEMS. That’s about 20,000 records with 50 fields in the table, or about 50,000 records with 20 fields, or something arithmetically similar.


The purpose of the Darwin Core Million is to celebrate high-quality occurrence datasets. These are extraordinarily rare in biodiversity informatics.

I’ll unpick that. I’m not talking about the accuracy of the records. For most records, the “what”, “where”, “when” and “by whom” are probably correct. An occurrence record is a simple fact: Wilma Flintstone collected a flowering specimen of an Arizona Mountain Dandelion 5 miles SSE of Walker, California on 27 June 2019. More technically, she collected Agoseris parviflora at 38.4411 –119.4393, as recorded by her handheld GPS.

What could possibly go wrong in compiling a dataset of simple records like that in a spreadsheet or database? Let me count a few of the ways:

  • data items get misspelled or misnumbered
  • data items get put in the wrong field
  • data items are put in a field for which they are invalid or inappropriate
  • data items that should be entered get left out
  • data items get truncated
  • data items contain information better split into separate fields
  • data items contain line breaks
  • data items get corrupted by copying down in a spreadsheet
  • data items disagree with other data items in the same record
  • data items refer to unexplained entities (“habitat type A”)
  • paired data items don’t get paired (e.g. latitude but no longitude)
  • the same data item appears in different formats in different records
  • missing data items are represented by blanks, spaces, “?”, “na”, “-”, “unknown”, “not recorded” etc, all in the same data table
  • character encoding failures create gibberish, question marks and replacement characters (�)
  • weird control characters appear in data items, and parsing fails
  • dates get messed up (looking at you, Excel)
  • records get duplicated after minor edits
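Some of the problems in the list above are mechanically detectable. A toy audit pass (the field names and the missing-value vocabulary are illustrative, not a complete rule set):

```python
# Sketch: flag missing-value placeholders, embedded line breaks, and
# unpaired latitude/longitude in a single occurrence record.

MISSING = {"", "?", "na", "-", "unknown", "not recorded"}

def audit(record):
    problems = []
    for field, value in record.items():
        if value.strip().lower() in MISSING:
            problems.append(f"{field}: missing-value placeholder '{value}'")
        if "\n" in value:
            problems.append(f"{field}: contains a line break")
    lat = record.get("decimalLatitude", "").strip()
    lon = record.get("decimalLongitude", "").strip()
    if bool(lat) != bool(lon):
        problems.append("latitude/longitude not paired")
    return problems

print(audit({"scientificName": "Agoseris parviflora",
             "decimalLatitude": "38.4411", "decimalLongitude": "",
             "habitat": "not recorded"}))
```

Checks like these are exactly the validation rules that rarely get attached to data-entry spreadsheets.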

In previous blog posts (here and here) I’ve looked at explanations for poor-quality data at the project, institution and agency level — data sources I referred to collectively as the “PIA”. I don’t think any of those explanations are controversial. Here I’m going to be rude and insulting and say there are three further obstacles to creating good, usable and shareable occurrence data:

Datasets are compiled as though they were family heirlooms.

The PIA says “This database is OUR property. It’s for OUR use and WE understand the data, even if it’s messy and outsiders can’t figure out what we’ve done. Ambiguities? No problem, we’ll just email Old Fred. He retired a few years back but he knows the system back to front.”

Prising data items from these heirlooms, mapping them to new fields and cleaning them are complicated exercises best left to data specialists. That’s not what happens.

Datasets are too often compiled by people with inadequate computer skills. Their last experience of data management was building a spreadsheet in a “digital learning” class. They’re following instructions but they don’t understand them. Both the data enterers and their instructors are hoping for a good result, which is truly courageous optimism.

The (often huge) skills gap between the compilers of digital PIA data and the computer-savvy people who analyse and reformat/repackage the data (users and facilitators-for-users) could be narrowed programmatically, but isn’t. Hands up all those who use a spreadsheet for data entry by volunteers and have comprehensive validation rules for each of the fields? Thought so.

People confuse software with data. This isn’t a problem restricted to biodiversity informatics, and I’ve ranted about this issue elsewhere. The effect is that data compilers blame software for data problems and don’t accept responsibility for stuff-ups.

Sometimes that blaming is justified. As a data auditor I dread getting an Excel file, because I know without looking that the file will have usability and shareability issues on top of the usual spreadsheet errors. Excel isn’t an endpoint in a data-use pipeline, it’s a starting point and a particularly awful one.

Another horror is the export option. Want to convert your database of occurrence records to format X? Just go to the “Save as” or “Export data” menu item and click “OK”. Magic happens and you don’t need to check the exported file in format X to see that all is well. If all is not well, it’s the software’s fault, right? Not your problem.

In view of these and the previously blogged-about explanations for bad data, it’s a wonder that there are any high-quality datasets, but there are. I’ve audited them and it’s a shame that for ethical reasons I can’t enter them myself in the current Darwin Core Million.

Wednesday, August 26, 2020

Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu


I stumbled across this tweet yesterday (no doubt when I should have been doing other things), and disappeared down a rabbit hole. Emerging, I think the trip was worth it.

 

Markdown wikis


Among the tools listed by @zackfan01 were Obsidian and Roam, neither of which I had heard of before. Both Obsidian and Roam are pitched as "note-taking" apps; they are essentially personal wikis where you write text in Markdown and use [[some text goes here]] to create links to other pages (very like a Wiki). Both highlight backlinks, that is, clearly displaying "what links here" on each page, making it easy to navigate around the graph you are creating by linking pages. Users of Obsidian share these graphs in Discord, rather like something from Martin MacInnes' novel "Gathering Evidence". Personal wikis have been around for a long time, but these apps are elegantly designed and seem fun to use. Looking at these apps I'm reminded of my earlier post Notes on collections, knowledge graphs, and Semantic Web browsers where I moaned about the lack of personal knowledge graphs that supported inference from linked data. I'm also reminded of my post Blue Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and opportunities lost where I constructed an interactive tool to navigate BBC data on species and their ecology (you can see this live at https://rdmpage.github.io/bbc-wildlife/www/), and the fun to be had from simply being able to navigate around a rich set of links. I imagine these Markdown-based wikis could be a great way to further explore these ideas.


 

Personal and global knowledge graphs


Then I began thinking about what if the [[page links]] in these personal knowledge graphs were not just some text but, say, a Wikidata identifier (of the form "Qxxxxx")? Imagine that if you were writing notes on, say, a species, you could insert the Wikidata Qid and you would get a pre-populated template that comes with some facts from Wikidata, and you could then use that as a starting point (see for example Toby Hudson's Entity Explosion, which I discuss below). Knowing that more and more scholarly papers are being added to Wikidata, this means you could also add bibliographic citations as Qids, fetching all the necessary bibliographic information on the fly from Wikidata. So your personal knowledge graph intersects with the global graph.
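One way this could look in practice: given a few facts for a Qid (faked here rather than fetched live from Wikidata, and the Qid itself is a placeholder), pre-populate a Markdown note stub that a [[Qxxx]] page link could expand to:

```python
# Sketch: render a pre-populated Markdown note from Wikidata-style facts.
# The Qid and the facts dict are invented for illustration.

def note_stub(qid, facts):
    lines = [f"# {facts.get('label', qid)} ([[{qid}]])", ""]
    for prop, value in facts.items():
        if prop != "label":
            lines.append(f"- **{prop}**: {value}")
    return "\n".join(lines)

facts = {"label": "Agoseris parviflora",
         "taxon rank": "species",
         "parent taxon": "Agoseris"}
print(note_stub("Q4693836", facts))
```

A real implementation would fetch the facts from the Wikidata API (e.g., Special:EntityData) rather than hard-coding them.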

 

Roam


Now, I've not used Roam, but anyone who has is likely to balk at my characterisation of it as "just" a Markdown wiki, because there's more going on here. The Roam white paper talks about making inferences from the text using reasoning or belief networks, although these features don't seem to have had much uptake. But what really struck me as I explored Roam was the notion of not just linking to pages using the [[ ]] syntax, but also linking to parts of pages (blocks) using (( )). In the demo of Roam there are various essays, such as Paul Graham's The Refragmentation, and each paragraph is an addressable block that can be cited independently of the entire essay. Likewise, you can see what pages cite that block.


 Now in a sense these are just like fragment identifiers that we can use to link to parts of a web page, but there's something more here because these fragments are not just locations in a bigger document, they are the components of the document.

Xanadu


This strikes me as rather like Ted Nelson's vision of Xanadu, where you could cite any text at any level of granularity, and that text would be incorporated into the document you were creating via transclusion (i.e., you don't include a copy of the text, you include the actual text). In the context of Roam, this means you have the entire text you want to cite included in the system, so you can then show chunks of it and build up a network of ideas around each chunk. This also means that the text being worked on becomes part of the system, rather than remaining isolated, say, as a PDF or other representation. This also got me thinking about the Plazi project, where taxonomic papers are being broken into component chunks (e.g., figures, taxonomic descriptions, etc.) and these are then being stored in various places and reassembled - rather like Frankenstein's monster - in new ways, for example in GBIF (e.g., https://www.gbif.org/species/166240579) or Species-ID (see doi:10.3897/zookeys.90.1369). One thing I've always found a little jarring about this approach is that you lose the context of the work that each component was taken from. Yes, you can find a link to the original work and go there, but what if you could seamlessly click on the paragraph or figure and see it as part of the original article? Imagine if we had all the taxonomic literature available in this way, so that we could cite any chunk, remix it (which is a key part of floras and other taxonomic monographs), but still retain the original context.

Summary


To come back full circle, in some ways tools like Obsidian and Roam are old hat, we've had wikis for a while, the idea of loading texts into wikis is old (e.g., Wikisource), backlinks are nothing new, etc. But there's something about seeing clean, elegant interpretations of these ideas, free of syntax junk, and accompanied by clear visions of how software can help us think. I'm not sure I will use either app, but they have given me a lot of food for thought.

Tuesday, August 25, 2020

Entity Explosion: bringing Wikidata to every website

A week ago Toby Hudson (@tobyhudson) released a very cool Chrome (and now Firefox) extension called Entity Explosion. If you install the extension, you get a little button you can press to find out what Wikidata knows about the entity on the web page you are looking at. The extension works on web sites that have URLs that match identifiers in Wikidata. For example, here it is showing some details for an article in BioStor (https://biostor.org/reference/261148). The extension "knows" that this article is about the Polish arachnologist Wojciech Staręga.



But this is a tame example, see what fun Dario Taraborelli (@ReaderMeter) is having with Toby's extension:

There are some limitations. For instance, it requires that the web site URL matches the identifier, or more precisely the URL formatter for that identifier. In the case of BioStor the URL formatter is https://biostor.org/reference/$1 where $1 is the BioStor identifier stored by Wikidata (e.g., 261148). So, if you visit https://biostor.org/reference/261148 the extension works as advertised.
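Matching a page URL against a Wikidata URL formatter amounts to turning the formatter into a regular expression (a sketch, not how Entity Explosion itself is implemented):

```python
# Sketch: convert a Wikidata formatter URL (with a $1 slot) into a regex
# and extract the identifier from a page URL, or None if it doesn't match.

import re

def match_formatter(formatter, url):
    pattern = "^" + re.escape(formatter).replace(r"\$1", "(.+)") + "$"
    m = re.match(pattern, url)
    return m.group(1) if m else None

print(match_formatter("https://biostor.org/reference/$1",
                      "https://biostor.org/reference/261148"))  # → 261148
```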

However, identifiers that are redirects to other web sites, such as DOIs, aren't so lucky. A Wikidata item with a DOI (such as 10.1371/JOURNAL.PONE.0133602) corresponds to the URL https://doi.org/10.1371/JOURNAL.PONE.0133602, but if you click on that URL eventually you get taken to https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0133602, which isn't the original DOI URL (incidentally this is exactly how DOIs are supposed to work).

So, it would be nice if Entity Explosion would also read the HTML for the web page and attempt to extract the DOI from that page (for notes on this see https://www.wikidata.org/wiki/Wikidata_talk:Entity_Explosion#Retrieve_identifiers_from_HTML_<meta>_tag), which means it would work on even more web sites for academic articles.

Meantime, if you use Chrome or Firefox as your browser, grab a copy and discover just how much information Wikidata has to offer.

Workshop On Open Citations And Open Scholarly Metadata 2020 talk

I'm giving a short talk at the Workshop On Open Citations And Open Scholarly Metadata 2020, which will be held online on September 9th. In the talk I touch on citation patterns in the taxonomic literature, the recent Zootaxa impact factor story, and mention a few projects I'm working on. To create the presentation I played around with mmhmm, which (pretty obviously) I still need to get the hang of...

Anyway, video below:

 

Friday, August 21, 2020

Taxonomic concepts: a possible way forward

Reading the GitHub issue Define objective rules for taxon concept identity referred to by Markus Döring in a comment on a previous post, I'm once again struck by the unholy mess generated by any discussion of "taxonomic concepts". The sense of déjà vu is overwhelming. What drives me to distraction is how little of this seems to be directed at solving actual problems that biologists have, which are typically things like "what does this random name that I've come across refer to?" and "will you please stop changing the damn names!?".

One thing that's also struck me is the importance of stable identifiers for species, that is, identifiers that are stable even in the face of name changes. If you have that, then you can talk about changes in classification, such as moving a species from one genus to another.


In the diagram above, the species being moved from Sasia to Verreauxia has the same identifier ("afrpic1") regardless of what genus it belongs to. This enables us to easily determine the differences between classifications (and then link those changes to the evidence supporting the change). I find it interesting that projects that manage large classifications, such as eBird and the Reptile database use stable species identifiers (either externally or internally). If you are going to deal with classifications that change over time you need stable identifiers.

So I'm beginning to think that perhaps the single most useful thing we could do as a taxonomic database community is to mint stable identifiers for each unique, original species name. These could be human readable, for example the species epithet plus author name plus year, suitably cleaned up (e.g., all lower case). So, our species could be "sapiens-linnaeus-1758". This sort of identifier is inspired by the notion of uninomial nomenclature:

If the uninomial system is not accepted, or until it is, I see no hope of ever arriving at a really stable nomenclature. - Hubbs (1930)

For more reading, see for example

  • Cantino, D. P., Bryant, H. N., Queiroz, K. D., Donoghue, M. J., Eriksson, T., Hillis, D. M., & Lee, M. S. Y. (1999). Species Names in Phylogenetic Nomenclature. Systematic Biology, 48(4), 790–807. doi:10.1080/106351599260012
  • Hubbs, C. L. (1930). SCIENTIFIC NAMES IN ZOOLOGY. Science, 71(1838), 317–319. doi:10.1126/science.71.1838.317
  • Lanham, U. (1965). Uninominal Nomenclature. Systematic Zoology, 14(2), 144. doi:10.2307/2411739
  • Michener, C. D. (1963). Some Future Developments in Taxonomy. Systematic Zoology, 12(4), 151. doi:10.2307/2411757
  • Michener, C. D. (1964). The Possible Use of Uninominal Nomenclature to Increase the Stability of Names in Biology. Systematic Zoology, 13(4), 182. doi:10.2307/2411777

Just to be clear I'm NOT advocating replacing binomial names with uninomial names (the references above are just to remind me about the topic), but approaches to developing uninomial names could be used to create simple, human-friendly identifiers. Oh, and hat tip to Geoff Read for the comment on an earlier post of mine that probably planted the seed that started me down this track.
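An identifier of the kind suggested above (epithet plus author plus year, suitably cleaned up) could be minted mechanically; the diacritic handling here is one possible choice among many:

```python
# Sketch: mint a human-readable, stable identifier from the original
# species name. Lowercase, strip diacritics, hyphenate.

import re
import unicodedata

def mint_identifier(epithet, author, year):
    raw = f"{epithet} {author} {year}".lower()
    ascii_only = (unicodedata.normalize("NFKD", raw)
                  .encode("ascii", "ignore").decode("ascii"))
    return re.sub(r"[^a-z0-9]+", "-", ascii_only).strip("-")

print(mint_identifier("sapiens", "Linnaeus", 1758))   # → sapiens-linnaeus-1758
print(mint_identifier("albimanis", "Kröber", 1914))   # → albimanis-krober-1914
```

Homonyms and author-name variants would still need disambiguation rules, but the basic scheme is this simple.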

So, imagine going to a web site and, with the uninomial identifier, being able to get the list of every variation on that name, including species names being in different genera (in other words, all the objective or homotypic synonyms of that name).

OK, nice, but what about taxa? Well the second thing I'd like to get is every (significant) use of that name, coupled with a reference (i.e., a "usage"). These would include cases where the name is regarded as a synonym of another name. Given that each usage is dated (by the reference), we then have a timestamped record of the interpretation of taxa referred to by that name. Technically, what I envisage is that we are tracking nomenclatural types, that is, for a given species name we are returning every usage that refers to a taxon that includes the type specimen of that name.

We could imagine doing something trivial such as putting "/n/" before the identifier to retrieve all name variations, and "/t/" to retrieve all usages. One could have a suffix for a timestamp (e.g., "what was the state of play for this name in 1960?")
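The "/n/" and "/t/" prefixes (and an optional timestamp) could be resolved with a few lines of routing; the path shapes here are my invention:

```python
# Sketch: parse a hypothetical "/n/{id}" (names) or "/t/{id}[/{year}]"
# (usages) path into its components.

def resolve(path):
    parts = path.strip("/").split("/")
    kinds = {"n": "names", "t": "usages"}
    kind = kinds.get(parts[0])
    identifier = parts[1] if len(parts) > 1 else None
    year = int(parts[2]) if len(parts) > 2 else None
    return kind, identifier, year

print(resolve("/n/sapiens-linnaeus-1758"))        # → ('names', 'sapiens-linnaeus-1758', None)
print(resolve("/t/sapiens-linnaeus-1758/1960"))   # → ('usages', 'sapiens-linnaeus-1758', 1960)
```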

It seems that something like this would help cut through a lot of the noise around taxa. By itself, a list of names and references doesn't specify everything you might want to know about a taxon, but I suspect that some of the things taxonomists ask for (e.g., every circumscription, every set of "defining" characters, every pairwise relationship between every variation on a taxon's interpretation) are both unrealistic and probably not terribly useful.

For example, circumscriptions (defining a taxon by the set of things it includes) are often mentioned in discussions of taxon concepts, but in reality (at the species level) how many explicit circumscriptions do we have in the taxonomic literature? I'd argue that the circumscriptions that we do have are the ones being generated by modern databases such as GBIF, iNaturalist, BOLD, and GenBank. These explicitly link specimens, photos, or sequences to a taxon (defined locally within that database, e.g. by an integer number), and in some cases are testable, e.g., BLAST a sequence to see if it falls in the same set of sequences. These databases have their own identifiers and notions of what comprises a taxon (e.g., based on community editing, automated clustering, etc.).

This approach of simple identifiers that link multiple name variations would support the name-based matching that is at the heart of matching records in different databases (despite the wailing that names ≠ taxa, this is fundamentally how we match things across databases). The availability of timestamped usages would enable us to view a classification at a given point in time.

This needs to be fleshed out more, and I really want to explore the idea of edit scripts (or patch files) for comparing taxonomic classifications, and how we can use them to document the evidence for taxonomic changes. More to come...

Wednesday, August 19, 2020

Taxonomic concepts continued: All change at ALA and AFD

Continuing my struggles with taxa (see Taxonomic concepts continued: iNaturalist) I now turn to the Atlas of Living Australia (ALA) and the Australian Faunal Directory (AFD), which have perhaps the most fluid taxon identifiers ever. In 2018 I downloaded data from ALA and AFD and used it to create a knowledge graph ("Ozymandias", see GBIF Challenge Entry: Ozymandias and https://ozymandias-demo.herokuapp.com for the web interface to the knowledge graph).

One thing I discovered is that the taxon identifiers used by ALA change... a lot. It almost feels that every time I revisit Ozymandias and compare it to the ALA, things have changed. For example, here is the fly species Acupalpa albimanis (Kröber, 1914) which you can see at https://ozymandias-demo.herokuapp.com/?uri=https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:02b0821a-abad-4996-aae0-741472c6ad06.





The "https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:02b0821a-abad-4996-aae0-741472c6ad06" part of the Ozymandias URL is the URL for this species in the Atlas of Living Australia. Well, it was at the time I built Ozymandias (2018). Now (19 August 2020), it is https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:69e1b774-9875-4ff1-ba20-5c4eeed866dc. If you put https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:02b0821a-abad-4996-aae0-741472c6ad06 into your web browser, you will get redirected to the new URL (under the hood you get a HTTP 302 response with the "Location" header value set to https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:69e1b774-9875-4ff1-ba20-5c4eeed866dc).

So, seemingly, our notion of what Acupalpa albimanis (Kröber, 1914) is has changed since 2018. In fact, ALA itself is out of date, because if you replace the "https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:" bit with "https://biodiversity.org.au/afd/taxa/" you get taken to the AFD site, which is the source of the data for ALA. But ALA says that "https://biodiversity.org.au/afd/taxa/69e1b774-9875-4ff1-ba20-5c4eeed866dc" is old news and is (as of 28 February 2020) "https://biodiversity.org.au/afd/taxa/6f91e39e-8d73-4133-901c-7f4ba1771e30":
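The ALA-to-AFD prefix swap described above is simple string surgery (the UUID is the one from this post):

```python
# Sketch: rewrite an ALA species URL to the corresponding AFD taxon URL
# by swapping the fixed prefixes.

ALA = "https://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:afd.taxon:"
AFD = "https://biodiversity.org.au/afd/taxa/"

def ala_to_afd(url):
    return url.replace(ALA, AFD) if url.startswith(ALA) else None

print(ala_to_afd(ALA + "69e1b774-9875-4ff1-ba20-5c4eeed866dc"))
# → https://biodiversity.org.au/afd/taxa/69e1b774-9875-4ff1-ba20-5c4eeed866dc
```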

So the identifier for this taxon keeps changing. For the life of me I can't figure out why. If I compare the CSV file I downloaded 23 February 2018 with one I downloaded today (19 August 2020, CSV file is available from https://biodiversity.org.au/afd/taxa/THEREVIDAE/names/csv) the only differences are the time stamps for NAME_LAST_UPDATE and TAXON_LAST_UPDATE, and the UUIDs used for NAME_GUID, TAXON_GUID, PARENT_TAXON_GUID, and CONCEPT_GUID fields. So, the administrivia has changed. Other than that, the data in the two files for this fly are identical, so why the change in identifiers? It seems bizarre to create identifiers that regularly change (and then have to maintain the associated redirects to try and keep the old identifiers functional) when the data itself seems unchanged.
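The comparison I did informally above can be automated: strip the volatile columns and then check whether anything of substance differs. The column names are the ones mentioned in this post; the two inline CSV snippets are toy data:

```python
# Sketch: compare two downloads of an AFD names CSV, ignoring the
# timestamp and GUID columns that churn between downloads.

import csv
import io

VOLATILE = {"NAME_LAST_UPDATE", "TAXON_LAST_UPDATE", "NAME_GUID",
            "TAXON_GUID", "PARENT_TAXON_GUID", "CONCEPT_GUID"}

def stable_rows(csv_text):
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({k: v for k, v in row.items() if k not in VOLATILE})
    return rows

old = "SCIENTIFIC_NAME,TAXON_GUID\nAcupalpa albimanis,02b0821a\n"
new = "SCIENTIFIC_NAME,TAXON_GUID\nAcupalpa albimanis,69e1b774\n"
print(stable_rows(old) == stable_rows(new))  # → True
```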

Now, AFD isn't the only project to regularly change identifiers, an older version of the Catalogue of Life also did this, although it wasn't always clear to me how they did that - see Catalogue of Life and LSIDs: a catalogue of fail and the paper "Identifying and relating biological concepts in the Catalogue of Life" https://doi.org/10.1186/2041-1480-2-7.

As I disappear further down this rabbit hole (why oh why did I start doing this?) I'm beginning to suspect that part of the issue here is versioning (argh!), and that what we are seeing is various ways people are trying to cope with it: which identifiers should change, how and when, and what part of the information about a taxon a given identifier should point to (the name, a use of that name, the underlying concept, the database record that tracks a taxon, the latest version of that record, etc.). Given that different databases tackle these issues differently (and not always consistently), the notion that we can easily map these identifiers to each other, and to third-party identifiers such as Wikidata, seems a bit, um, optimistic...

Monday, August 17, 2020

Taxonomic concepts continued: iNaturalist

Following on from my earlier post ("Taxonomic concepts for dummies"), Beckett Sterner commented:

Maybe one productive use case would be to look at what it would take for wikidata to handle taxa (=name+concept) in a way that didn't lose relevant taxonomic information when receiving content from platforms like iNaturalist that has a fairly sophisticated strategy https://www.inaturalist.org/pages/taxon_frameworks

iNaturalist is interesting, but I'm not convinced that it is internally consistent. As a quick rule of thumb, I look for patterns in how name changes relate to taxon identifier changes. For example, we can have cases where a database retains the same taxon identifier (the columns) even if names (rows) change (such as eBird or Avibase). If we move a species from one genus to another, the name changes but (arguably) the taxon itself doesn't (it's still the same set of organisms, just moved to a different part of our classification or, if you like, tagged differently).

[Diagram: two names (rows) mapping to the same taxon (column)]

Then we can have cases like this, where the name (row) is the same but the taxon changes. This might be where we split a taxon in two, and one remaining part retains the original name. So you could argue that the taxon has changed (i.e., in composition) even if the name hasn't.

[Diagram: one name (row) mapping to two different taxa (columns)]

Now, many taxonomic databases seem to do something different: every time the name changes we get a new identifier, even if the taxon bearing that name hasn't changed (i.e., it has the same set of organisms as before), so we get both a name change and a taxon identifier change:

[Diagram: a new name and a new taxon identifier for every change]

Because we have databases that use different approaches to how they use name and taxon identifiers, life can get complicated, especially for projects such as Wikidata that try and synthesise information across all of these databases.
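These patterns can be made explicit with a little code. The following sketch (the function name and the identifiers are illustrative, not from any actual database) classifies how a (name, taxon identifier) pair changed between two versions of a database:

```python
def change_pattern(before, after):
    """Classify how a (name, taxon_id) pair changed between two database
    versions. 'before' and 'after' are (name, taxon_id) tuples."""
    name_changed = before[0] != after[0]
    taxon_changed = before[1] != after[1]
    if name_changed and not taxon_changed:
        return "name changed, taxon identifier stable (eBird/Avibase style)"
    if taxon_changed and not name_changed:
        return "same name, taxon identifier changed (e.g. after a split)"
    if name_changed and taxon_changed:
        return "both changed (new identifier for every new name)"
    return "no change"

# A genus transfer under a database with stable taxon identifiers:
print(change_pattern(("Sasia africana", "afrpic1"),
                     ("Verreauxia africana", "afrpic1")))
```

Running the same check over successive dumps of a database would reveal which of the three patterns it actually follows.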

I haven't looked in detail at iNaturalist, but I have found cases of both

[Diagram: name changed, taxon identifier retained]

and

[Diagram: name retained, taxon identifier changed]

In some cases iNaturalist will change a taxon identifier even if the name remains the same. For example, the "Thrush-like Schiffornis" Schiffornis turdina https://www.inaturalist.org/taxa/8793 has been split into five taxa, one of which bears the same scientific name (Schiffornis turdina https://www.inaturalist.org/taxa/513975). Given that the composition of Schiffornis turdina has changed, there is an argument to be made that its taxon identifier should change, which is what iNaturalist does.

So, it looks like iNaturalist is using its "taxa" identifiers to identify taxa, but then there are cases such as the transfer of the African piculet Sasia africana https://www.inaturalist.org/taxa/18393 to Verreauxia africana https://www.inaturalist.org/taxa/792894, or the transfer of Heraclides rumiko https://www.inaturalist.org/taxa/428606 to Papilio rumiko https://www.inaturalist.org/taxa/509627. In both cases nothing has changed about those species, yet the identifiers have changed (for an example of a "true" taxon identifier, note that NCBI has the same identifier for Sasia africana/Verreauxia africana and for Heraclides rumiko/Papilio rumiko).

My sense is that part of the problem is that we are trying to overload identifiers, which in and of themselves don't tell us much. For example, some might argue that any change in an entity requires a change in the entity's identifier, because the underlying thing has changed. Others might argue that such a change risks making things harder to find (for example, how do we now connect the earlier version of a thing with the newer version, given that the identifier has changed?). In the case of taxonomy, I think we could possibly avoid some of this grief if we acknowledge that names and identifiers have their limits, and that we should decouple them from trying to track changes in the things they point to. Rather, what we could really do with is a timestamped versioning system where we can ask "OK, in 1960, what did this genus look like?". Likewise, when looking at a system such as Wikidata, we shouldn't expect to have a complete view of every taxonomic opinion ever held. But we could aim for a current "snapshot".

Update

I asked a question about this on the iNaturalist forum, and it appears that iNaturalist treats taxon ids as essentially rows in a database: each time a name gets added you get a new integer id, and if you decide that a taxon has fundamentally changed (e.g., a species is split into two or more taxa) then you add the new taxa, generating new integer ids.
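A minimal sketch of that row-per-name model, assuming (as the forum answer suggests) that every added name simply gets the next integer id, and that a split just adds rows:

```python
class TaxonTable:
    """Sketch of a row-per-name model: every added name gets a fresh
    integer id, and a split mints new ids for the resulting taxa."""
    def __init__(self):
        self.rows = {}      # integer id -> name
        self.next_id = 1

    def add_name(self, name):
        tid, self.next_id = self.next_id, self.next_id + 1
        self.rows[tid] = name
        return tid

    def split(self, old_id, new_names):
        """A split doesn't reuse old_id; it just adds rows with new ids."""
        return [self.add_name(n) for n in new_names]

t = TaxonTable()
old = t.add_name("Schiffornis turdina")
new_ids = t.split(old, ["Schiffornis turdina", "Schiffornis stenorhyncha"])
# The narrowed S. turdina gets a *different* id from the original row.
print(old, new_ids)  # 1 [2, 3]
```

Note that under this model the original id lives on as a pointer to a now-superseded row, which is consistent with the Schiffornis example above.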

Monday, August 10, 2020

Australian museums and ALA

The following is a guest post by Bob Mesibov.

The Atlas of Living Australia (ALA) adds "assertions" to Darwin Core occurrence records. "Assertions" are indicators of particular data errors, omissions and questionable entries, such as "Coordinates are transposed", "Geodetic datum assumed WGS84" and "First [day] of the century".

Today (8 August 2020) I looked at assertions attached to records in ALA for non-fossil animals in the Australian State museums. There were 62 occurrence record collections from the seven museums (I lumped the two Tasmanian museums together), with 45 different assertions. I then calculated assertions per record for each collection. The worst performer was the Queensland Museum Porifera collection (3.84 ass/rec), and tied for best were the Museums Victoria Herpetology and Ichthyology collections (1.09 ass/rec).
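The league-table calculation is just a ratio per collection. Here is a sketch, using the two figures quoted above as illustrative inputs (not the full 62-collection dataset; the raw counts are invented to give the stated ratios):

```python
def assertions_per_record(collections):
    """collections: name -> (total_assertions, total_records).
    Returns a list of (name, assertions per record), best (lowest) first."""
    scored = [(name, a / r) for name, (a, r) in collections.items()]
    return sorted(scored, key=lambda x: x[1])

# Illustrative figures matching the ratios quoted in the post.
sample = {
    "QM Porifera": (3840, 1000),      # 3.84 ass/rec (worst)
    "MV Herpetology": (1090, 1000),   # 1.09 ass/rec (best)
}
for name, score in assertions_per_record(sample):
    print(f"{name}: {score:.2f} assertions/record")
```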

I also aggregated museum collections to build a kind of league table by State:

The clear winner is Museums Victoria.

But how well do ALA's assertions measure the quality of data records? Not all that well, actually.

  • The tests used to make the assertions generate false positives and false negatives, although at a low rate
  • The tests aren't independent, so that a single data error can "smear" across several assertions
  • The tests ignore errors and omissions in DwC fields that many data users would consider important

ALA's assertions also have a strong spatial/geographical bias, with 23 of the 45 assertions in my sample dataset saying something about the "where" of the occurrence. Looking just at those 23 "where" assertions, the museums league table again shows Museums Victoria ahead, this time by a wide margin:

ALA is currently working on better ways for users to filter out records with selected assertions, in what's misleadingly called a "Data Quality Project". The title is misleading because the overall quality of ALA's holdings doesn't improve one bit. Getting data providers to fix their data issues would be a more productive way to upgrade data quality, but I haven't seen any evidence that Australian museums (for example) pay much attention to ALA's assertions. (There are no or minimal changes in assertion totals between data updates.)

It's been pointed out to me that museum and herbarium records amount to only a small fraction of ALA's ca 90 million records, and that citizen scientists are growing the stock of occurrence records far faster than institutions do. True, and those citizen science records are often of excellent quality (see https://www.datafix.com.au/BASHing/2020-02-05.html). However, citizen science observations are strongly biased towards widespread and common species. ALA's records for just six common Australian birds (5,072,599 as of 8 August 2020; https://dashboard.ala.org.au/) outnumber all the museum animal records I looked at in the assertion analysis (4,669,508).

In my humble view, the longer ALA's institutional data providers put off fixing their mistakes, the less valuable ALA becomes as a bridge between biodiversity informatics and biodiversity science.


Wednesday, July 22, 2020

DNA barcode browser

Motivated by the 2020 Ebbe Nielsen Challenge I've put together an interactive DNA barcode browser. The app is live at https://dna-barcode-browser.herokuapp.com.


A naturalist from the 19th century would find little in GBIF that they weren't familiar with. We have species in a Linnean hierarchy, their distributions plotted on a map. This method of summarising data is appropriate to much of the data in GBIF, but impoverishes the display of sequence data such as barcodes. Given a set of DNA barcodes we can compute a phylogeny for those sequences, and gain evidence for taxonomic groups, intraspecific genetic structure, etc. So I wanted to see if it was possible to make a simple tool to interactively explore barcode data. This means we need fast methods for searching for similar sequences, and for building phylogenies. I've been experimenting with ways to do this for the last couple of years, but have only now managed to put something together. For more details, see the repository. There is also a quick introductory video.
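As an illustration of the flavour of fast, alignment-free similarity search involved (this is a generic sketch, not the method the app actually uses), here is a toy that scores sequences by shared k-mers:

```python
def kmers(seq, k=8):
    """All k-length substrings of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=8):
    """Jaccard similarity of k-mer sets: a crude but fast proxy for
    sequence similarity that needs no alignment step."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

# Toy "database" of barcodes (made-up sequences for illustration).
query = "ACGTACGTGGCCTTAAGGCCAAT"
db = {"seq1": "ACGTACGTGGCCTTAAGGCCAAT",   # identical to the query
      "seq2": "TTTTTTTTTTTTTTTTTTTTTTT"}   # shares no 8-mers
hits = sorted(db, key=lambda s: similarity(query, db[s]), reverse=True)
print(hits[0])  # seq1
```

The same idea (rank candidates cheaply, then build a tree only for the top hits) is what makes interactive browsing of large barcode sets feasible.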

Friday, July 17, 2020

Taxonomic concepts for dummies

[Work in progress]

The "dummy" in this case is me. I'm trying to make sense of how to model taxa, especially in the context of linked data, and projects such as Wikidata where there is uncertainty over just what a taxon in Wikidata actually represents. There is also ongoing work by the TDWG Taxon Names and Concepts Interest Group. This is all very rough and I'm still working on this, but here goes.

I'm going to assume that names and publications are fairly unproblematic. We have a "controlled thesaurus of scientific names" to use Franck Michel et al.'s phrase, provided by nomenclators, and we have publications. Both names and publications have identifiers.




Then we have "usages" or "instances", which are (name, publication) pairs. These can be of different levels of precision, such as "this name appears in this book", or "it appears on page 5", or in this paragraph, etc. I tend to view "instances" as like annotations, as if you highlighted a name in some text. Indeed, indexing the taxonomic literature and recording the location of each taxonomic name is essentially generating the set of all instances. We can have identifiers for each instance.
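An instance can be modelled as little more than a (name, publication) pair, optionally pinned down to a page. A sketch (the identifiers below are placeholders, except the PLoS ONE DOI which appears later in these posts):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Instance:
    """A 'usage': a (name, publication) pair, optionally pinned to a page."""
    name: str                  # identifier for the name (e.g. a nomenclator id)
    publication: str           # identifier for the publication (e.g. a DOI)
    page: Optional[int] = None

# Two instances of the same name in different publications. The first
# DOI is a made-up placeholder for the original description.
original = Instance("Gangesia agraensis", "10.xxxx/placeholder", page=5)
revision = Instance("Gangesia agraensis", "10.1371/journal.pone.0046421")
print(original != revision)  # True: distinct usages of one name
```

Because the dataclass is frozen and hashable, sets of instances (which is what the taxon model below needs) come for free.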



Then we can have relationships between instances. For example, in taxonomic publications you will often see a researcher list previous mentions of a name in other works, synonyms, etc. So we can have links between instances (which is one reason why they need identifiers).





OK, now comes the tricky bit. Up to now things are pretty simple: we have strings (names), sets of strings (publications), the locations of strings in publications (usages/instances/annotations), and relationships between those instances (e.g., there may be a list of them on the same page in a publication). What about taxa?

It seems to me there are at least two different ways people model taxa in databases. The first is the classic Darwin Core spreadsheet-style model of one unique name per row. If a name is "accepted" it has a parent; if it isn't "accepted" it doesn't have a pointer to a parent, but does have a pointer to an accepted name:




If we treat each row as an instance (e.g., name plus original publication of that name), then the identifier for the instance is the same as the identifier for the taxon. A practical consequence of this is that if the name of a taxon changes, so does the identifier, therefore any data attached to that identifier can be orphaned (or, at least, needs to be reconnected) when there is taxonomic change.

In reality, the rows aren't really usages; they are simply names (mostly) without associated publications. We could be generous and model these as instances where we lack the publication, but basically we have rows of names and pointers between them. So taxa = accepted names. This is the model used by ITIS and GBIF.




Now, a different model is to wrap one or more instances into a set and call that a taxon; the taxon points to the instance that contains the currently accepted name. The taxon has a parent that is itself a taxon. This separates taxa and names, and has the practical consequence that we can have a taxon identifier that persists even if the taxonomic name changes. Hence data linked to the taxon remains linked even if the name changes. And every name applied to a taxon has the same taxon identifier. This makes it possible to track changes in classifications using the diff approach I've discussed earlier. This is the model used by (as far as I can make out) the Australian NSL (site seems perpetually offline), eBird, and Avibase, for example.
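The difference between the two models can be sketched in code. The "name:…" identifiers below are illustrative (built from the iNaturalist taxon numbers for the African piculet), and "afrpic1" is the eBird-style stable taxon id:

```python
# Model 1: flat Darwin Core style. Taxa *are* accepted-name rows, so a
# rename moves any attached data to a different identifier.
flat = {
    "name:18393": {"name": "Sasia africana", "accepted": "name:792894"},
    "name:792894": {"name": "Verreauxia africana", "accepted": None},
}

# Model 2: a taxon wraps a set of name usages and keeps a stable id.
taxa = {
    "afrpic1": {                       # stable taxon identifier
        "instances": ["name:18393", "name:792894"],
        "current": "name:792894",      # points at the accepted usage
    }
}

def data_key_flat(name_id):
    """In the flat model, attached data follows the accepted-name id."""
    row = flat[name_id]
    return row["accepted"] or name_id

def data_key_taxon(name_id):
    """In the wrapped model, any of a taxon's names resolves to one id."""
    for tid, t in taxa.items():
        if name_id in t["instances"]:
            return tid
    return None

print(data_key_flat("name:18393"))   # name:792894 -- moved by the rename
print(data_key_taxon("name:18393"))  # afrpic1 -- stable across renames
```

In the second model you can attach body weight, longevity, etc. to "afrpic1" once and never re-home it when the name changes.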



Wikidata has something very like GBIF: there is essentially no distinction between taxa and names. Hence the identifiers associated with a Wikidata "taxon" can be identifiers for names (e.g., from IPNI) or for taxa (e.g., Avibase), with no clear distinction made between these rather different things.

So, what to do, and does this matter? Maybe part of the problem is the notion that identifiers for taxa should be unique, that is, only one Wikidata item can have a given Avibase ID. This would work if a Wikidata taxon was, in fact, a taxon (and if it matched the Avibase classification). Perhaps we could treat Wikidata "taxa" as something like instances, and label each Wikidata item that belongs in the same taxon (according to the particular identifier being used) with the same taxon identifier?

More to come...

Persistent Identifiers: A demo and a rant

This morning, as part of a webinar on persistent identifiers, I gave a live demo of a little toy to demonstrate linking together museum and herbaria specimens with publications that use those specimens. A video of an earlier run through of the demo appears below, for background on this demo see Diddling with semantic data: linking natural history collections to the scientific literature. The slides I used in this demo are available here: http://pid-demonstrator.herokuapp.com/demo/.

One thing which struck me during this webinar is that discussions about persistent identifiers (PIDs) (also called "GUIDs") seem to endlessly cycle through the same topics; it is as if each community has to reinvent everything and rehash the same debates before it reaches a solution. An alternative is to try and learn from systems that work.

In this respect CrossRef is a great example:

  1. They have (mostly) brand neutral, actionable identifiers that resolve to something of value (DOIs resolve to an article you can read).
  2. They are persistent by virtue of redirection (a DOI is a pointer to a pointer to the thing, a bit like a P.O. Box number). The identifiers are managed.
  3. They have machine readable identifiers that can be used to support an ecosystem of services (e.g., most reference managers just need you to type in a DOI and they do the rest, Altmetric "donuts", etc.).
  4. They have tools for discoverability, in other words, if you have the metadata for an article they can tell you what the corresponding DOI is.
  5. The identifiers deliver something of value to publishers that use them, such as the citation graph of interconnected articles (which means your article is automatically linked to other articles) and you can get real-time metrics of use (e.g., DOIs being cited in Wikipedia articles).
  6. There is a strong incentive for publishers to use other publisher's identifiers (DOIs) in their own content because if that is reciprocated then you get web traffic (and citation counts). If publishers use their own local identifiers for external content they lose out on these benefits.
  7. There are services to handle cases when things break. If a DOI doesn’t resolve, you can talk to a human who will attempt to fix it.
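Point 2 above, persistence via managed redirection, can be sketched as a lookup table that the registrar, not the citing party, keeps up to date (the URLs here are illustrative, and the real DOI system is of course a distributed Handle infrastructure, not a dict):

```python
# A resolver is just a managed lookup: the identifier is a pointer to a
# pointer, and only the mapping -- not the identifier -- changes on a move.
REDIRECTS = {
    "10.1371/journal.pone.0046421":
        "https://journals.plos.org/plosone/article"
        "?id=10.1371/journal.pone.0046421",
}

def resolve(doi):
    """Return the current landing URL for a DOI-like identifier."""
    return REDIRECTS.get(doi)

# When the article moves, the publisher updates the mapping; every
# citation that stored the DOI keeps working unchanged.
REDIRECTS["10.1371/journal.pone.0046421"] = "https://example.org/new-home"
print(resolve("10.1371/journal.pone.0046421"))
```

The point of the sketch is the separation of concerns: citers store the stable identifier, and a single managed table absorbs all the churn in locations.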

My point is that there is an entire ecosystem built around DOIs, and it works. Typically every community considering persistent identifiers attempts to build something themselves, ideally for “free", and ends up with a mess, or a system that doesn’t provide the expected benefits, because the message they got was “we need to get PIDs” rather than “build an infrastructure to enable these things that we need”.

I think we can also learn from systems that failed. In biology, LSIDs failed for a bunch of reasons, mainly because they didn’t resolve to anything useful (they were only machine readable). Also they were free, which seems great, except it means there is no cost to giving them up (which is exactly what people did). Every time someone advocates a PID that is free they are advocating a system that is designed to fail. Minting a UUID for everything costs nothing and is worth nothing. If you think a URL to a website in your organisation's domain is a persistent identifier, just think what happened to all those HTTP URLs stored in databases when post Edward Snowden the web switched to HTTPS.

One issue which came up in the webinar was the status of ISBNs, which aren't actionable PIDs (in the sense that there's no obvious way to stick one in a web browser and get something back). ISBNs have failed to migrate to the web because, I suspect, they are commercially valuable, which means a fight over who exploits that value. Whoever provides the global resolver for ISBNs gets to influence where you buy the corresponding book. Book publishers were slow to sell online, so Amazon gobbled up the online market, and in effect the de facto global resolver (Amazon) makes all the money. The British Library got into trouble for exactly this reason when they provided links to Amazon (see British Library sparks Amazon row). Furthermore, unlike DOIs for the scholarly literature, there aren't really any network effects for ISBNs - publishers don't benefit from other publishers having their content online. So I think ISBNs are an interesting example of the economics of identifiers, and the challenging problem of identifiers for things that are not specific to one organisation. It's much easier to think about identifiers for your stuff because you control that stuff and how it is represented. But who gets to decide on the PID for, say, Homo sapiens?

So, while we navigate the identifier acronym soup (DOI, LSID, ORCID, URL, URI, IRI, PURL, ARK, UUID, ISBN, Handles) and rehash arguments that multiple communities have been having for decades, maybe it's a good time to pause, take a look at other communities, and see what has worked and what hasn't, and why. It may well be that in many cases the kinds of drivers that make CrossRef a success (identifiers return something of value, network effects as the citation graph grows, etc.) don't exist in many heritage situations, but that in itself would be useful to know, and might help explain why we have been sluggish in adopting persistent identifiers.

Wednesday, July 15, 2020

Darwin Core Million now twice a year

The following is a guest post by Bob Mesibov.

The first Darwin Core Million closed on 31 March with no winner. Since I'm seeing better datasets this year in my auditing work for Pensoft, I've decided to run the competition every six months.

Missed the first Darwin Core Million and don't know what it's about? Don't get too excited by the word "million". It refers to the number of data items in a Darwin Core occurrences table, not to the prize!

The rules

  • Anyone can enter, but the competition applies only to publicly available Darwin Core occurrence datasets. These might have been uploaded to an aggregator, such as GBIF, ALA or iDigBio, or to an open-data repository.
  • Select about one million data items from the dataset. That could be 50000 records in 20 populated Darwin Core fields, or 20000 records in 50 populated Darwin Core fields, or something in between. Email the dataset to me after 1 September and before 30 September as a zipped, plain-text file, together with a DOI or URL for the online version of the dataset.
  • I'll audit datasets in the order I receive them. If I can't find serious data quality problems (see below) in your dataset, I'll pay your institution AUD$150 and declare your institution the winner of the Darwin Core Million here on iPhylo. There's only one winner in each competition round; datasets received after the first problem-free dataset won't be checked.
  • If I find serious data quality problems, I'll let you know by email. If you want to learn what the problems are, I'll send you a report detailing what should be fixed and charge your institution AUD$150. At 0.3-0.75c/record, that's a bargain compared to commercial data-checking rates. And it would be really good to hear, later on, that those problems had indeed been fixed and that corrected data items had replaced the originals online.

How the data are judged

For a list of data quality problems, see this page in my Data Cleaner's Cookbook. The key problems I look for are:

  • duplicate records
  • invalid data items
  • missing-but-expected items
  • data items in the wrong fields
  • data items inappropriate for their field
  • truncated data items
  • records with items in one field disagreeing with items in another
  • character encoding errors
  • wildly erroneous dates or coordinates
  • incorrect or inconsistent formatting of dates, names and other items

This is not just nit-picking. Your digital data items aren't mainly for humans to read and interpret, they're intended in the first place for parsing and managing by computers. "Western Hill" might not be the same as "Western Hill" in processing, for example, because the second placename might have a no-break space between the words instead of a plain space. Another example: humans see these 22 variations on collector names as "the same", but computers don't.

Please also note that data quality isn't the same as data accuracy. Is Western Hill really at those coordinates? Is the specimen ID correct? Is the barely legible collector name on the specimen label correctly interpreted? These are questions about data accuracy. But it's possible to have entirely correct digital data that can't be processed by an application, or moved between applications, because the data suffer from one or more of the problems listed above.

Fine points

I think I'm pretty reasonable about the "serious" in "serious data quality problems". One character encoding error, such as "L'H?rit" repeated in the "scientificNameAuthorship" field, isn't serious, but multiple errors scattered through several fields are grounds for rejection.

For an understanding of "invalid", please refer to the Darwin Core field definitions and recommendations.

"Missing-but-expected" is important. I've seen GBIF mis-match a scientific name because the Darwin Core "kingdom" field was left blank by the data provider, even though all the other higher-taxon fields were filled in.

Please remember, entries received before 1 September won't be audited.

Thursday, July 09, 2020

Zootaxa has no impact factor

So this happened:

Zootaxa is a hugely important journal in animal taxonomy:

On one hand one could argue that impact factor is a bad way to measure academic impact, so it's tempting to say this simply reflects a poor metric that is controlled by a commercial company using data that is not open. But it quickly became clear on Taxacom that in some countries the impact factor of the journal you publish in is very important (to the point where it has a major effect on your personal income, and whether it is financially possible for you to continue your career). This discussion got rather heated, but it was quite eye opening to someone like me who casually dismisses impact factor as not interesting.

Partly in response to this I spent a little time scraping the Zootaxa web site to put together a list of all the literature cited by articles published in Zootaxa. Normally this sort of thing dies a lingering death on my hard drive, but this time I've got myself more organised and created a GitHub project for the code and I've uploaded the data to Figshare doi:10.6084/m9.figshare.c.5054372.v1. Regardless of the impact factor issue, it's potentially a fascinating window into centuries of taxonomic publications.

Lists of species don't matter: thoughts on "Principles for creating a single authoritative list of the world’s species"

Garnett et al. recently published a paper in PLoS Biology that starts with the sentence "Lists of species matter":

Garnett, S. T., Christidis, L., Conix, S., Costello, M. J., Zachos, F. E., Bánki, O. S., … Thiele, K. R. (2020). Principles for creating a single authoritative list of the world’s species. PLOS Biology, 18(7), e3000736. doi:10.1371/journal.pbio.3000736

This paper (one of a forthcoming series) is pretty much the kind of paper I try and avoid reading. It has lots of authors so it is a paper by committee, those authors all have a stake in particular projects, and it is an acronym soup of the organisations the paper is pitched at. It's a well-worn strategy: write one or more papers making the case that there is a problem, then get funding based on the notion that clearly there's a problem (you've published papers saying so) and that you and your co-applicants are best placed to solve it (clearly, because you wrote the papers identifying the problem in the first place). I'm not criticising the strategy, it's how you get things done in science. It just makes for a rather uninspiring read.

From my perspective focussing on "lists" is a mistake. Lists don't really matter, it is what is on the list that counts. And I think this is where the real prize is. As I play with Wikidata I'm becoming increasingly aware of the clusterfuck the taxonomic database community has created by conflating taxonomic names with taxa, and by having multiple identifiers for the same things. We survive this mess by relying on taxonomic names as somewhat fuzzy identifiers, and the hope that we can communicate successfully with other people despite this ambiguity (I guess this is pretty much the basis of all communication). As Roger Hyam notes:

These taxon names we are dealing with are really just social tags that need to be organised in a central place.

Having lots of names (tags) is fine, and Wikidata is busy harvesting all these taxonomic names and their identifiers (ITIS, IPNI, NCBI, EOL, iNaturalist, eBird, etc., etc., etc.). For most of these names all we have is a mapping to other identifiers for the same name, a link to a parent taxon, and sometimes a link to a reference for the name. But what happens if we want to attach data to a taxon? Take, for example, the African Piculet Verreauxia africana. This bird has at least two scientific names, each with a separate entry in Wikidata: Verreauxia africana Q28123873 and Sasia africana Q1266812. These are the same species, yet there are two entries in Wikidata. If I want to add, say, body weight, or population size, or longevity, which Wikidata item do I add that data to?

What we need is an identifier for the species, an identifier that remains unchanged even if the name changes, or if that species moves in the taxonomic hierarchy. Some databases do this already. For example the eBird identifier for Verreauxia africana/Sasia africana is afrpic1. Because the identifier remains unchanged we can do things such as "diffs" between successive classifications showing how the species has moved between different genera (see Taxonomic publications as patch files and the notion of taxonomic concepts):


Ironically it seems that for birds the common name (in this case "African Piculet") is a more stable identifier than the scientific name (although that may well change). By having stable taxon identifiers we can then decide what entity to attach biological data to. Taxonomic names have failed to do this, but are still vital as well known tags. The actual taxon identifiers should be opaque identifiers (like "afrpic1" - not really opaque but close enough - or Avibase's C4DFB5E31495AE94). Make each opaque identifier a DOI, use existing taxonomic names as formalised tags so we aren't disconnected from the literature, use timestamped versions to track changes in species classification over time, and we have something useful.
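A diff between two classification snapshots keyed on stable taxon identifiers can be sketched as follows (the 2018/2020 snapshot labels are illustrative):

```python
def classification_diff(old, new):
    """old/new: taxon_id -> scientific name. Because taxon ids are stable,
    a genus transfer shows up as a rename, not as a delete plus an add."""
    renamed = {t: (old[t], new[t]) for t in old.keys() & new.keys()
               if old[t] != new[t]}
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    return renamed, added, removed

v2018 = {"afrpic1": "Sasia africana"}
v2020 = {"afrpic1": "Verreauxia africana"}
renamed, added, removed = classification_diff(v2018, v2020)
print(renamed)  # {'afrpic1': ('Sasia africana', 'Verreauxia africana')}
```

If the keys were names rather than stable ids, the same transfer would wrongly appear as one species lost and one gained.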

This, I think, is the real prize. Rather than frame the task as making a list of species so that organisations can have a checklist they can all share, why not frame it as providing a framework that we can hang trait data on? We have vast quantities of data residing in siloed databases, spreadsheets, and centuries of biological literature. The argument shouldn't be about what is on a list, it should be how we group that information together and enable people to do their science. By providing stable identifiers that are resistant to name changes we can confidently associate trait data with taxa. Taxonomy could then actually be what it should be, the organisational framework for biological information (see Taxonomy as Information Science).

Wednesday, July 01, 2020

Diddling with semantic data: linking natural history collections to the scientific literature

A couple of weeks ago I was grumpy on the Internet (no, really) and complained about museum websites and how their pages often lacked vital metadata tags (such as rel=canonical or Facebook Open Graph tags). This got a response:
Vince's lovely line "diddle with semantic data" is the inspiration for the title of this post, in which I describe a tool to display links across datasets, such as museum specimens and scientific publications. This tool builds on ideas I discussed in 2014(!) (Rethinking annotating biodiversity data, see also "Towards a biodiversity knowledge graph" doi:10.3897/rio.2.e8767).


TL;DR

If you want the quick summary, here it is. If we have persistent identifiers (PIDs) for specimens and publications (or any other entities of interest), and we have a database of links between pairs of PIDs (e.g., paper x mentions specimen y), and both entities have web pages, then we can display that relationship on both web pages using a Javascript bookmarklet. We can do this without permission, in the sense that the specimen web page can be hosted by a museum (e.g., The Natural History Museum in London) and the publication hosted by a publisher (e.g., The Public Library of Science), and neither organisation need know about the connection between specimen and publication. But because we do, we can add that connection. (Under the hood this approach relies on a triplestore that stores the relationships between pairs of entities using the Web Annotation Data Model.)
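A sketch of the kind of annotation the triplestore might hold, loosely following the Web Annotation Data Model (this is a simplified shape for illustration, not the exact schema the demo uses):

```python
import json

def link_annotation(specimen_pid, publication_pid):
    """Build a minimal Web Annotation asserting that a publication
    mentions a specimen: two PIDs joined by an annotation."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "linking",
        "target": specimen_pid,    # the thing annotated (the specimen)
        "body": publication_pid,   # the thing linked to it (the paper)
    }

anno = link_annotation(
    "https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef",
    "https://doi.org/10.1371/journal.pone.0046421",
)
print(json.dumps(anno, indent=2))
```

The key design point is that neither the museum nor the publisher stores this record; it lives in a third-party store keyed on the two PIDs, which is what makes the "without permission" overlay possible.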


Background

Consider the web page https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef which is for a specimen of the cestode Gangesia agraensis in the NHM (catalogue number 2012.8.23.3). If you look at this page the information content is pretty minimal, which is typical of many natural history specimens. In particular, we have no idea if anyone has done anything with this specimen. Has it been sequenced, imaged, or mentioned in a publication? Who knows? We have no indication of the scientific importance or otherwise of this specimen.



Now, consider the page https://doi.org/10.1371/journal.pone.0046421 for the PLoS ONE paper Revision of Gangesia (Cestoda: Proteocephalidea) in the Indomalayan Region: Morphology, Molecules and Surface Ultrastructure. This paper has most of the bells and whistles of a modern paper, including metrics of attention. However, despite this paper using specimens from the NHM there is no connection between the paper and the museum's collections.


Making these connections is going to be important for tracking the provenance of knowledge based on those specimens, as well as for developing metrics of collection use. Some natural history collection web sites have started to show these sorts of links, but we need them to be available on a much larger scale, and the links need to be accessible not just on museum sites but everywhere specimens are used. Nor is this issue restricted to natural history collections. My use of "PIDs" in this blog post (as opposed, say, to GUIDs) reflects the fact that part of the motivation for this work is my participation in the Towards a National Collection - HeritagePIDs project (@HeritagePIDs), whose scope includes collections and archives from many different fields.


Magic happens

The screenshots below show the same web pages as before, but now we have an overlay window that displays additional information. For specimen 2012.8.23.3 we see a paper (together with a list of the authors, each sporting an ORCID). This is the PLoS ONE paper, which cites this specimen.




Likewise if we go to the PLoS ONE paper, we now see a list of specimens from the NHM that are mentioned in that paper.




What happened?

The overlay is generated by a bookmarklet, a piece of Javascript that displays an overlay on the right hand side of the web page, then does two things:
  1. It reads the web page to find out what the page is "about" (the main entity). It does this by looking for tags such as rel="canonical", og:url, or a meta tag with a DOI. It turns out that lots of relevant sites don't include a machine readable way of saying what they are about (which led to my tweet that annoyed Vince Smith, see above). While it may be "obvious" to a human what a site is about, we need to spell that out for computers. The easy way to do this is to explicitly include a URL or other persistent identifier for the subject of the web page.
  2. Once the script has figured out what the page is about, it then talks to a triple store that I have set up and asks "do you have any annotations for this thing?". If so, they are returned as a DataFeed (basically a JSON-LD variant of an RSS feed) and the results are displayed in the overlay.
Step one hinges on the entity of interest having a persistent identifier, and that identifier being easy to locate in the web page. Academic publishers are pretty good at doing this, mainly because it increases their visibility to search engines such as Google Scholar, and also because it helps reference managers such as Zotero automatically extract bibliographic data for a paper. These drivers don't exist for many types of data (such as specimens, or DNA sequences, or people), and so often those sites will need custom code to extract the corresponding identifier.
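Step one can be sketched as follows. This is purely illustrative: it uses regexes over the HTML source, whereas the actual bookmarklet inspects the live DOM, and real pages expose identifiers in many more ways than these two tags.

```javascript
// Sketch of step 1: working out what a page is "about".
// Look for a canonical link, or a citation_doi meta tag (a convention
// many academic publishers follow for Google Scholar and Zotero).
function findCanonical(html) {
  // <link rel="canonical" href="...">
  const m = html.match(/<link[^>]+rel=["']canonical["'][^>]+href=["']([^"']+)["']/i);
  return m ? m[1] : null;
}

function findDoi(html) {
  // <meta name="citation_doi" content="10.xxxx/yyyy">
  const m = html.match(/<meta[^>]+name=["']citation_doi["'][^>]+content=["']([^"']+)["']/i);
  return m ? "https://doi.org/" + m[1] : null;
}

// Prefer a DOI if present, otherwise fall back to the canonical URL.
function pageSubject(html) {
  return findDoi(html) || findCanonical(html);
}
```

If neither tag is present we are stuck, which is exactly the "what is this page about?" problem described above.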

Step two requires that we have a database somewhere that knows whether two things are linked. For various reasons I've settled on using a triplestore for this data, and I'm modelling the connection between two things as an annotation. Below is the (simplified) JSON-LD for an annotation linking the NHM specimen 2012.8.23.3 to the paper Revision of Gangesia (Cestoda: Proteocephalidea) in ... .

{
  "type": "Annotation",
  "body": {
    "id": "https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef",
    "name": "2012.8.23.3"
  },
  "target": {
    "name": "Revision of Gangesia (Cestoda: Proteocephalidea) in ...",
    "canonical": "https://doi.org/10.1371/journal.pone.0046421"
  }
}

Strictly speaking we could have something even more minimal:

{
  "type": "Annotation",
  "body": "https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef",
  "target": "https://doi.org/10.1371/journal.pone.0046421"
}

But this means we couldn't display the names of the specimen and the paper in the overlay. (The use of canonical in the target is in preparation for when annotations are made on specific representations, such as a PDF of a paper or the same paper in HTML, and I want to be able to group those together.)
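Step two of the bookmarklet can also be sketched in a few lines. The endpoint URL is a placeholder, and the exact shape of the returned DataFeed (here schema.org's dataFeedElement) is an assumption, not the real service's response format:

```javascript
// Sketch of step 2: asking the triplestore "do you have any
// annotations for this thing?". ANNOTATION_ENDPOINT is a placeholder.
const ANNOTATION_ENDPOINT = "https://example.org/annotations";

function annotationQueryUrl(subject) {
  return ANNOTATION_ENDPOINT + "?id=" + encodeURIComponent(subject);
}

// Flatten a DataFeed (a JSON-LD flavoured RSS feed) into simple
// items the overlay can display: a human-readable name and a link.
function feedToItems(feed) {
  return (feed.dataFeedElement || []).map(e => ({
    name: e.name || e.id,
    url: e.canonical || e.id
  }));
}
```

In the bookmarklet itself the query URL would be passed to fetch() and the items rendered into the overlay's DOM.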

Leaving aside these technical details, the key thing is that we have a simple way to link two things together.


Where do the links come from?

Now we hit the $64,000 question: how do we know that specimen https://data.nhm.ac.uk/object/6e8be646-486e-4193-ac46-e13e23c5daef and paper https://doi.org/10.1371/journal.pone.0046421 are linked? To answer that we need to text mine papers looking for specimen codes (such as 2012.8.23.3), discover the persistent identifier that corresponds to that code, then combine it with the persistent identifier for the entity that refers to the specimen (such as a paper, a supplementary data file, or a DNA sequence).
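The first part of that pipeline might start with something as simple as a pattern match. The regex below is purely illustrative (it matches the dotted style of the catalogue number 2012.8.23.3 used in this example); real text mining needs much more context to avoid false positives such as dates or section numbers:

```javascript
// Illustrative only: a naive pattern for dotted catalogue numbers
// of the form seen in "2012.8.23.3" (year.month.day.number).
const CODE_PATTERN = /\b(19|20)\d{2}\.\d{1,2}\.\d{1,2}\.\d{1,4}\b/g;

// Return all candidate specimen codes found in a block of text.
function findSpecimenCodes(text) {
  return text.match(CODE_PATTERN) || [];
}
```

Each candidate code would then still need to be resolved to its persistent identifier, which is a separate (and harder) lookup problem.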

For this example I'm spared that work because Ross Mounce (@rmounce) and Aime Rankin (@AimeRankin) did exactly that for some NHM specimens (see doi:10.5281/zenodo.34966 and https://github.com/rossmounce/NHM-specimens). So I just wrote a script to parse a CSV file and output specimen and publication identifiers as annotations. So that I can display more than bare identifiers I also grabbed RDF for the specimens, publications, and people. The RDF for the NHM specimens is available by simply appending an extension (such as .jsonld) to the specimen URL; you can get RDF for people and their papers from ORCID (and other sources).
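The CSV-to-annotation step is trivial. A minimal sketch, assuming a two-column file of specimen URL and article DOI (the column layout here is an assumption, not that of Ross and Aime's actual file):

```javascript
// Sketch: turn a CSV of (specimen URL, article DOI) pairs into
// minimal Web Annotations of the kind shown earlier.
function csvToAnnotations(csv) {
  return csv
    .trim()
    .split("\n")
    .slice(1) // skip the header row
    .map(line => {
      const [specimen, doi] = line.split(",");
      return {
        type: "Annotation",
        body: specimen.trim(),
        target: doi.trim()
      };
    });
}
```

The resulting JSON-LD objects can then be loaded straight into the triplestore.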

As an aside, I couldn't use Ross and Aime's work "as is" because the persistent identifiers had changed (sigh). The NHM has changed specimen URLs (replacing /specimen/ with /object/) and switched from http to https. Even the DOIs have changed, in that the HTTP resolver http://dx.doi.org has now been replaced by https://doi.org. So I had to fix that. If you want this stuff to work DO NOT EVER CHANGE IDENTIFIERS!
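For the record, the fix amounts to the two rewrites just described:

```javascript
// Update old NHM specimen URLs: http -> https, /specimen/ -> /object/.
function fixSpecimenUrl(url) {
  return url.replace(/^http:/, "https:").replace("/specimen/", "/object/");
}

// Update old-style DOI URLs: http://dx.doi.org -> https://doi.org.
function fixDoiUrl(url) {
  return url.replace("http://dx.doi.org/", "https://doi.org/");
}
```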


How can I get this bookmarklet thingy?

To install the bookmarklet go to https://pid-demonstrator.herokuapp.com and click and hold the "Annotate It!" link, then drag it to your web browser's toolbar (on Safari it's the "Favourites Bar", on Chrome and Firefox it's the "Bookmarks Bar"). When you are looking at a web page, click "Annotate It!". At the moment the NHM PLoS example above is the only one that does anything interesting; this will change as I add more data.

My horribly rough code is here: https://github.com/rdmpage/pid-demonstrator.


What's next?

The annotation model doesn't just apply to specimens. For example, I'd love to be able to flag pages in BHL as being of special interest, such as "this page is the original description of this species". This presents some additional challenges because the user can scroll through BHL and change the page, so I would need the bookmarklet to be aware of that and query the triplestore for each new page. I've got the first part of the code working, in that if you try the bookmarklet on a BHL page it "knows" when you've scrolled to a different page.
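The page-awareness part boils down to watching the BHL page identifier (BHL page URLs have the form https://www.biodiversitylibrary.org/page/NNNN) and re-querying only when it changes. A minimal sketch:

```javascript
// Extract the numeric page id from a BHL page URL, or null.
function bhlPageId(url) {
  const m = url.match(/biodiversitylibrary\.org\/page\/(\d+)/);
  return m ? m[1] : null;
}

// Re-run the annotation query only when the visible page changes.
let currentPage = null;
function maybeRequery(url, query) {
  const id = bhlPageId(url);
  if (id && id !== currentPage) {
    currentPage = id;
    query(id); // fetch annotations for the newly visible page
  }
}
```

In the bookmarklet this check would run whenever BHL's viewer updates the displayed page.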



I obviously need to populate the triplestore with a lot more annotations. I think the simplest way forward is just to have spreadsheets (e.g., CSV files) with columns of specimen identifiers and DOIs and convert those into annotations.

Lastly, another source of annotations is those made by readers using tools such as hypothes.is, which I've explored earlier (see Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday). So we can imagine a mix of annotations made by machines and annotations made by people, both helping construct part of the biodiversity knowledge graph. The same graph can then be used to explore the connections between specimens and publications, and perhaps lead to metrics of scientific engagement with natural history collections.