Wednesday, July 22, 2020

DNA barcode browser

Motivated by the 2020 Ebbe Nielsen Challenge I've put together an interactive DNA barcode browser. The app is live at

A naturalist from the 19th century would find little in GBIF that they weren’t familiar with. We have species in a Linnean hierarchy, their distributions plotted on a map. This method of summarising data is appropriate to much of the data in GBIF, but impoverishes the display of sequence data such as barcodes. Given a set of DNA barcodes we can compute a phylogeny for those sequences, and gain evidence for taxonomic groups, intraspecific genetic structure, etc. So I wanted to see if it was possible to make a simple tool to interactively explore barcode data. This means we need fast methods for searching for similar sequences, and building phylogenies. I've been experimenting with ways to do this for the last couple of years, but have only now managed to put something together. For more details, see the repository. There is also a quick introductory video.

Friday, July 17, 2020

Taxonomic concepts for dummies

[Work in progress]

The "dummy" in this case is me. I'm trying to make sense of how to model taxa, especially in the context of linked data, and projects such as Wikidata where there is uncertainty over just what a taxon in Wikidata actually represents. There is also ongoing work by the TDWG Taxon Names and Concepts Interest Group. This is all very rough and I'm still working on this, but here goes.

I'm going to assume that names and publications are fairly unproblematic. We have a "controlled thesaurus of scientific names" to use Franck Michel et al.'s phrase, provided by nomenclators, and we have publications. Both names and publications have identifiers.

Then we have "usages" or "instances", which are (name, publication) pairs. These can be of different levels of precision, such as "this name appears in this book", or it "appears on page 5", or in this paragraph, etc. I tend to view "instances" as like annotations, as if you highlighted a name in some text. Indeed, indexing the taxonomic literature and recording the location of each taxonomic name is essentially generating the set of all instances. We can have identifiers for each instance.
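To make the idea concrete, here is a minimal Python sketch of an instance as a (name, publication) pair with its own identifier (all the identifiers below are invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    """A usage: a name annotated at a location in a publication."""
    instance_id: str      # identifier for this usage/annotation
    name_id: str          # identifier for the name, e.g. from a nomenclator
    publication_id: str   # identifier for the publication, e.g. a DOI
    locator: str = ""     # optional finer-grained location, e.g. "page 5"

# Two usages of the same name in two different publications:
u1 = Instance("inst-1", "name-42", "doi:10.1234/aaa", "page 5")
u2 = Instance("inst-2", "name-42", "doi:10.1234/bbb")
```

Because each instance has its own identifier, relationships between instances (synonymies, lists of previous usages) can then be expressed as links between those identifiers.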

Then we can have relationships between instances. For example, in taxonomic publications you will often see a researcher list previous mentions of a name in other works, synonyms, etc. So we can have links between instances (which is one reason why they need identifiers).

OK, now comes the tricky bit. Up to now things are pretty simple, we have strings (names), we have sets of strings (publications), the location of strings in publications (usages/instances/annotations), and relationships between those instances (e.g., there may be a list of them on the same page in a publication). What about taxa?

It seems to me there are at least two different ways people model taxa in databases. The first is the classic Darwin Core spreadsheet style model of one unique name per row. If a name is "accepted" it has a parent, if it isn't "accepted" then it doesn't have a pointer to a parent, but does have a pointer to an accepted name:

If we treat each row as an instance (e.g., name plus original publication of that name), then the identifier for the instance is the same as the identifier for the taxon. A practical consequence of this is that if the name of a taxon changes, so does the identifier, therefore any data attached to that identifier can be orphaned (or, at least, needs to be reconnected) when there is taxonomic change.

In reality, the rows aren't really usages, they are simply names (mostly) without associated publications. We could be generous and model these as instances where we lack the publication, but basically we have rows of names and pointers between them. So taxa = accepted names. This is the model used by ITIS and GBIF.

Now, a different model is to wrap one or more instances into a set and call that a taxon, and the taxon points to an instance that contains the currently accepted name. The taxon has a parent that is itself a taxon. This separates taxa and names, and has the practical consequence that we can have a taxon identifier that persists even if the taxonomic name changes. Hence data linked to the taxon remains linked even if the name changes. And every name applied to a taxon has the same taxon identifier. This makes it possible to track changes in classifications using the diff approach I've discussed earlier. This is the model used by (as far as I can make out) the Australian NSL (site seems perpetually offline), eBird, and Avibase, for example.
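The difference between the two models can be sketched in a few lines of Python (all identifiers invented; the field names loosely echo Darwin Core but are not meant to be exact). In the first, the taxon identifier is the accepted name's identifier, so it changes when the name does; in the second, the taxon is a set of instances with its own identifier that survives name changes:

```python
# Model 1: Darwin Core spreadsheet style, one row per name; taxon == accepted name.
# If the accepted name changes, the taxon identifier changes with it.
dwc_rows = {
    "n1": {"name": "Sasia africana", "acceptedNameUsageID": None, "parent": "n9"},
    "n2": {"name": "Verreauxia africana", "acceptedNameUsageID": "n1", "parent": None},
}

# Model 2: a taxon wraps a set of instances and has its own identifier.
taxa = {
    "t1": {
        "instances": {"n1", "n2"},  # every name ever applied to this taxon
        "accepted": "n1",           # the instance holding the accepted name
        "parent": "t9",             # the parent is itself a taxon
    }
}

# A taxonomic change: the accepted name flips, but "t1" persists,
# so any data attached to "t1" stays attached.
taxa["t1"]["accepted"] = "n2"
```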

Wikidata has something very like GBIF: there is essentially no distinction between taxa and names. Hence the identifiers associated with a Wikidata "taxon" can be identifiers for names (e.g., from IPNI) or for taxa (e.g., Avibase), with no clear distinction made between these rather different things.

So, what to do, and does this matter? Maybe part of the problem is the notion that identifiers for taxa should be unique, that is, only one Wikidata item can have an Avibase ID. This would work if a Wikidata taxon was, in fact, a taxon (and that it matched the Avibase classification). Perhaps we could treat Wikidata "taxa" as something like an instance, and label each Wikidata item that belongs in the same taxon (according to particular identifier being used) with the same taxon identifier?

More to come...

Persistent Identifiers: A demo and a rant

This morning, as part of a webinar on persistent identifiers, I gave a live demo of a little toy to demonstrate linking together museum and herbaria specimens with publications that use those specimens. A video of an earlier run through of the demo appears below, for background on this demo see Diddling with semantic data: linking natural history collections to the scientific literature. The slides I used in this demo are available here:

One thing which struck me during this webinar is that discussions about persistent identifiers (PIDs) (also called "GUIDs") seem to endlessly cycle through the same topics; it is as if each community has to reinvent everything and rehash the same debates before it reaches a solution. An alternative is to try and learn from systems that work.

In this respect CrossRef is a great example:

  1. They have (mostly) brand neutral, actionable identifiers that resolve to something of value (DOIs resolve to an article you can read).
  2. They are persistent by virtue of redirection (a DOI is a pointer to a pointer to the thing, a bit like a P.O. Box number). The identifiers are managed.
  3. They have machine readable identifiers that can be used to support an ecosystem of services (e.g., most reference managers just need you to type in a DOI and they do the rest, Altmetric “donuts”, etc.).
  4. They have tools for discoverability, in other words, if you have the metadata for an article they can tell you what the corresponding DOI is.
  5. The identifiers deliver something of value to publishers that use them, such as the citation graph of interconnected articles (which means your article is automatically linked to other articles) and you can get real time metrics of use (e.g., DOIs being cited in Wikipedia articles).
  6. There is a strong incentive for publishers to use other publishers' identifiers (DOIs) in their own content because if that is reciprocated then you get web traffic (and citation counts). If publishers use their own local identifiers for external content they lose out on these benefits.
  7. There are services to handle cases when things break. If a DOI doesn’t resolve, you can talk to a human who will attempt to fix it.
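Point 2 is worth unpacking. A DOI works like a managed lookup-table entry: the identifier is fixed, and only the entry behind it changes when content moves. A toy Python sketch (the DOI and URLs here are made up; real DOIs are resolved by the Handle system via doi.org):

```python
# Toy resolver: a DOI is a pointer to a pointer to the thing.
resolver = {"10.1234/example": "http://oldpublisher.example.org/article/1"}

def resolve(doi: str) -> str:
    """Follow the managed redirect for a DOI."""
    return resolver[doi]

# The publisher moves the article: the identifier stays the same,
# only the managed redirect behind it is updated.
resolver["10.1234/example"] = "https://newpublisher.example.org/papers/1"
print(resolve("10.1234/example"))  # https://newpublisher.example.org/papers/1
```

The point of the sketch is that links stored as DOIs survive the move; links stored as the publisher's URLs do not.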

My point is that there is an entire ecosystem built around DOIs, and it works. Typically every community considering persistent identifiers attempts to build something themselves, ideally for “free", and ends up with a mess, or a system that doesn’t provide the expected benefits, because the message they got was “we need to get PIDs” rather than “build an infrastructure to enable these things that we need”.

I think we can also learn from systems that failed. In biology, LSIDs failed for a bunch of reasons, mainly because they didn’t resolve to anything useful (they were only machine readable). Also they were free, which seems great, except it means there is no cost to giving them up (which is exactly what people did). Every time someone advocates a PID that is free they are advocating a system that is designed to fail. Minting a UUID for everything costs nothing and is worth nothing. If you think a URL to a website in your organisation's domain is a persistent identifier, just think what happened to all those HTTP URLs stored in databases when, post Edward Snowden, the web switched to HTTPS.

One issue which came up in the webinar was the status of ISBNs, which aren't actionable PIDs (in the sense that there's no obvious way to stick one in a web browser and get something back). ISBNs have failed to migrate to the web because, I suspect, they are commercially valuable, which means a fight over who exploits that value. Whoever provides the global resolver for ISBNs then gets to influence where you buy the corresponding book. Book publishers were slow to sell online, so Amazon gobbled up the online market, and in effect the de facto global resolver (Amazon) makes all the money. The British Library got into trouble for exactly this reason when they provided links to Amazon (see British Library sparks Amazon row). Furthermore, unlike DOIs for the scholarly literature, there aren’t really any network effects for ISBNs - publishers don’t benefit from other publishers having their content online. So I think ISBNs are an interesting example of the economics of identifiers, and the challenging problem of identifiers for things that are not specific to one organisation. It's much easier to think about identifiers for your stuff because you control that stuff and how it is represented. But who gets to decide on the PID for, say, Homo sapiens?

So, while we navigate the identifier acronym soup (DOI, LSID, ORCID, URL, URI, IRI, PURL, ARK, UUID, ISBN, Handles) and rehash arguments that multiple communities have been having for decades, maybe it's a good time to pause, take a look at other communities, and see what has worked and what hasn't, and why. It may well be that in many cases the kinds of drivers that make CrossRef a success (identifiers return something of value, network effects as the citation graph grows, etc.) might not exist in many heritage situations, but that in itself would be useful to know, and might help explain why we have been sluggish in adopting persistent identifiers.

Wednesday, July 15, 2020

Darwin Core Million now twice a year

The following is a guest post by Bob Mesibov.

The first Darwin Core Million closed on 31 March with no winner. Since I'm seeing better datasets this year in my auditing work for Pensoft, I've decided to run the competition every six months.

Missed the first Darwin Core Million and don't know what it's about? Don't get too excited by the word "million". It refers to the number of data items in a Darwin Core occurrences table, not to the prize!

The rules

  • Anyone can enter, but the competition applies only to publicly available Darwin Core occurrence datasets. These might have been uploaded to an aggregator, such as GBIF, ALA or iDigBio, or to an open-data repository.
  • Select about one million data items from the dataset. That could be 50000 records in 20 populated Darwin Core fields, or 20000 records in 50 populated Darwin Core fields, or something in between. Email the dataset to me after 1 September and before 30 September as a zipped, plain-text file, together with a DOI or URL for the online version of the dataset.
  • I'll audit datasets in the order I receive them. If I can't find serious data quality problems (see below) in your dataset, I'll pay your institution AUD$150 and declare your institution the winner of the Darwin Core Million here on iPhylo. There's only one winner in each competition round; datasets received after the first problem-free dataset won't be checked.
  • If I find serious data quality problems, I'll let you know by email. If you want to learn what the problems are, I'll send you a report detailing what should be fixed and charge your institution AUD$150. At 0.3-0.75c/record, that's a bargain compared to commercial data-checking rates. And it would be really good to hear, later on, that those problems had indeed been fixed and that corrected data items had replaced the originals online.

How the data are judged

For a list of data quality problems, see this page in my Data Cleaner's Cookbook. The key problems I look for are:

  • duplicate records
  • invalid data items
  • missing-but-expected items
  • data items in the wrong fields
  • data items inappropriate for their field
  • truncated data items
  • records with items in one field disagreeing with items in another
  • character encoding errors
  • wildly erroneous dates or coordinates
  • incorrect or inconsistent formatting of dates, names and other items

This is not just nit-picking. Your digital data items aren't mainly for humans to read and interpret, they're intended in the first place for parsing and managing by computers. "Western Hill" might not be the same as "Western Hill" in processing, for example, because the second placename might have a no-break space between the words instead of a plain space. Another example: humans see these 22 variations on collector names as "the same", but computers don't.
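The no-break space example is easy to demonstrate in Python:

```python
plain = "Western Hill"        # U+0020 SPACE between the words
nbsp = "Western\u00a0Hill"    # U+00A0 NO-BREAK SPACE between the words

print(plain == nbsp)          # False: identical to a human, different to a computer

# One simple normalisation before comparing:
print(plain == nbsp.replace("\u00a0", " "))  # True
```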

Please also note that data quality isn't the same as data accuracy. Is Western Hill really at those coordinates? Is the specimen ID correct? Is the barely legible collector name on the specimen label correctly interpreted? These are questions about data accuracy. But it's possible to have entirely correct digital data that can't be processed by an application, or moved between applications, because the data suffer from one or more of the problems listed above.

Fine points

I think I'm pretty reasonable about the "serious" in "serious data quality problems". One character encoding error, such as "L'H?rit" repeated in the "scientificNameAuthorship" field, isn't serious, but multiple errors scattered through several fields are grounds for rejection.

For an understanding of "invalid", please refer to the Darwin Core field definitions and recommendations.

"Missing-but-expected" is important. I've seen GBIF mis-match a scientific name because the Darwin Core "kingdom" field was left blank by the data provider, even though all the other higher-taxon fields were filled in.

Please remember, entries received before 1 September won't be audited.

Thursday, July 09, 2020

Zootaxa has no impact factor

So this happened:

Zootaxa is a hugely important journal in animal taxonomy:

On one hand one could argue that impact factor is a bad way to measure academic impact, so it's tempting to say this simply reflects a poor metric that is controlled by a commercial company using data that is not open. But it quickly became clear on Taxacom that in some countries the impact factor of the journal you publish in is very important (to the point where it has a major effect on your personal income, and whether it is financially possible for you to continue your career). This discussion got rather heated, but it was quite eye opening to someone like me who casually dismisses impact factor as not interesting.

Partly in response to this I spent a little time scraping the Zootaxa web site to put together a list of all the literature cited by articles published in Zootaxa. Normally this sort of thing dies a lingering death on my hard drive, but this time I've got myself more organised and created a GitHub project for the code and I've uploaded the data to Figshare doi:10.6084/m9.figshare.c.5054372.v1. Regardless of the impact factor issue, it's potentially a fascinating window into centuries of taxonomic publications.

Lists of species don't matter: thoughts on "Principles for creating a single authoritative list of the world’s species"

Garnett et al. recently published a paper in PLoS Biology that starts with the sentence "Lists of species matter":

Garnett, S. T., Christidis, L., Conix, S., Costello, M. J., Zachos, F. E., Bánki, O. S., … Thiele, K. R. (2020). Principles for creating a single authoritative list of the world’s species. PLOS Biology, 18(7), e3000736. doi:10.1371/journal.pbio.3000736

This paper (one of a forthcoming series) is pretty much the kind of paper I try and avoid reading. It has lots of authors so it is a paper by committee, those authors all have a stake in particular projects, and it is an acronym soup of organisations the paper is pitched at. It's a well-worn strategy: write one or more papers making the case that there is a problem, then get funding based on the notion that clearly there's a problem (you've published papers saying so) and that you and your co-applicants are best placed to solve it (clearly, because you wrote the papers identifying the problem in the first place). I'm not criticising the strategy, it's how you get things done in science. It just makes for a rather uninspiring read.

From my perspective focussing on "lists" is a mistake. Lists don't really matter, it is what is on the list that counts. And I think this is where the real prize is. As I play with Wikidata I'm becoming increasingly aware of the clusterfuck mess the taxonomic database community has created by conflating taxonomic names with taxa, and by having multiple identifiers for the same things. We survive this mess by relying on taxonomic names as somewhat fuzzy identifiers, and the hope that we can communicate successfully with other people despite this ambiguity (I guess this is pretty much the basis of all communication). As Roger Hyam notes:

These taxon names we are dealing with are really just social tags that need to be organised in a central place.

Having lots of names (tags) is fine, and Wikidata is busy harvesting all these taxonomic names and their identifiers (ITIS, IPNI, NCBI, EOL, iNaturalist, eBird, etc., etc., etc.). For most of these names all we have is a mapping to other identifiers for the same name, a link to a parent taxon, and sometimes a link to a reference for the name. But what happens if we want to attach data to a taxon? Take, for example, the African Piculet Verreauxia africana. This bird has at least two scientific names, each with a separate entry in Wikidata: Verreauxia africana Q28123873 and Sasia africana Q1266812. These are the same species, yet there are two entries in Wikidata. If I want to add, say, body weight, or population size, or longevity, which Wikidata item do I add that data to?

What we need is an identifier for the species, an identifier that remains unchanged even if the name changes, or if that species moves in the taxonomic hierarchy. Some databases do this already. For example the eBird identifier for Verreauxia africana/Sasia africana is afrpic1. Because the identifier remains unchanged we can do things such as "diffs" between successive classifications showing how the species has moved between different genera (see Taxonomic publications as patch files and the notion of taxonomic concepts):


Ironically it seems that for birds the common name (in this case "African Piculet") is a more stable identifier than the scientific name (although that may well change). By having stable taxon identifiers we can then decide what entity to attach biological data to. Taxonomic names have failed to do this, but are still vital as well known tags. The actual taxon identifiers should be opaque identifiers (like "afrpic1" - not really opaque but close enough - or Avibase's C4DFB5E31495AE94). Make each opaque identifier a DOI, use existing taxonomic names as formalised tags so we aren't disconnected from the literature, use timestamped versions to track changes in species classification over time, and we have something useful.
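With a stable taxon identifier in hand, a "diff" between two versions of a classification becomes a straightforward comparison keyed on that identifier. A rough Python sketch (the two snapshots are invented for illustration, using the Sasia/Verreauxia example from above):

```python
# Two snapshots of a classification, keyed on a stable taxon identifier.
v1 = {"afrpic1": {"name": "Sasia africana", "genus": "Sasia"}}
v2 = {"afrpic1": {"name": "Verreauxia africana", "genus": "Verreauxia"}}

def diff(old, new):
    """Report taxa whose details changed between two classification versions."""
    changes = {}
    for taxon_id in old.keys() & new.keys():
        if old[taxon_id] != new[taxon_id]:
            changes[taxon_id] = (old[taxon_id]["name"], new[taxon_id]["name"])
    return changes

print(diff(v1, v2))
# {'afrpic1': ('Sasia africana', 'Verreauxia africana')}
```

Because the key is the taxon identifier rather than the name, the data attached to afrpic1 is untouched by the name change; the diff simply records it.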

This, I think, is the real prize. Rather than frame the task as making a list of species so that organisations can have a checklist they can all share, why not frame it as providing a framework that we can hang trait data on? We have vast quantities of data residing in siloed databases, spreadsheets, and centuries of biological literature. The argument shouldn't be about what is on a list, it should be how we group that information together and enable people to do their science. By providing stable identifiers that are resistant to name changes we can confidently associate trait data with taxa. Taxonomy could then actually be what it should be, the organisational framework for biological information (see Taxonomy as Information Science).

Wednesday, July 01, 2020

Diddling with semantic data: linking natural history collections to the scientific literature

A couple of weeks ago I was grumpy on the Internet (no, really) and complained about museum websites and how their pages often lacked vital metadata tags (such as rel=canonical or Facebook Open Graph tags). This got a response:
Vince's lovely line "diddle with semantic data" is the inspiration for the title of this post, in which I describe a tool to display links across datasets, such as museum specimens and scientific publications. This tool builds on ideas I discussed in 2014(!) (Rethinking annotating biodiversity data, see also "Towards a biodiversity knowledge graph" doi:10.3897/rio.2.e8767).


If you want the quick summary, here it is. If we have persistent identifiers (PIDs) for specimens and publications (or any other entities of interest), and we have a database of links between pairs of PIDs (e.g., paper x mentions specimen y), and both entities have web pages, then we can display that relationship on both web pages using a Javascript bookmarklet. We can do this without permission, in the sense that the specimen web page can be hosted by a museum (e.g., The Natural History Museum in London) and the publication hosted by a publisher (e.g., The Public Library of Science), and neither organisation need know about the connection between specimen and publication. But because we do, we can add that connection. (Under the hood this approach relies on a triplestore that stores the relationships between pairs of entities using the Web Annotation Data Model.)


Consider the web page for a specimen of the cestode Gangesia agraensis in the NHM (catalogue number 2012.8.23.3). If you look at this page the information content is pretty minimal, which is typical of many natural history specimens. In particular, we have no idea if anyone has done anything with this specimen. Has it been sequenced, imaged, or mentioned in a publication? Who knows? We have no indication of the scientific importance or otherwise of this specimen.

Now, consider the page for the PLoS ONE paper Revision of Gangesia (Cestoda: Proteocephalidea) in the Indomalayan Region: Morphology, Molecules and Surface Ultrastructure. This paper has most of the bells and whistles of a modern paper, including metrics of attention. However, despite this paper using specimens from the NHM there is no connection between the paper and the museum's collections.

Making these connections is going to be important for tracking the provenance of knowledge based on those specimens, as well as developing metrics of collection use. Some natural history collection websites have started to show these sorts of links, but we need them to be available on a much larger scale, and the links need to be accessible not just on museum sites but everywhere specimens are used. Nor is this issue restricted to natural history collections. My use of "PIDs" in this blog post (as opposed, say, to GUIDs) reflects the fact that part of the motivation for this work is my participation in the Towards a National Collection - HeritagePIDs project (@HeritagePIDs), whose scope includes collections and archives from many different fields.

Magic happens

The screenshots below show the same web pages as before, but now we have an overlay window that displays additional information. For specimen 2012.8.23.3 we see a paper (together with a list of the authors, each sporting an ORCID). This is the PLoS ONE paper, which cites this specimen.

Likewise if we go to the PLoS ONE paper, we now see a list of specimens from the NHM that are mentioned in that paper.

What happened?

The overlay is generated by a bookmarklet, a piece of Javascript that displays an overlay on the right hand side of the web page, then does two things:
  1. It reads the web page to find out what the page is "about" (the main entity). It does this by looking for tags such as rel="canonical", og:url, or a meta tag with a DOI. It turns out that lots of relevant sites don't include a machine readable way of saying what they are about (which led to my tweet that annoyed Vince Smith, see above). While it may be "obvious" to a human what a site is about, we need to spell that out for computers. The easy way to do this is explicitly include a URL or other persistent identifier for the subject of the web page.
  2. Once the script has figured out what the page is about, it then talks to a triple store that I have set up and asks "do you have any annotations for this thing?". If so, they are returned as a DataFeed (basically a JSON-LD variant of an RSS feed) and the results are displayed in the overlay.
Step one hinges on the entity of interest having a persistent identifier, and that identifier being easy to locate in the web page. Academic publishers are pretty good at doing this, mainly because it increases their visibility to search engines such as Google Scholar, and also it helps reference managers such as Zotero automatically extract bibliographic data for a paper. These drivers don't exist for many types of data (such as specimens, or DNA sequences, or people), and so often those sites will need custom code to extract the corresponding identifier.
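Step one can be sketched in a few lines. The real bookmarklet is Javascript running in the browser, but the logic — hunt for rel="canonical" or og:url in the page's HTML — looks something like this Python version using the standard library (the URL is illustrative):

```python
from html.parser import HTMLParser

class MainEntityFinder(HTMLParser):
    """Find a page's main-entity identifier via rel="canonical" or og:url."""
    def __init__(self):
        super().__init__()
        self.identifier = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.identifier = attrs.get("href")
        elif tag == "meta" and attrs.get("property") == "og:url":
            # Prefer an already-found canonical link over og:url.
            self.identifier = self.identifier or attrs.get("content")

html = '<html><head><link rel="canonical" href="https://data.example.org/specimen/123"></head></html>'
finder = MainEntityFinder()
finder.feed(html)
print(finder.identifier)  # https://data.example.org/specimen/123
```

A fuller version would also look for DOI meta tags, and fall back to site-specific scraping for pages that expose no machine-readable identifier at all.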

Step two requires that we have a database somewhere that knows whether two things are linked. For various reasons I've settled on using a triplestore for this data, and I'm modelling the connection between two things as an annotation. Below is the (simplified) JSON-LD for an annotation linking the NHM specimen 2012.8.23.3 to the paper Revision of Gangesia (Cestoda: Proteocephalidea) in ... .

{
  "type": "Annotation",
  "body": {
    "id": "",
    "name": "2012.8.23.3"
  },
  "target": {
    "name": "Revision of Gangesia (Cestoda: Proteocephalidea) in ...",
    "canonical": ""
  }
}

Strictly speaking we could have something even more minimal:

{
  "type": "Annotation",
  "body": "",
  "target": ""
}

But this means we couldn't display the names of the specimen and the paper in the overlay. (The use of canonical in the target is preparation for when annotations will be made on specific representations, such as a PDF of a paper, the same paper in HTML, etc. and I want to be able to group those together.)

Leaving aside these technical details, the key thing is that we have a simple way to link two things together.

Where do the links come from?

Now we hit the $64,000 Question, how do we know that specimen and paper are linked? To do that we need to text mine papers looking for specimen codes (like 2012.8.23.3), discover the persistent identifier that corresponds to that code, then combine that with the persistent identifier for the entity that refers to that specimen (such as a paper, a supplementary data file, or a DNA sequence).

For this example I'm spared that work because Ross Mounce (@rmounce) and Aime Rankin (@AimeRankin) did exactly that for some NHM specimens (see doi:10.5281/zenodo.34966 and So I just wrote a script to parse a CSV file and output specimen and publication identifiers as annotations. So that I can display more I also grabbed RDF for the specimens, publications, and people. The RDF for the NHM specimens is available by simply appending an extension (such as .jsonld) to the specimen URL, you can get RDF for people and their papers from ORCID (and other sources).
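The conversion script amounts to reading identifier pairs and emitting annotations. A simplified Python sketch (the column names and identifiers here are invented; Ross and Aime's actual files differ):

```python
import csv
import io
import json

def rows_to_annotations(csv_text):
    """Turn rows of (specimen, publication) identifier pairs into annotations."""
    annotations = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        annotations.append({
            "type": "Annotation",
            "body": {"id": row["specimen_id"]},
            "target": {"id": row["publication_id"]},
        })
    return annotations

csv_text = """specimen_id,publication_id
urn:specimen:2012.8.23.3,doi:10.1234/example
"""
print(json.dumps(rows_to_annotations(csv_text), indent=2))
```

The output would then be converted to RDF and loaded into the triplestore.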

As an aside, I couldn't use Ross and Aime's work "as is" because the persistent identifiers had changed (sigh). The NHM has changed specimen URLs (replacing /specimen/ with /object/) and switched from http to https. Even the DOIs have changed in that the HTTP resolver has now been replaced by. So I had to fix that. If you want this stuff to work DO NOT EVER CHANGE IDENTIFIERS!

How can I get this bookmarklet thingy?

To install the bookmarklet go to and click and hold the "Annotate It!" link, then drag it to your web browser toolbar (on Safari it's the "Favourites Bar", on Chrome and Firefox it's the "Bookmarks Bar"). When you are looking at a web page click "Annotate It!". At the moment the NHM PLoS example above is the only one that does anything interesting, this will change as I add more data.

My horribly rough code is here:

What's next?

The annotation model doesn't just apply to specimens. For example, I'd love to be able to flag pages in BHL as being of special interest, such as "this page is the original description of this species". This presents some additional challenges because the user can scroll through BHL and change the page, so I would need the bookmarklet to be aware of that and query the triplestore for each new page. I've got the first part of the code working, in that if you try the bookmarklet on a BHL page it "knows" when you've scrolled to a different page.

I obviously need to populate the triplestore with a lot more annotations. I think the simplest way forward is just to have spreadsheets (e.g., CSV files) with columns of specimen identifiers and DOI and convert those into annotations.

Lastly, another source of annotations are those made by readers using tools such as, which I've explored earlier (see Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday). So we can imagine a mix of annotations made by machine, and annotations made by people, both helping construct a part of the biodiversity knowledge graph. This same graph can then be used to explore the connections between specimens and publications, and perhaps lead to metrics of scientific engagement with natural history collections.

Monday, June 08, 2020

Towards visualising classifications from Wikidata

These are simply notes to myself about taxonomic classifications in Wikidata.

Classifications in Wikidata can be complex and are often not trees. For example, if we trace the parents of the frog family Leptodactylidae back we get a graph like this:

Each oval represents a taxon in Wikidata, and each arrow connects a taxon to its parent(s) in Wikidata.  Likewise, if we do the same for the albatross genus Diomedea we get a similarly complex diagram:

The presence of multiple classifications likely reflects several factors. If you deal with just extant species you are likely to have fairly shallow classifications; for example, the kingdom, phylum, class, order, family, genus ranks used by GBIF may be enough. Some taxonomic groups may routinely use ranks such as subfamily, and in well-studied groups there may be additional taxa based on phylogenetic research (e.g., the RTA clade in spiders). And of course, different Wikidata editors may favour different classifications.

Anecdotally (certainly for vertebrates), many of the additional levels in the classifications in Wikidata come from fossil taxa. In the case of birds, extant Aves (birds) are a fairly isolated group in the tree of life, but as we go down the tree towards their common ancestor with the crocodilians we encounter dinosaurs and other taxa. So if you are a palaeontologist the jump from, say Aves to Tetrapoda skips over a fairly significant part of the tree!

Faced with this complexity, how do we display a Wikidata classification in a simple way? One approach may be to display only a classification from a particular source, for example Mammal Species of the World. This requires that Wikidata has that classification, and enough information for you to extract it by a SPARQL query (for example if each node in the classification that is in MSW has a reference to MSW attached to that node).
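As a sketch, such a query could be run against the Wikidata SPARQL endpoint from Python's standard library. P171 (parent taxon) and P225 (taxon name) are real Wikidata properties, and references hang off statements via prov:wasDerivedFrom and pr:P248 ("stated in"); the source Q-id used below is a hypothetical placeholder, not the actual item for MSW.

```python
import json
import urllib.parse
import urllib.request

# Sketch: parent-taxon statements whose reference says "stated in" a
# particular source. wd:Q1538807 is a hypothetical placeholder Q-id.
QUERY = """
SELECT ?taxonName ?parentName WHERE {
  ?taxon p:P171 ?stmt .
  ?stmt ps:P171 ?parent ;
        prov:wasDerivedFrom/pr:P248 wd:Q1538807 .
  ?taxon wdt:P225 ?taxonName .
  ?parent wdt:P225 ?parentName .
}
"""

ENDPOINT = "https://query.wikidata.org/sparql"

def build_url(sparql):
    """URL for a JSON-results request to the Wikidata query service."""
    return ENDPOINT + "?format=json&query=" + urllib.parse.quote(sparql)

def run_query(sparql):
    """Run the query and return the SPARQL result bindings."""
    req = urllib.request.Request(build_url(sparql),
                                 headers={"User-Agent": "taxon-demo/0.1"})
    with urllib.request.urlopen(req) as f:
        return json.load(f)["results"]["bindings"]
```

Each binding in the result is one (taxon, parent) pair that the chosen source supports, from which a tree can be assembled.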

Another approach is to extract a simplified classification from the sort of graphs shown above. Technically, these graphs are DAGs (Directed acyclic graphs). An obvious way to simplify a DAG is to find the shortest path in that DAG. For example, the path (Eukaryota, Animalia, Bilateria, Deuterostomia, Chordata, Olfactores, Gnathostomata, Tetrapoda, Amphibia, Anura, Leptodactylidae) is a path through the DAG shown above. Shortest paths are reasonably easy to find once you have a topological sorting of the graph (see e.g. Shortest Path in Directed Acyclic Graph). At the moment this looks like the best bet for displaying classifications from Wikidata.
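Because every parent link here counts as one step, the unit-weight special case of the DAG shortest-path problem can also be solved with a plain breadth-first search. A minimal sketch, using an invented toy fragment of the frog graph rather than real Wikidata output:

```python
from collections import defaultdict, deque

def shortest_path(edges, start, end):
    """edges: (child, parent) pairs. Returns the shortest path from start
    up to end following child -> parent links (BFS, all edges weight 1),
    or None if end is unreachable."""
    graph = defaultdict(list)
    for child, parent in edges:
        graph[child].append(parent)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy fragment (illustrative only): two routes from Anura to Amphibia.
edges = [("Leptodactylidae", "Anura"), ("Anura", "Amphibia"),
         ("Anura", "Batrachia"), ("Batrachia", "Amphibia"),
         ("Amphibia", "Tetrapoda")]
print(shortest_path(edges, "Leptodactylidae", "Tetrapoda"))
```

BFS finds the path that skips the intermediate Batrachia node, which is exactly the simplification wanted for display.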

Preferred classifications

In some cases the classification in Wikidata is complicated, but this complexity isn’t reflected in SPARQL results because parts of that classification have different “ranks”. For example, for the plant order Fagales there are currently seven parents:
  • fabids
  • Rosanae
  • Hamamelididae
  • eurosids I
  • Monochlamydeae
  • Archichlamydeae
  • Juglandanae
One of these is flagged “Preferred rank” (fabids) and the others are “Normal rank”. As a result only the fabids appear in the list of parents.

Monday, April 20, 2020

Making sense of how Wikidata models taxonomy

Given my renewed enthusiasm for Wikidata, I'm trying to get my head around the way that Wikidata models biological taxonomy. As a first pass, here's a diagram of the properties linked to a taxonomic name. The model is fairly comprehensive: it includes relationships between names (e.g., basionym, protonym, replacement), between taxa (e.g., parent taxon), and links to the literature. It's also a complex model to query, given that a lot of information is expressed using qualifiers. Hence there's a bit of head scratching while I figure out the relationship between properties, statements, etc.

Links to the literature are one of my interests, and in cases where Wikidata has this information you can start to enhance the way we display publications, e.g.

The Wikidata model is very like that used in Darwin Core, where everything is a taxon and every taxon has a name, which means that relationships that are notionally between names and not taxa (e.g., basionym) are all treated as relationships between taxa.

One big challenge is how to interpret Wikidata as a classification, given that we expect classifications to be trees. The taxonomic classification in Wikidata is clearly not a tree, for example:

What I think is happening here is that different people are adding different parent taxa, depending on which classification they follow. Some classifications (e.g., that used by GBIF) are "shallow" with only a few levels (e.g., kingdom, phylum, class, order, family, genus), other classifications are deep (e.g., NCBI). So the idea of simply being able to do a SPARQL query and get a tree (e.g. Displaying taxonomic classifications from Wikidata using d3js and SPARQL) runs into problems. But this could also be a strength, particularly if we had a reference or source for each parent-child pair. That way we could (a) store multiple classifications in Wikidata, and (b) have queries that retrieve classifications according to a particular source (e.g., GBIF).

So, lots of potential, but lots I've still to learn.

Friday, April 17, 2020

A planetary computer for Earth

Came across Microsoft's announcement of a "A planetary computer for a sustainable future through the power of AI", complete with a glossy video featuring Lucas Joppa @lucasjoppa (see also @Microsoft_Green and #AIforEarth).

On the one hand it's great to see super smart people with lots of resources tackling important questions, but it's hard to escape the feeling that this is the classic technology-company approach of framing difficult problems in ways that match the solutions they have to offer. Is the reason that biodiversity is declining simply that we have lacked computational resources, that our AI isn't good enough? And while forests that have been stripped of both their megafauna and previous human inhabitants make for photogenic backdrops, biodiversity can be a lot messier (and more dangerous). Still, it will be interesting to see how this plays out, and what sort of problems the planetary computer is used to tackle.

Monday, April 13, 2020

Wikidata and the bibliography of life in the time of coronavirus

I haven't posted on iPhylo for a while, and since my last post back in January things have obviously changed quite a bit. In late January and early February I was teaching a course on biodiversity informatics, and students discovered the Johns Hopkins coronavirus dashboard, which seemed like a cool way to display information on a situation that was happening on the other side of the world. All fairly abstract.

Today the dashboard looks rather different, and things are no longer quite so abstract (and, of course, never were for the population of Wuhan).

At the same time as the pandemic is affecting so many lives (and taking those of people who had a big impact on my childhood), there is the extraordinary case of open access champion Jon Tennant (@protohedgehog). On April 8th I received an item from his email newsletter entitled Converting adversity into productivity, detailing how he'd managed to get through a traumatic period prior to the coronavirus, and how productive he had managed to be (his email lists a whole slew of articles he'd written). The next day, this:

The day before, this happened:

Times like this tend to focus the mind, and for anyone with research skills the question arises "what should I be doing?". Some people are addressing issues directly or indirectly related to the pandemic. It feels like every second post on Medium features someone playing data scientist with coronavirus data. Others are taking existing tools and projects and looking for ways to make them relevant to the problem, such as Plazi and Pensoft seeking to improve access to the biology of coronavirus hosts, as part of their broader mission to make biodiversity information more accessible.

Another approach, in some ways what Jon Tennant did, is to use the time to focus on what you think matters and work on that. Of course, this assumes that you are fortunate enough to have the time and resources to do that. I have tenure and my children are grown up; life would be very different without a salary, or with small children or other dependents.

One of the things I am increasingly focussing on is the idea of Wikidata as the "bibliography of life". Specifically, I want to get as much taxonomic and related literature into Wikidata, and to link as much of that to freely-available versions of that literature (e.g., on the Internet Archive); I want that literature embedded in the citation graph, linked to authors, and linked to the taxa treated in those papers. A lot of literature is already going into Wikidata via bots that consume the stream of papers with CrossRef DOIs and upload their details to Wikidata, but there is a huge corpus of literature that this approach overlooks. Not only do we have digital libraries like the Biodiversity Heritage Library and JSTOR, but there is a long tail of small publishers making taxonomic literature available online, and I want all of this to be equally discoverable.

One aspect of this project is to populate Wikidata with this missing literature. Over the years as part of projects such as BioNames and BioStor I have accumulated hundreds of thousands of bibliographic references. These aren't much use sitting on my hard drive. Adding them to Wikidata makes them more accessible, and also enables others to make them much richer. For example, the irrepressible @SiobhanLeachman regularly converts author strings to author things:

Adding things to Wikidata is fun, but it can be a struggle to get a sense of what is in Wikidata and how it is interconnected. So I've started to build a simple app that shows me people, publications, journals, and taxa in a fairly conventional way, all powered by Wikidata. The app is live. It is not going to win any prizes for performance or design, but I find it useful.

Partly I'm trying to make the original articles more accessible, e.g.:

I'm keen to link taxonomists to their publications and ultimately the taxa they work on:

And we can link taxa and publications visually:

The community-based, somewhat chaotic consensus-driven approach of Wikidata can be frustrating ("well, if you'd asked ME, I wouldn't have done it that way"), but I think it's time to accept that this is simply the nature of the beast, and marvel at the notion that we have a globally accessible and editable knowledge graph. We can stay in our domain-specific silos, where we can control things but remain bereft of both users and contributors. However if we are willing to let go of that control, and accept that things won't always be done the way we think would be optimal, there is a lot of freedom to be gained by deferring to Wikidata's community decisions and simply getting on with building the bibliography of life. Maybe that is something worthwhile to do in this time of coronavirus.

Monday, March 23, 2020

Darwin Core Million promo: best and worst

The following is a guest post by Bob Mesibov.
There's still time (to 31 March) to enter a dataset in the 2020 Darwin Core Million, and by way of encouragement I'll celebrate here the best and worst Darwin Core datasets I've seen.
The two best are real stand-outs because both are collections of IPT resources rather than one-off wonders.

The first is published by the Peabody Museum of Natural History at Yale University. Their IPT website has 10 occurrence datasets totalling ca 1.6M records updated daily, and I've only found minor data issues in the Peabody offerings. A recent sample audit of the 151,138 records with 70 populated Darwin Core fields in the botany dataset (as of 2020-03-18) showed refreshingly clean data:
  • entries correctly assigned to DwC fields
  • no missing-but-expected entry gaps
  • consistent, widely accepted vocabularies and formatting in DwC fields
  • no duplicate records
  • no character encoding errors
  • no gremlin characters
  • no excess whitespace or fancy alternatives to simple ASCII characters
The dataset isn't perfect and occurrenceRemarks entries are truncated at 254 characters, but other errors are scarce and easily fixed, such as
  • 14 records with plant taxa mis-classified as animals
  • 4 records with dateIdentified earlier than eventDate
  • minor pseudo-duplication in several fields, e.g. "Anna Murray Vail; Elizabeth G. Britton" and "Anne Murray Vail; Elizabeth G. Britton" in recordedBy
  • minor content errors in some entries, e.g. "tissue frozen; tissue frozen" and "|" (with no other characters in the entry).
I doubt if it would take more than an hour to fix all the Peabody Museum issues besides the truncation one, which for an IPT dataset with 10.5M data items is outstanding. There are even fields in which the Museum has gone beyond what most data users would expect. Entries in vernacularName, for example, are semicolon-separated hierarchies of common names: "dwarf snapdragon; angiosperms; tracheophytes; plants" for Chaenorhinum minus.

The second IPT resource worth commending comes from GBIF Portugal and consists of 108 checklist, occurrence record and sampling event datasets. As with the Peabody resource, the datasets are consistently clean with only minor (and scattered) structural, format or content issues.

The problems appearing most often in these datasets are "double-encoding" errors with Portuguese words and no-break spaces in place of plain spaces, and for both of these we can probably blame the use of Windows programs (like Excel) at the contributing institutions. An example of double-encoding: the "ô" in the Portuguese "prôximo" is first encoded in UTF-8 as a 2-byte character, then read by a Windows program as two separate bytes, then converted back to UTF-8, resulting in the gibberish "prÃ´ximo". A large proportion of the no-break spaces in the Portuguese datasets unfortunately occur in taxon name strings, which don't parse correctly and which GBIF won't taxon-match.
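The round trip is easy to reproduce in Python (a minimal sketch of the error itself, not of any particular dataset):

```python
# "Double-encoding": UTF-8 bytes misread as Latin-1, then re-encoded.
correct = "prôximo"
garbled = correct.encode("utf-8").decode("latin-1")   # ô -> Ã´
print(garbled)

# The damage is reversible while the byte sequence survives intact:
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The same reverse round trip (re-encode as Latin-1 or Windows-1252, decode as UTF-8) is a standard first-aid repair for mojibake of this kind.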

And the worst dataset? I've seen some pretty dreadful examples from around the world, but the UK's Natural History Museum sits at the top of my list of delinquent providers. The NHM offers several million records and a disappointingly high proportion of these have very serious data quality problems. These include invalid and inappropriate entries, disagreements between fields and missing-but-expected blanks.

Ironically, the NHM's data portal allows the visitor to select and examine/download records with any one of a number of GBIF issues, like "taxon_match_none". Further, for each record the data portal reports "GBIF quality indicators", as shown in this screenshot:

Clicking on that indicator box gives the portal visitor a list of the things that GBIF found wrong with the record (a list that overlaps incompletely with the list I can find with a data audit). I'm sure the NHM sees this facility differently, but to me it nicely demonstrates that NHM has prioritised Web development over data management. The message I get is
"We know there's a lot wrong with our data, but we're not going to fix anything. Instead, we're going to hand our mess as-is to any data users out there, with cleverly designed pointers to our many failures. Suck it up, people."
In isolation NHM might be seen as doing what it can with the resources it has. In a broader context the publication of multitudes of defective records by NHM is scandalous. Institutions with smaller budgets and fewer staff do a lot better with their data — see above.


If your institution is closed and you have spare work-from-home time, consider doing some data cleaning. For those not afraid of the command line, I've archived the websites A Data Cleaner's Cookbook (version 2) and its companion blog BASHing data (first 100 posts) in Zenodo with local links between the two, so that the two resources can be downloaded and used offline in any Web browser.

Tuesday, March 03, 2020

The 2020 Darwin Core Million

The following is a guest post by Bob Mesibov.

You're feeling pretty good about your institution's collections data. After carefully tucking all the data items into their correct Darwin Core fields, you uploaded the occurrence records to GBIF, the Atlas of Living Australia (ALA) or another aggregator, and you got back a great report:

  • all your scientific names were in the aggregator's taxonomic backbone
  • all your coordinates were in the countries you said they were
  • all your dates were OK (and in ISO 8601 format!)
  • all your recorders and identifiers were properly named
  • no key data items were missing

OK, ready for the next challenge for your data? Ready for the 2020 Darwin Core Million?

How it works

From the dataset you uploaded to the aggregator, select about one million data items. That could be, say, 50000 records in 20 populated Darwin Core fields, or 20000 records in 50 populated Darwin Core fields, or something in between. Send me the data for auditing before 31 March 2020 as a zipped plain-text file by email, together with a DOI or other identifier for their online, aggregated presence.

I'll audit datasets in the order I receive them. If I can't find any data quality problems in your dataset, I'll pay your institution AUD$150 and declare your institution the winner of the 2020 Darwin Core Million here on iPhylo. (One winner only; datasets received after the first problem-free dataset won't be checked.)

If I find data quality problems, I'll let you know by email. If you want to learn what the problems are, I'll send you a report detailing what should be fixed and you'll pay me AUD$150. At 0.3-0.75c/record, that's a bargain compared to commercial data-checking rates. And it would be really good to hear, later on, that those problems had indeed been fixed and corrected data had been uploaded to the aggregator.

What I look for

For a list of data quality problems, see this page in my Data Cleaner's Cookbook. The key problems are:

  • duplicate records
  • invalid data items
  • data items in the wrong fields
  • data items inappropriate for their field
  • truncated data items
  • records with items in one field disagreeing with items in another
  • character encoding errors
  • wildly erroneous dates or coordinates
  • incorrect or inconsistent formatting of dates, names and other data items

If you think some of this is just nit-picking, you're probably thinking of your data items as things for humans to read and interpret. But these are digital data items intended for parsing and managing by computers. "Western Hill" might not be the same as "Western Hill" in processing, for example, because the second item might have a no-break space between the words instead of a plain space. Another example: humans see these 22 variations on collector names as "the same", but computers don't.
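The no-break space case is easy to demonstrate (a minimal sketch; note that Unicode NFKC compatibility normalisation happens to map U+00A0 back to a plain space):

```python
import unicodedata

plain = "Western Hill"          # U+0020 (plain space) between the words
sneaky = "Western\u00a0Hill"    # U+00A0 (no-break space)

print(plain == sneaky)          # identical to the eye, not to the computer

# NFKC normalisation maps the no-break space to U+0020:
print(unicodedata.normalize("NFKC", sneaky) == plain)
```

Normalisation fixes this particular gremlin; others (gone-wrong encodings, truncations, field disagreements) need the kind of auditing described in the post.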

You might also be thinking that data quality is all about data correctness. Is Western Hill really at those coordinates? Is the specimen ID correct? Is the barely legible collector name on the specimen label correctly interpreted? But it's possible to have entirely correct digital data that can't be processed by an application, or moved between applications, because the data suffer from one or more of the problems listed above.

I think my money is safe

The problems I look for are all easily found and fixed. However, as mentioned in a previous iPhylo post, the quality of the many institutional datasets that I've sample-audited ranges from mostly OK to pretty awful. I've also audited more than 100 datasets (many with multiple data tables) for Pensoft Publishers, and the occurrence records among them were never error-free. Some of those errors had vanished when the records had been uploaded to GBIF, because GBIF simply deleted the offending data items during processing (GBIF, bless 'em, also publish the original data items).

Neither institutions nor aggregators seem to treat occurrence records with the same regard for detail that you find in real scientific data, the kind that appear in tables in scientific journal articles. A comparison with enterprise data is even more discouraging. I'm not aware of any large museum or herbarium with a Curator of Data on the payroll, probably because no institution's income depends on the quality of the institution's data, and because collection records don't get audited the way company records do, for tax, insurance and good-governance purposes.

So there might be a winner this year, but I doubt it. Maybe next year. ALA has a year-long data quality project underway, and GBIF Executive Secretary Joe Miller (in litt.) says that GBIF is now paying closer attention to data quality. The 2021 Darwin Core Million prize could be yours...

Wednesday, January 08, 2020

ORCID serves linked data via content negotiation - who knew?

Just a note that ORCID serves data using terms from schema.org, and has done for a while (since April 2018), but somehow I missed this.

You can get linked data in JSON-LD using content negotiation. If we send a request for an ORCID profile with the header "Accept: application/ld+json" we get back something like this:

{
  "@context": "",
  "@type": "Person",
  "@id": "",
  "mainEntityOfPage": "",
  "givenName": "Mark",
  "familyName": "Hughes",
  "affiliation": {
    "@type": "Organization",
    "name": "Royal Botanic Garden Edinburgh",
    "identifier": {
      "@type": "PropertyValue",
      "propertyID": "RINGGOLD",
      "value": "41803"
    }
  },
  "@reverse": {},
  "url": [ "", "" ],
  "identifier": [
    { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "845425" },
    { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "826408" }
  ]
}

This is the profile for Mark Hughes (0000-0002-2168-0514). Up until now I've been generating my own linked data version of ORCID records that look very similar to this, but going forward this will simplify life.

Note that I've truncated the above example; it's actually this:

{ "@context": "", "@type": "Person", "@id": "", "mainEntityOfPage": "", "givenName": "Mark", "familyName": "Hughes", "affiliation": { "@type": "Organization", "name": "Royal Botanic Garden Edinburgh", "identifier": { "@type": "PropertyValue", "propertyID": "RINGGOLD", "value": "41803" } }, "@reverse": { "creator": [ { "@type": "CreativeWork", "@id": "", "name": "BEGONIA MAGUNIANA (BEGONIACEAE, BEGONIA SECT. OLIGANDRAE), A NEW SPECIES FROM NEW GUINEA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428619000283" } }, { "@type": "CreativeWork", "@id": "", "name": "A revision of Begonia sect. Petermannia on Sumatra, Indonesia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.407.1.11" } }, { "@type": "CreativeWork", "@id": "", "name": "Two new species of Begonia (Begoniaceae) from Borneo", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.407.1.4" } }, { "@type": "CreativeWork", "@id": "", "name": "AN UPDATED CHECKLIST AND A NEW SPECIES OF BEGONIA (B. RHEOPHYTICA) FROM MYANMAR", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428619000052" } }, { "@type": "CreativeWork", "@id": "", "name": "Chloroplast and nuclear DNA exchanges among Begonia sect. Baryandra species (Begoniaceae) from Palawan Island, Philippines, and descriptions of five new species.", "identifier": [ { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1371/journal.pone.0194877" }, { "@type": "PropertyValue", "propertyID": "pmc", "value": "PMC5931476" }, { "@type": "PropertyValue", "propertyID": "pmid", "value": "29718922" } ], "sameAs": [ "", "" ] }, { "@type": "CreativeWork", "@id": "", "name": "TWO NEW SPECIES OF BEGONIA FROM SUMATRA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428618000136" } }, { "@type": "CreativeWork", "@id": "", "name": "A REVISION OF BEGONIA SECT. 
SYMBEGONIA ON NEW GUINEA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s096042861800001x" } }, { "@type": "CreativeWork", "@id": "", "name": "A revision and one new species of Begonia L. (Begoniaceae, Cucurbitales) in Northeast India", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2018.396" } }, { "@type": "CreativeWork", "@id": "", "name": "Pliocene intercontinental dispersal from Africa to Southeast Asia highlighted by the new species Begonia afromigrata (Begoniaceae)", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1002/tax.606013" } }, { "@type": "CreativeWork", "@id": "", "name": "Taxonomic notes on the Philippine endemic Begonia colorata (Begoniaceae, section Petermannia)", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.381.1.16" } }, { "@type": "CreativeWork", "@id": "", "name": "Three new species of Begonia sect. Baryandra from Panay Island, Philippines.", "identifier": [ { "@type": "PropertyValue", "propertyID": "pmid", "value": "28664395" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/s40529-017-0182-x" }, { "@type": "PropertyValue", "propertyID": "pmc", "value": "PMC5491425" } ], "sameAs": [ "", "" ] }, { "@type": "CreativeWork", "@id": "", "name": "A new species of Begonia section Parvibegonia (Begoniaceae) from Thailand and Myanmar", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3767/000651917x695083" } }, { "@type": "CreativeWork", "@id": "", "name": "TAXONOMY OF BEGONIA ALBOMACULATA AND DESCRIPTION OF TWO NEW SPECIES ENDEMIC TO PERU", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428617000075" } }, { "@type": "CreativeWork", "@id": "", "name": "FOUR NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM THAILAND", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428617000051" } }, { 
"@type": "CreativeWork", "@id": "", "name": "A new species and a new record in Begonia sect. Platycentrum (Begoniaceae) from Thailand", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3850/s2382581216000077" } }, { "@type": "CreativeWork", "@id": "", "name": "Begonia yapenensis (sect. Symbegonia, Begoniaceae), a new species from Papua, Indonesia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2015.119" } }, { "@type": "CreativeWork", "@id": "", "name": "A new section (Begonia sect. Oligandrae sect. nov.) and a new species (Begonia pentandra sp. nov.) in Begoniaceae from New Guinea", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.197.1.4" } }, { "@type": "CreativeWork", "@id": "", "name": "Further discoveries in the ever-expanding genus Begonia (Begoniaceae): fifteen new species from Sumatra", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2015.167" } }, { "@type": "CreativeWork", "@id": "", "name": "Three New Species of Begonia Endemic to the Puerto Princesa Subterranean River National Park, Palawan", "identifier": [ { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201507-201507290029-201507290029-c1-14" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/s40529-015-0099-1" } ], "sameAs": "" }, { "@type": "CreativeWork", "@id": "", "name": "Memecylon pseudomegacarpum M.Hughes (Melastomataceae), a new species of tree from Peninsular Malaysia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.5852/ejt.2013.56" } }, { "@type": "CreativeWork", "@id": "", "name": "Recircumscription of Begonia sect. 
Baryandra (Begoniaceae): evidence from molecular data", "identifier": [ { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201309-201401170003-201401170003-70-74" }, { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1186/1999-3110-54-38" } ], "sameAs": "" }, { "@type": "CreativeWork", "@id": "", "name": "A new species and new combinations of Memecylon in Thailand and Peninsular Malaysia", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.11646/phytotaxa.66.1.2" } }, { "@type": "CreativeWork", "@id": "", "name": "A NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM PENINSULAR THAILAND", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428612000078" } }, { "@type": "CreativeWork", "@id": "", "name": "Pollen morphology of Begonia L. (Begoniaceae) in Nepal", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3329/bjpt.v19i2.13134" } }, { "@type": "CreativeWork", "@id": "", "name": "West to east dispersal and subsequent rapid diversification of the mega-diverse genus Begonia (Begoniaceae) in the Malesian archipelago", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1111/j.1365-2699.2011.02596.x" } }, { "@type": "CreativeWork", "@id": "", "name": "NINE NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SOUTH AND WEST SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428611000072" } }, { "@type": "CreativeWork", "name": "Begonia blancii (sect. 
Diploclinium, Begoniaceae), A New Species Endemic to the Philippine Island of Palawan", "identifier": { "@type": "PropertyValue", "propertyID": "other-id", "value": "al:1817406x-201104-201106150032-201106150032-203-209" }, "sameAs": "" }, { "@type": "CreativeWork", "@id": "", "name": "BEGONIA SECTION PETERMANNIA (BEGONIACEAE) ON PALAWAN (PHILIPPINES), INCLUDING TWO NEW SPECIES", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005307" } }, { "@type": "CreativeWork", "@id": "", "name": "TWO NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SOUTH SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005484" } }, { "@type": "CreativeWork", "@id": "", "name": "TWO NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM CENTRAL SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428609005320" } }, { "@type": "CreativeWork", "@id": "", "name": "BEGONIA VARIPELTATA(BEGONIACEAE): A NEW PELTATE SPECIES FROM SULAWESI, INDONESIA", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s096042860800509x" } }, { "@type": "CreativeWork", "@id": "", "name": "BEGONIA CLADOTRICHA (BEGONIACEAE): A NEW SPECIES FROM LAOS", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428607000777" } }, { "@type": "CreativeWork", "@id": "", "name": "FOUR NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM SULAWESI", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428606000588" } }, { "@type": "CreativeWork", "@id": "", "name": "A Phylogeny of Begonia Using Nuclear Ribosomal Sequence Data and Morphological Characters", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1600/0363644054782297" } }, { "@type": "CreativeWork", "@id": "", "name": "Population genetic structure in the endemic Begonia of the Socotra archipelago", "identifier": { "@type": 
"PropertyValue", "propertyID": "doi", "value": "10.1016/s0006-3207(02)00375-0" } }, { "@type": "CreativeWork", "@id": "", "name": "A NEW ENDEMIC SPECIES OF BEGONIA (BEGONIACEAE) FROM THE SOCOTRA ARCHIPELAGO", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1017/s0960428602000082" } }, { "@type": "CreativeWork", "@id": "", "name": "Isolation of polymorphic microsatellite markers for Begonia sutherlandii Hook. f.", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1046/j.1471-8286.2002.00201.x" } }, { "@type": "CreativeWork", "@id": "", "name": "Polymorphic microsatellite markers for the Socotran endemic herb Begonia socotrana", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.1046/j.1471-8286.2002.00185.x" } }, { "@type": "CreativeWork", "@id": "", "name": "Distribution Patterns of Begonia species in the Nepal Himalaya", "identifier": { "@type": "PropertyValue", "propertyID": "doi", "value": "10.3126/botor.v7i0.4386" } } ] }, "url": [ "", "" ], "identifier": [ { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "845425" }, { "@type": "PropertyValue", "propertyID": "Loop profile", "value": "826408" } ] }

This gives us a list of Mark's publications from ORCID. If there aren't any publications listed, the @reverse property is empty. Note that @reverse is a JSON-LD trick that enables the JSON-LD document to include not only things linked from Mark's ORCID id (e.g., his name and affiliation) but also things his ORCID id is linked to (e.g., the works listed above of which he is the author).
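A minimal Python sketch of fetching and walking this structure, using only the standard library. The profile URL pattern and Accept header come from the post; the helper names are mine, and the network call is shown untested:

```python
import json
import urllib.request

def fetch_orcid_jsonld(orcid_id):
    """Fetch an ORCID profile as JSON-LD via content negotiation,
    e.g. fetch_orcid_jsonld("0000-0002-2168-0514")."""
    req = urllib.request.Request(
        "https://orcid.org/" + orcid_id,
        headers={"Accept": "application/ld+json"})
    with urllib.request.urlopen(req) as f:
        return json.load(f)

def work_titles(profile):
    """Titles of works in the @reverse/creator list (empty if none)."""
    reverse = profile.get("@reverse", {})
    return [w.get("name") for w in reverse.get("creator", [])]
```

Because @reverse may be absent or empty, work_titles guards both levels with .get() rather than assuming the creator list exists.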

I will still be generating my own linked data from ORCID for now as I rely on knowing the order of authorship for some of my work (e.g., "Reconciling author names in taxonomic and publication databases"), and I want to be able to further process ORCID data (e.g., looking for missing DOIs), but the fact that ORCID are making JSON-LD available is going to simplify a lot of data integration tasks in the future.