Wednesday, August 26, 2020

Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu

I stumbled across this tweet yesterday (no doubt when I should have been doing other things), and disappeared down a rabbit hole. Emerging, I think the trip was worth it.


Markdown wikis

Among the tools listed by @zackfan01 were Obsidian and Roam, neither of which I had heard of before. Both are pitched as "note-taking" apps, but they are essentially personal wikis where you write text in Markdown and use [[some text goes here]] to create links to other pages. Both highlight backlinks, that is, they clearly display "what links here" on each page, making it easy to navigate around the graph you are creating by linking pages. Users of Obsidian share these graphs on Discord, rather like something from Martin MacInnes' novel "Gathering Evidence". Personal wikis have been around for a long time, but these apps are elegantly designed and seem fun to use. Looking at these apps I'm reminded of my earlier post Notes on collections, knowledge graphs, and Semantic Web browsers, where I moaned about the lack of personal knowledge graphs that supported inference from linked data. I'm also reminded of Blue Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and opportunities lost, where I constructed an interactive tool to navigate BBC data on species and their ecology (you can see this live), and the fun to be had from simply being able to navigate around a rich set of links. I imagine these Markdown-based wikis could be a great way to further explore these ideas.


Personal and global knowledge graphs

Then I began thinking: what if the [[page links]] in these personal knowledge graphs were not just some text but, say, a Wikidata identifier (of the form "Qxxxxx")? Imagine that if you were writing notes on, say, a species, you could insert the Wikidata Qid and get a pre-populated template with some facts from Wikidata, which you could then use as a starting point (see for example Toby Hudson's Entity Explosion, which I discussed earlier). Knowing that more and more scholarly papers are being added to Wikidata, this means you could also add bibliographic citations as Qids, fetching all the necessary bibliographic information on the fly from Wikidata. So your personal knowledge graph intersects with the global graph.
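As a minimal sketch of the idea, here is how a Qid-based note template might be pre-populated, assuming the facts have already been fetched from Wikidata (the Qids, labels, and properties below are illustrative only, not guaranteed to match the live data):

```python
# Sketch: pre-populate a Markdown note from Wikidata-style facts.
# The Qids and property labels here are illustrative placeholders.

def render_note(qid, label, facts):
    """Render a Markdown note whose [[links]] are Wikidata Qids."""
    lines = [f"# {label} ([[{qid}]])", ""]
    for prop, (value_qid, value_label) in facts.items():
        # Each fact links to another Wikidata item, so the personal
        # graph intersects with the global one.
        lines.append(f"- {prop}: [[{value_qid}]] {value_label}")
    return "\n".join(lines)

note = render_note(
    "Q140",  # illustrative Qid
    "Panthera leo",
    {"parent taxon": ("Q127960", "Panthera"),
     "taxon rank": ("Q7432", "species")},
)
print(note)
```

A real implementation would fetch the facts via the Wikidata API or SPARQL endpoint rather than hard-coding them.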



Now, I've not used Roam, but anyone who has is likely to balk at my characterisation of it as "just" a Markdown wiki, because there's more going on here. The Roam white paper talks about making inferences from the text using reasoning or belief networks, although these features don't seem to have had much uptake. But what really struck me as I explored Roam was the notion of not just linking to pages using the [[ ]] syntax, but also linking to parts of pages (blocks) using (( )). In the demo of Roam there are various essays, such as Paul Graham's The Refragmentation, and each paragraph is an addressable block that can be cited independently of the entire essay. Likewise, you can see what pages cite that block.

Now, in a sense these are just like the fragment identifiers we can use to link to parts of a web page, but there's something more here: these fragments are not just locations in a bigger document, they are the components of the document.


This strikes me as rather like Ted Nelson's vision of Xanadu, where you could cite any text at any level of granularity, and that text would be incorporated into the document you were creating via transclusion (i.e., you don't include a copy of the text, you include the actual text). In the context of Roam, this means you have the entire text you want to cite included in the system, so you can then show chunks of it and build up a network of ideas around each chunk. This also means that the text being worked on becomes part of the system, rather than remaining isolated, say, as a PDF or other representation. This also got me thinking about the Plazi project, where taxonomic papers are being broken into component chunks (e.g., figures, taxonomic descriptions, etc.), stored in various places, and reassembled - rather like Frankenstein's monster - in new ways, for example in GBIF or Species-ID (see doi:10.3897/zookeys.90.1369). One thing I've always found a little jarring about this approach is that you lose the context of the work each component was taken from. Yes, you can find a link to the original work and go there, but what if you could seamlessly click on a paragraph or figure and see it as part of the original article? Imagine we had all the taxonomic literature available in this way, so that we could cite any chunk, remix it (which is a key part of floras and other taxonomic monographs), but still retain the original context.


To come back full circle, in some ways tools like Obsidian and Roam are old hat: we've had wikis for a while, the idea of loading texts into wikis is old (e.g., Wikisource), backlinks are nothing new, and so on. But there's something about seeing clean, elegant interpretations of these ideas, free of syntax junk, and accompanied by clear visions of how software can help us think. I'm not sure I will use either app, but they have given me a lot of food for thought.

Tuesday, August 25, 2020

Entity Explosion: bringing Wikidata to every website

A week ago Toby Hudson (@tobyhudson) released a very cool Chrome (and now Firefox) extension called Entity Explosion. If you install the extension, you get a little button you can press to find out what Wikidata knows about the entity on the web page you are looking at. The extension works on web sites whose URLs match identifiers in Wikidata. For example, here it is showing some details for an article in BioStor. The extension "knows" that this article is about the Polish arachnologist Wojciech Staręga.

But this is a tame example, see what fun Dario Taraborelli (@ReaderMeter) is having with Toby's extension:

There are some limitations. For instance, it requires that the web site URL matches the identifier, or more precisely the URL formatter for that identifier. In the case of BioStor the URL formatter is https://biostor.org/reference/$1, where $1 is the BioStor identifier stored by Wikidata (e.g., 261148). So, if you visit a BioStor reference page, the extension works as advertised.
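The matching the extension has to do can be sketched in a few lines: turn the formatter (a template containing $1) into a pattern, and see whether the current page's URL fits it. This is my own reconstruction of the logic, not the extension's actual code:

```python
import re

def match_formatter(url, formatter):
    """Return the identifier if url matches a Wikidata URL formatter
    (a template containing $1), otherwise None."""
    # Escape the literal parts of the formatter, then let $1 match anything
    pattern = re.escape(formatter).replace(re.escape("$1"), "(.+)")
    m = re.fullmatch(pattern, url)
    return m.group(1) if m else None

formatter = "https://biostor.org/reference/$1"
print(match_formatter("https://biostor.org/reference/261148", formatter))  # 261148
print(match_formatter("https://doi.org/10.1371/journal.pone.0133602", formatter))  # None
```

The second call shows the problem described below: a DOI URL doesn't match the BioStor formatter, so identifier-by-URL matching fails for redirect-based identifiers.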

However, identifiers that are redirects to other web sites, such as DOIs, aren't so lucky. A Wikidata item with a DOI (such as 10.1371/JOURNAL.PONE.0133602) corresponds to the URL https://doi.org/10.1371/JOURNAL.PONE.0133602, but if you click on that URL you eventually get taken to the publisher's web site, which isn't the original DOI URL (incidentally, this is exactly how DOIs are supposed to work).

So, it would be nice if Entity Explosion would also read the HTML for the web page and attempt to extract the DOI from that page (for example, from the <meta> tags that many publishers embed), which means it would work on even more web sites for academic articles.
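As a sketch of what that extraction might look like, here is a parser that pulls DOIs from the <meta> tags commonly used by publishers (citation_doi, dc.identifier, and similar; the exact set of tag names varies by site, so treat this list as an assumption):

```python
from html.parser import HTMLParser

class DOIMetaParser(HTMLParser):
    """Collect DOIs from <meta> tags such as citation_doi or dc.identifier."""
    DOI_META_NAMES = {"citation_doi", "dc.identifier", "prism.doi"}

    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if a.get("name", "").lower() in self.DOI_META_NAMES:
            doi = a.get("content", "").replace("https://doi.org/", "")
            if doi.lower().startswith("doi:"):
                doi = doi[4:]  # strip a "doi:" prefix if present
            if doi.startswith("10."):
                self.dois.append(doi)

html = ('<html><head><meta name="citation_doi" '
        'content="10.1371/journal.pone.0133602"></head></html>')
p = DOIMetaParser()
p.feed(html)
print(p.dois)
```

In practice a browser extension would run this against the live page's DOM rather than parsing raw HTML.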

Meantime, if you use Chrome or Firefox as your browser, grab a copy and discover just how much information Wikidata has to offer.

Workshop On Open Citations And Open Scholarly Metadata 2020 talk

I'm giving a short talk at the Workshop On Open Citations And Open Scholarly Metadata 2020, which will be held online on September 9th. In the talk I touch on citation patterns in the taxonomic literature, the recent Zootaxa impact factor story, and a few projects I'm working on. To create the presentation I played around with mmhmm, which (pretty obviously) I still need to get the hang of...

Anyway, video below:


Friday, August 21, 2020

Taxonomic concepts: a possible way forward

Reading the GitHub issue Define objective rules for taxon concept identity referred to by Markus Döring in a comment on a previous post, I'm once again struck by the unholy mess generated by any discussion of "taxonomic concepts". The sense of déjà vu is overwhelming. What drives me to distraction is how little of this seems to be directed at solving actual problems that biologists have, which are typically things like "what does this random name that I've come across refer to?" and "will you please stop changing the damn names!?".

One thing that's also struck me is the importance of stable identifiers for species, that is, identifiers that are stable even in the face of name changes. If you have that, then you can talk about changes in classification, such as moving a species from one genus to another.


In the diagram above, the species being moved from Sasia to Verreauxia has the same identifier ("afrpic1") regardless of what genus it belongs to. This enables us to easily determine the differences between classifications (and then link those changes to the evidence supporting the change). I find it interesting that projects that manage large classifications, such as eBird and the Reptile database use stable species identifiers (either externally or internally). If you are going to deal with classifications that change over time you need stable identifiers.

So I'm beginning to think that perhaps the single most useful thing we could do as a taxonomic database community is to mint stable identifiers for each unique, original species name. These could be human readable, for example the species epithet plus author name plus year, suitably cleaned up (e.g., all lower case). So, our species could be "sapiens-linnaeus-1758". This sort of identifier is inspired by the notion of uninomial nomenclature:
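A minimal sketch of how such an identifier could be minted (the cleaning rules here - accent folding, dropping punctuation, lower-casing - are my own assumptions about what "suitably cleaned up" might mean):

```python
import re
import unicodedata

def species_identifier(epithet, author, year):
    """Build a human-readable, stable identifier for an original species
    name: epithet + author + year, lower-cased and cleaned up."""
    def clean(s):
        # Fold accents (e.g. Kröber -> Krober) and drop punctuation
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
        return re.sub(r"[^A-Za-z0-9]+", "", s).lower()
    return "-".join([clean(epithet), clean(author), str(year)])

print(species_identifier("sapiens", "Linnaeus", 1758))   # sapiens-linnaeus-1758
print(species_identifier("albimanis", "Kröber", 1914))   # albimanis-krober-1914
```

A real scheme would also need rules for disambiguating the (rare) cases where two original names share epithet, author, and year.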

If the uninomial system is not accepted, or until it is, I see no hope of ever arriving at a really stable nomenclature. - Hubbs (1930)

For more reading, see for example

  • Cantino, D. P., Bryant, H. N., Queiroz, K. D., Donoghue, M. J., Eriksson, T., Hillis, D. M., & Lee, M. S. Y. (1999). Species Names in Phylogenetic Nomenclature. Systematic Biology, 48(4), 790–807. doi:10.1080/106351599260012
  • Hubbs, C. L. (1930). Scientific names in zoology. Science, 71(1838), 317–319. doi:10.1126/science.71.1838.317
  • Lanham, U. (1965). Uninominal Nomenclature. Systematic Zoology, 14(2), 144. doi:10.2307/2411739
  • Michener, C. D. (1963). Some Future Developments in Taxonomy. Systematic Zoology, 12(4), 151. doi:10.2307/2411757
  • Michener, C. D. (1964). The Possible Use of Uninominal Nomenclature to Increase the Stability of Names in Biology. Systematic Zoology, 13(4), 182. doi:10.2307/2411777

Just to be clear, I'm NOT advocating replacing binomial names with uninomial names (the references above are just to remind me about the topic), but approaches to developing uninomial names could be used to create simple, human-friendly identifiers. Oh, and hat tip to Geoff Read for the comment on an earlier post of mine that probably planted the seed that started me down this track.

So, imagine going to a web site and, with the uninomial identifier, being able to get the list of every variation on that name, including species names in different genera (in other words, all the objective or homotypic synonyms of that name).

OK, nice, but what about taxa? Well, the second thing I'd like to get is every (significant) use of that name, coupled with a reference (i.e., a "usage"). These would include cases where the name is regarded as a synonym of another name. Given that each usage is dated (by the reference), we then have a timestamped record of the interpretation of taxa referred to by that name. Technically, what I envisage is that we are tracking nomenclatural types: for a given species name we return every usage that refers to a taxon that includes the type specimen of that name.

We could imagine doing something trivial such as putting "/n/" before the identifier to retrieve all name variations, and "/t/" to retrieve all usages. One could also have a suffix for a timestamp (e.g., "what was the state of play for this name in 1960?").
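A toy resolver for this hypothetical scheme might look like the following (the identifier, dates, and data are invented for illustration; only the /n/ and /t/ convention comes from the text above):

```python
# Hypothetical "/n/" (name variants) and "/t/" (dated usages) resolver.
# All identifiers and dates below are invented for illustration.

NAMES = {
    "africana-gray-1845": ["Sasia africana", "Verreauxia africana"],
}

USAGES = {
    "africana-gray-1845": [
        (1845, "Sasia africana"),       # original combination (invented date)
        (1855, "Verreauxia africana"),  # moved to Verreauxia (invented date)
    ],
}

def resolve(path, as_of=None):
    """Resolve /n/<id> to name variants, /t/<id> to timestamped usages,
    optionally restricted to usages up to the year `as_of`."""
    prefix, _, ident = path.strip("/").partition("/")
    if prefix == "n":
        return NAMES[ident]
    if prefix == "t":
        usages = USAGES[ident]
        if as_of is not None:
            usages = [u for u in usages if u[0] <= as_of]
        return usages
    raise ValueError(f"unknown prefix: {prefix}")

print(resolve("/n/africana-gray-1845"))
print(resolve("/t/africana-gray-1845", as_of=1850))  # state of play in 1850
```

The as_of filter is the timestamp suffix idea: the same identifier answers "what did we call this thing in 1850?" without any identifier churn.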

It seems that something like this would help cut through a lot of the noise around taxa. By itself, a list of names and references doesn't specify everything you might want to know about a taxon, but I suspect that some of the things taxonomists ask for (e.g., every circumscription, every set of "defining" characters, every pairwise relationship between every variation on a taxon's interpretation) are both unrealistic and probably not terribly useful.

For example, circumscriptions (defining a taxon by the set of things it includes) are often mentioned in discussions of taxon concepts, but in reality (at the species level) how many explicit circumscriptions do we have in the taxonomic literature? I'd argue that the circumscriptions we do have are the ones being generated by modern databases such as GBIF, iNaturalist, BOLD, and GenBank. These explicitly link specimens, photos, or sequences to a taxon (defined locally within that database, e.g., by an integer), and in some cases are testable, e.g., BLAST a sequence to see if it falls in the same set of sequences. These databases have their own identifiers and notions of what comprises a taxon (e.g., based on community editing, automated clustering, etc.).

This approach of simple identifiers that link multiple name variations would support the name-based matching that is at the heart of linking records in different databases (despite the wailing that names ≠ taxa, this is fundamentally how we match things across databases). The availability of timestamped usages would enable us to view a classification at a given point in time.

This needs to be fleshed out more, and I really want to explore the idea of edit scripts (or patch files) for comparing taxonomic classifications, and how we can use them to document the evidence for taxonomic changes. More to come...

Wednesday, August 19, 2020

Taxonomic concepts continued: All change at ALA and AFD

Continuing my struggles with taxa (see Taxonomic concepts continued: iNaturalist) I now turn to the Atlas of Living Australia (ALA) and the Australian Faunal Directory (AFD), which have perhaps the most fluid taxon identifiers ever. In 2018 I downloaded data from ALA and AFD and used it to create a knowledge graph ("Ozymandias", see GBIF Challenge Entry: Ozymandias), with a web interface for exploring that graph.

One thing I discovered is that the taxon identifiers used by ALA change... a lot. It almost feels that every time I revisit Ozymandias and compare it to the ALA, things have changed. For example, take the fly species Acupalpa albimanis (Kröber, 1914), which you can see in Ozymandias.

The "" part of the Ozymandias URL is the URL for this species in the Atlas of Living Australia. Well, it was at the time I built Ozymandias (2018). Now (19 August 2020), it is If you put into your web browser, you will get redirected to the new URL (under the hood you get a HTTP 302 response with the "Location" header value set to

So, seemingly, our notion of what Acupalpa albimanis (Kröber, 1914) is has changed since 2018. In fact, ALA itself is out of date, because if you rewrite the ALA URL as the corresponding AFD URL you get taken to the AFD site, which is the source of the data for ALA. And ALA says that identifier is itself old news, having been replaced (as of 28 February 2020) by yet another identifier.

So the identifier for this taxon keeps changing. For the life of me I can't figure out why. If I compare the CSV file I downloaded on 23 February 2018 with one I downloaded today (19 August 2020), the only differences are the time stamps for NAME_LAST_UPDATE and TAXON_LAST_UPDATE, and the UUIDs used for the NAME_GUID, TAXON_GUID, PARENT_TAXON_GUID, and CONCEPT_GUID fields. So, the administrivia has changed. Other than that, the data in the two files for this fly are identical, so why the change in identifiers? It seems bizarre to create identifiers that regularly change (and then have to maintain the associated redirects to try and keep the old identifiers functional) when the data itself seems unchanged.
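The comparison I did can be sketched as a diff that separates administrivia (GUIDs, timestamps) from substantive fields. The field names come from the AFD CSV mentioned above; the record values are invented placeholders:

```python
# Diff two versions of a record, ignoring "administrivia" fields
# (GUIDs and timestamps, as named in the AFD CSV download).

ADMINISTRIVIA = {
    "NAME_LAST_UPDATE", "TAXON_LAST_UPDATE",
    "NAME_GUID", "TAXON_GUID", "PARENT_TAXON_GUID", "CONCEPT_GUID",
}

def substantive_changes(old, new):
    """Return the fields (outside ADMINISTRIVIA) whose values differ."""
    fields = old.keys() | new.keys()
    return {f for f in fields
            if f not in ADMINISTRIVIA and old.get(f) != new.get(f)}

# Invented values standing in for the 2018 and 2020 downloads
old = {"NAME": "Acupalpa albimanis", "TAXON_GUID": "uuid-2018",
       "NAME_LAST_UPDATE": "2018-02-23"}
new = {"NAME": "Acupalpa albimanis", "TAXON_GUID": "uuid-2020",
       "NAME_LAST_UPDATE": "2020-08-19"}

print(substantive_changes(old, new))  # set() - only the administrivia changed
```

An empty result is exactly the puzzling case: the identifiers changed even though nothing substantive did.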

Now, AFD isn't the only project to regularly change identifiers; an older version of the Catalogue of Life also did this, although it wasn't always clear to me how - see Catalogue of Life and LSIDs: a catalogue of fail and the paper "Identifying and relating biological concepts in the Catalogue of Life".

As I disappear further down this rabbit hole (why oh why did I start doing this?) I'm beginning to suspect that part of the issue here is versioning (argh!), and that what we are seeing is the various ways people are trying to cope with it: which identifiers should change, how and when, and which part of the information about a taxon a given identifier should point to (the name, a use of that name, the underlying concept, the database record that tracks a taxon, the latest version of that record, etc.). Given that different databases tackle these issues differently (and not always consistently), the notion that we can easily map these identifiers to each other, and to third-party identifiers such as those in Wikidata, seems a bit, um, optimistic...

Monday, August 17, 2020

Taxonomic concepts continued: iNaturalist

Following on from my earlier post ("Taxonomic concepts for dummies"), Beckett Sterner commented:

Maybe one productive use case would be to look at what it would take for wikidata to handle taxa (=name+concept) in a way that didn't lose relevant taxonomic information when receiving content from platforms like iNaturalist that has a fairly sophisticated strategy

iNaturalist is interesting, but I'm not convinced that it is internally consistent. As a quick rule of thumb, I'm looking for patterns of how name changes relate to taxon identifier changes. For example, we can have cases where a database retains the same taxon identifier (the columns) even if names (rows) change (such as eBird or Avibase). For example, if we move a species from one genus to the next, the name changes but (arguably) the taxon itself doesn't (it's still the same set of organisms, just moved to a different part of our classification or, if you like, tagged differently).


Then we can have cases like this, where the name (row) is the same but the taxon changes. This might be where we split a taxon in two, and one remaining part retains the original name. So you could argue that the taxon has changed (i.e., in composition) even if the name hasn't.


Now, many taxonomic databases seem to do something different: every time the name changes we get a new identifier, even if the taxon bearing that name hasn't changed (i.e., it has the same set of organisms as before), so we get both a name change and a taxon identifier change:


Because we have databases that use different approaches to how they use name and taxon identifiers, life can get complicated, especially for projects such as Wikidata that try and synthesise information across all of these databases.

I haven't looked in detail at iNaturalist, but I have found cases of both patterns.




In some cases iNaturalist will change a taxon identifier even if the name remains the same. For example, the "Thrush-like Schiffornis" Schiffornis turdina has been split into five taxa, one of which bears the same scientific name (Schiffornis turdina). Given that the composition of Schiffornis turdina has changed, there is an argument to be made that its taxon identifier should change, which is what iNaturalist does.

So, it looks like iNaturalist is using its "taxa" identifiers to identify taxa, but then we have cases such as the transfer of the African piculet Sasia africana to Verreauxia africana, or the transfer of Heraclides rumiko to Papilio rumiko. In both cases nothing has changed about those species, yet the identifiers have changed (for an example of a "true" taxon identifier, note that NCBI has the same identifier for Sasia africana/Verreauxia africana and for Heraclides rumiko/Papilio rumiko).

My sense is that part of the problem is that we are trying to overload identifiers, which in and of themselves don't tell us much. For example, some might argue that any change in an entity requires a change in the entity's identifier, because the underlying thing has changed. Others might argue that such a change risks making things harder to find (for example, how do we now connect the earlier version of a thing with the newer version, given that the identifier has changed?). In the case of taxonomy, I think we could avoid some of this grief if we acknowledged that names and identifiers have their limits, and decoupled them from trying to track changes in the things they point to. Rather, what we could really do with is a timestamped versioning system where we can ask "OK, in 1960, what did this genus look like?". Likewise, when looking at a system such as Wikidata, we shouldn't expect a complete view of every taxonomic opinion ever held. But we could aim for a current "snapshot".


I asked a question about this on the iNaturalist forum, and it appears that iNaturalist treats taxon ids essentially as rows in a database: each time a name gets added you get a new integer id, and if you decide that a taxon has fundamentally changed (e.g., a species is split into two or more taxa) then you add that new taxon, generating a new integer id.

Monday, August 10, 2020

Australian museums and ALA

The following is a guest post by Bob Mesibov.

The Atlas of Living Australia (ALA) adds "assertions" to Darwin Core occurrence records. "Assertions" are indicators of particular data errors, omissions and questionable entries, such as "Coordinates are transposed", "Geodetic datum assumed WGS84" and "First [day] of the century".

Today (8 August 2020) I looked at the assertions attached to ALA records for non-fossil animals in the Australian State museums. There were 62 occurrence record collections from the seven museums (I lumped the two Tasmanian museums together), with 45 different assertions. I then calculated assertions per record for each collection. The worst performer was the Queensland Museum Porifera collection (3.84 assertions/record), and tied for best were the Museums Victoria Herpetology and Ichthyology collections (1.09 assertions/record).
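The calculation is simple enough to sketch; the raw counts below are invented, chosen only so the ratios reproduce the figures quoted above:

```python
# Recompute the "assertions per record" measure for each collection.
# The counts are invented placeholders that yield the ratios in the text.

collections = {
    # collection: (total assertions, total records)
    "Queensland Museum Porifera": (38400, 10000),
    "Museums Victoria Herpetology": (10900, 10000),
}

per_record = {name: assertions / records
              for name, (assertions, records) in collections.items()}

worst = max(per_record, key=per_record.get)
best = min(per_record, key=per_record.get)
print(f"worst: {worst} ({per_record[worst]:.2f} assertions/record)")
print(f"best: {best} ({per_record[best]:.2f} assertions/record)")
```
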

I also aggregated museum collections to build a kind of league table by State:

The clear winner is Museums Victoria.

But how well do ALA's assertions measure the quality of data records? Not all that well, actually.

  • The tests used to make the assertions generate false positives and false negatives, although at a low rate
  • The tests aren't independent, so that a single data error can "smear" across several assertions
  • The tests ignore errors and omissions in DwC fields that many data users would consider important

ALA's assertions also have a strong spatial/geographical bias, with 23 of the 45 assertions in my sample dataset saying something about the "where" of the occurrence. Looking just at those 23 "where" assertions, the museums league table again shows Museums Victoria ahead, this time by a wide margin:

ALA is currently working on better ways for users to filter out records with selected assertions, in what's misleadingly called a "Data Quality Project". The title is misleading because the overall quality of ALA's holdings doesn't improve one bit. Getting data providers to fix their data issues would be a more productive way to upgrade data quality, but I haven't seen any evidence that Australian museums (for example) pay much attention to ALA's assertions. (There are no or minimal changes in assertion totals between data updates.)

It's been pointed out to me that museum and herbarium records amount to only a small fraction of ALA's ca 90 million records, and that citizen scientists are growing the stock of occurrence records far faster than institutions do. True, and those citizen science records are often of excellent quality. However, citizen science observations are strongly biased towards widespread and common species. ALA's records for just six common Australian birds (5,072,599 as of 8 August 2020) outnumber all the museum animal records I looked at in the assertion analysis (4,669,508).

In my humble view, the longer ALA's institutional data providers put off fixing their mistakes, the less valuable ALA becomes as a bridge between biodiversity informatics and biodiversity science.