Friday, June 21, 2019

Messages from Melbourne: Towards linking all the things

I'm doing some work with Nicole Kearney (@nicolekearney) at the Melbourne Museum on the general theme of "linking all the things". It's the end of the first full week we've had, so here's a quick update of what we've been up to.

Brainstorming

The things we want to do are being captured as a project on GitHub. This is where we come up with ideas, comment on them, and then try to figure out which ones can be done. So far there are three things we've made a serious start on.

Unpaywall

Unpaywall is a project by Impactstory. It is sort of a Sci-Hub without the legal issues (for the record, I think Alexandra Elbakyan's work on Sci-Hub is nothing short of heroic). Unpaywall scans open access archives for legal, freely available versions of articles and makes them easy to find. If you have Firefox or Chrome you can get a plugin that lights up if the paywall article you're looking at has a free version somewhere else.
Nicole has long wanted the BHL to provide data to Unpaywall, because BHL has open access versions of many papers relevant to taxonomy and biodiversity more broadly defined. After a bit of digging we figured out that Unpaywall didn't have access to BHL's data, so we've set about fixing that. We've got the data harvested, but we're still waiting for Unpaywall to process that data. So, for now, we're still waiting for the little green light to appear on pages such as this one: https://doi.org/10.1080/00222932208632640.


Adding taxonomic literature to Atlas of Living Australia

Part of "linking all the things" is making the taxonomic literature a first class citizen of biodiversity databases. It is frankly embarrassing to see how much better the scientific literature is handled by projects such as Wikipedia than scientific databases such as GBIF and the ALA. We've decided to try and do something about this by showing how easily the literature could be embedded into the existing ALA web site. Nicole crafted a mockup of the ALA names tab, and I wrote some code to make it "live". For example, if you click on this link you will see a list of publications for Pauropsalta herveyensis Owen & Moulds, 2016. Note that we have DOIs and links to BHL where ever possible (and we use Unpaywall's API to flag whether an article with a DOI is freely available). We want this literature (the primary evidence for what we know about a species) to be visible and accessible. The demo is powered by my Ozymandias project, but we hope to work out a mechanism for delivering the mapping between taxa and literature to ALA (and, indeed, anyone else) as a dataset.
Because Ozymandias only has data for animals, we've had to exclude plants from this demo. I'm frantically trying to figure out how to work with data in Australia's plant name databases to resolve this. I'm discovering that, never mind having more than one name for the same species, taxonomists also delight in having many different ways of representing taxonomic information in their databases. So, plants will be a challenge.


Mapping taxonomists to ORCID and Wikidata

One reason for adding literature to taxonomic databases is to make the work of taxonomists more visible. One way to do this is to move beyond using only "dumb strings" as people's names and link taxonomists to their ORCIDs and to entries in Wikidata (this is something I touched on in Ozymandias, and David Shorthouse is doing on an epic scale in Bloodhound). We're playing with the idea of being able to generate a list of active taxonomists in Australia, linked to their identifiers and publications, solely based on querying Wikidata. The first step is to try and automate the initial mapping between taxonomists and Wikidata as much as possible; we've only just started looking at this.
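To give a flavour of the kind of Wikidata query we have in mind, here is a sketch in Python: people with an ORCID (P496) and a ZooBank author ID (P2006, used here only as a rough proxy for "taxonomist") whose employer (P108) is in Australia (Q408). The property and item numbers are my reading of Wikidata, not a tested query, and would need checking.

```python
# Sketch: list candidate Australian taxonomists from Wikidata.
# P496 = ORCID iD, P2006 = ZooBank author ID (assumed proxy for "taxonomist"),
# P108 = employer, P17 = country, Q408 = Australia. Double-check these IDs.
import requests

QUERY = """
SELECT ?person ?personLabel ?orcid WHERE {
  ?person wdt:P496 ?orcid .        # ORCID iD
  ?person wdt:P2006 ?zoobank .     # ZooBank author ID
  ?person wdt:P108 ?employer .     # employer
  ?employer wdt:P17 wd:Q408 .      # employer located in Australia
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "taxonomist-mapping-sketch/0.1"},
    timeout=60,
)
for row in r.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], row["orcid"]["value"])
```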

Summary

It is early days, and we're still identifying things we could work on. As always, there are so many things which could be done; we're hoping we can make progress on at least some of these in the next few weeks.

Tuesday, May 28, 2019

Frankenplace, geospatial search, and discrete global grid systems


Quick note on Frankenplace, a cool search tool that displays, as a heatmap, the geographic distribution of documents that match the user's query. Details of how the tool works are given in:
B. Adams, G. McKenzie, and M. Gahegan (2015) Frankenplace: Interactive Thematic Mapping for Ad Hoc Exploratory Searching. 24th International World Wide Web Conference (WWW 2015), http://dx.doi.org/10.1145/2736277.2741137
At the heart of the method is a discrete global grid that divides the world up into small areas of the same size. Topics are then geographically indexed, so that when a user searches, say, for "ebola", areas relevant to that query are highlighted (in this case, areas in Africa). It's a striking example of querying data geographically, and one which I hope to explore further in the context of BHL and BioStor.
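To make the indexing idea concrete, here is a toy sketch in Python. A real discrete global grid system (e.g. H3 or rHEALPix) uses equal-area hierarchical cells; the simple one-degree latitude/longitude grid and the documents below are made up purely to keep the example self-contained.

```python
# Toy illustration of the Frankenplace-style idea: assign documents to grid
# cells, then count matching documents per cell to get the values behind a
# heatmap. NOTE: a 1-degree lat/lon grid is NOT equal-area; a real DGGS is.
from collections import Counter, defaultdict

CELL_SIZE = 1.0  # degrees

def cell_id(lat, lon):
    """Return a coarse grid cell identifier for a point."""
    return (int(lat // CELL_SIZE), int(lon // CELL_SIZE))

# hypothetical documents with the places they mention
documents = [
    {"id": "doc1", "text": "ebola outbreak in Kikwit", "lat": -5.04, "lon": 18.82},
    {"id": "doc2", "text": "ebola surveillance, Gulu", "lat": 2.77, "lon": 32.31},
    {"id": "doc3", "text": "influenza in Melbourne", "lat": -37.81, "lon": 144.96},
]

# geographic index: cell -> documents mentioning places in that cell
index = defaultdict(list)
for doc in documents:
    index[cell_id(doc["lat"], doc["lon"])].append(doc)

def heatmap(query):
    """Count matching documents per grid cell."""
    counts = Counter()
    for cell, docs in index.items():
        counts[cell] = sum(query in d["text"] for d in docs)
    return {cell: n for cell, n in counts.items() if n}

print(heatmap("ebola"))  # only the central African cells light up
```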


Update

I've put some notes on various discrete global grid systems in a repo on GitHub: RDF and discrete global grid systems.

Ozymandias meets Wikipedia, with notes on natural language generation

I've tweaked Ozymandias to now include short natural language summaries (snippets) for various taxa. This makes the output a little more friendly and informative. For example, here's a snippet from the page on Cephalodesmius, a dung beetle that makes its own dung.


These snippets come from Wikipedia, well actually, from the DBpedia project. Behind the scenes I have a script that takes the GBIF taxon id for an ALA taxon (if it exists), queries Wikidata for the corresponding taxon and any associated identifiers of interest, and if there's a link to an English language Wikipedia page I do a quick SPARQL query to DBpedia to retrieve the snippet of text. At some point all of this could be sped up by adding the relevant data to the triple store and doing the query locally, but for now it works well enough.
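The DBpedia part of this is simple enough to sketch. The version below is a minimal illustration (not my actual script): given the English Wikipedia page title for a taxon, fetch the abstract via DBpedia's public SPARQL endpoint using the standard dbo:abstract property. Caching and error handling are omitted.

```python
# Minimal sketch: fetch the English abstract for a Wikipedia page from DBpedia.
import requests

def dbpedia_snippet(wikipedia_title):
    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/%s> dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
    """ % wikipedia_title.replace(" ", "_")
    r = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    rows = r.json()["results"]["bindings"]
    return rows[0]["abstract"]["value"] if rows else None

print(dbpedia_snippet("Cephalodesmius"))
```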

Of course, many snippets are little more than stubs, e.g. the snippet for another dung beetle genus Diorygopyx doesn't tell us much more than we can get from the information already displayed.



But having a text summary still seems worthwhile, which raises the question of what to do when Wikipedia doesn't know anything about a taxon. Obviously, we could start editing Wikipedia to flesh out its content, but that will take a while to filter into databases such as DBpedia. Another approach is to generate snippets from the triple store itself, in other words, generate natural language summaries from structured data. For example, we could generate summaries such as "Diorygopyx is a genus of Scarabaeidae or scarab beetles in the superfamily Scarabaeoidea" fairly easily from knowing the taxonomic hierarchy and a few common names. But we could also do more. In browsing Ozymandias I'm struck at times by how much our knowledge of one taxon depends on a major piece of taxonomic work, often done some time ago. For example, The Australian Crickets (Orthoptera: Gryllidae). Academy of Natural Sciences of Philadelphia, Monograph 22 by Otte and Alexander (1983) is a monumental taxonomic monograph, and many Australian cricket genera had most (or all) of their species described in that work. Imagine having a snippet that mentioned that (e.g., "Most species in this genus were described in 1983; no species have been discovered since."). That would give the reader some useful information, and perhaps also prompt them to ask "so, why haven't any more species been described?".
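A minimal sketch of the kind of template-based generation I have in mind is below. The facts passed in would come from the triple store; the values used here are made up for illustration, not computed from Ozymandias.

```python
# Sketch: generate a short snippet from structured facts about a genus.
def snippet(genus, family, family_common, years_described):
    text = f"{genus} is a genus of {family} ({family_common})."
    if years_described:
        latest = max(years_described)
        if len(set(years_described)) == 1:
            text += (f" All of its species were described in {latest};"
                     " no species have been described since.")
        else:
            text += f" The most recently described species date from {latest}."
    return text

# illustrative values only
print(snippet("Diorygopyx", "Scarabaeidae", "scarab beetles", [1983, 1983, 1983]))
```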

I think there's scope here to make the output from triple stores (and other databases) more approachable using natural language generation. This is obviously a big area, and there are some very sophisticated approaches for generating natural-sounding language (think chatbots), perhaps the most striking example of which is Google Duplex.


But we don't need quite this level of sophistication; something using much simpler techniques (e.g., nalgene-js) would probably be enough. Armed with some basic facts from the triple store, and some simple templates, we could probably generate some useful text snippets for many taxa in Ozymandias, and indeed for other entities. For example, David Shorthouse is outputting simple text summaries of the contribution of taxonomists to specimen collection and identification:

Imagine extending this to take into account publications, geography, etc. I think there's lots of scope here for moving beyond just displaying data and trying to generate human-friendly summaries of data.

Wednesday, April 10, 2019

Ozymandias: A biodiversity knowledge graph published in PeerJ

My paper "Ozymandias: A biodiversity knowledge graph" has been published in PeerJ https://doi.org/10.7717/peerj.6739
The paper describes my entry in GBIF's 2018 Ebbe Nielsen Challenge, which you can explore here. I tweeted about its publication yesterday, and got some interesting responses (and lots of retweets, thanks to everyone for those).

Carl Boettiger (@cboettig) asked where the triples were, as did Kingsley Uyi Idehen (@kidehen). Doh! This is one thing I should have done as part of the paper. I've uploaded the triples to Zenodo, you can find them here: https://doi.org/10.5281/zenodo.2634326.

Donat Agosti (@myrmoteras) complained that my knowledge graph ignored a lot of available information, which is true in the sense that I restricted it to a core of people, publications, taxa, and taxonomic names. The Plazi project that Donat champions extracts, where possible, lots of detail from individual publications, including figures, text blocks corresponding to taxonomic treatments, and in some cases geographic and specimen information. I have included some of this information in Ozymandias, specifically figures for papers where they are available. For example, Figure 10 from the paper "Australian Assassins, Part I: A review of the Assassin Spiders (Araneae, Archaeidae) of mid-eastern Australia":



This figure illustrates Austrarchaea nodosa (Forster, 1956), and Plazi has a treatment of that taxon: http://treatment.plazi.org/id/1072F469192A5BA015A1AA70A36E2C92. This treatment comprises a series of text blocks extracted from the paper, so there is not a great deal I can do with this unless I want to parse the text (e.g., for geographical coordinates and specimen codes). So yes, there is RDF (see http://treatment.plazi.org/GgServer/rdf/1072F469192A5BA015A1AA70A36E2C92) but it adds little to the existing knowledge graph.
To be fair, some treatments in Plazi are a lot richer, for example http://tb.plazi.org/id/A94487F7E15AFFA5FF682EE9FEB45F2C, which has references, geographical coordinates, and more. What would be useful would be an easy way to explore Plazi, for example if the RDF was dumped into a triple store where we could explore it in more detail. I hope to look into this in the coming weeks.
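As a first step, pulling a single treatment's RDF into a local triple store is straightforward with rdflib. The sketch below uses the GgServer RDF URL mentioned above and assumes it returns RDF/XML; it just reports how many triples there are and which predicates are used, which gives a quick feel for how rich a treatment is.

```python
# Sketch: load one Plazi treatment's RDF into a local rdflib graph.
from rdflib import Graph

g = Graph()
# assumed to return RDF/XML (as linked from the treatment page above)
g.parse("http://treatment.plazi.org/GgServer/rdf/1072F469192A5BA015A1AA70A36E2C92",
        format="xml")

print(len(g), "triples")

# list the predicates used, to see what kinds of statements the treatment makes
for predicate in sorted(set(g.predicates())):
    print(predicate)
```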

Sunday, March 24, 2019

Where is the damned collection? Wikidata, GrBio, and a global list of all natural history collections

One of the things the biodiversity informatics community has struggled to do is come up with a list of all natural history collections (Taylor, 2016). Most recently GrBio attempted to do this, and appealed for community help to curate the list (Schindel et al., 2016), but that help did not materialise, and at the time of writing GrBio is moribund. GBIF has obtained GrBio's data and is now hosting it (GBIF provides new home for the Global Registry of Scientific Collections), but the problem of curation remains. Furthermore, GrBio is not the only contender for a global list of collections; the NCBI has its own list (Sharma et al. 2018).
When Schindel et al. (2016) came out I suggested that a better way forward was to use Wikidata as the data store for basic information on collections (see GRBio: A Call for Community Curation - what community?). David Shorthouse's work on linking individual researchers to the specimens they have collected (Bloodhound) has motivated me to revisit this. One of the things David wants to do is link the work of individuals to the institutions that host the specimens they work on. For individuals the identifier of choice is ORCID, and many researchers' ORCID profiles have identifiers for the institution they work at. For example, my ORCID profile https://orcid.org/0000-0002-7101-9767 states that I work at Glasgow University, which has the Ringgold number 3526. What is missing here is a way to go from the institutional identifiers we use for specimens (e.g., abbreviations like "MCZ" for the Museum of Comparative Zoology) to identifiers such as Ringgold that organisations such as ORCID use.
It turns out that many institutions with Ringgold numbers (and other identifiers, such as Global Research Identifier Database or GRID) are in Wikidata. So, if we could map museum codes (institutionCode in Darwin Core terms) to Wikidata, then we can close the loop and have common institutional identifiers for both where individuals are employed and the institutions that house the collections that they work on.
Hence, it seems to me that using Wikidata as the basis for a global catalogue of institutions housing natural history collections makes a lot of sense. Many of these institutions are already in Wikidata, and the community of Wikidata editors dwarfs the number of people likely to edit a domain-specific database (as evidenced by the failure of GrBio's call for community engagement with its database). Furthermore, Wikidata has a sophisticated editing interface, with support for multiple languages and for recording the provenance of individual data entries.
To get a sense of what is already in Wikidata I've built a small tool called Where is the damned collection?. It's a simple wrapper around a SPARQL query to Wikidata, and the display is modelled on the "knowledge panel" that you often see to the right of Google's search results. If you type in the acronym for an institution (i.e., the institutionCode), the tool attempts to find it.
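To show the general shape of such a lookup, here is an illustrative query in Python. This is my reconstruction of the kind of SPARQL the tool wraps, not the tool's actual query: it tries the Index Herbariorum code property (P5858) and falls back to matching English aliases against the code.

```python
# Illustrative Wikidata lookup for an institutionCode (not the tool's code).
import requests

def find_collection(code):
    query = """
    SELECT DISTINCT ?item ?itemLabel WHERE {
      { ?item wdt:P5858 "%s" . }            # Index Herbariorum code
      UNION
      { ?item skos:altLabel "%s"@en . }     # English alias matching the code
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """ % (code, code)
    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "collection-lookup-sketch/0.1"},
        timeout=60,
    )
    return [(b["item"]["value"], b["itemLabel"]["value"])
            for b in r.json()["results"]["bindings"]]

print(find_collection("MEL"))  # e.g. the National Herbarium of Victoria
```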





There are some challenges to using Wikidata for this purpose. To date there has been little in the way of a coordinated effort to add natural history collections. There are 121 institutions that have an Index Herbariorum code (property P5858) associated with their Wikidata records; you can see a list here. There is also a property for Biodiversity Repository ID, which supports the syntax GrBio used to create unique institutionCodes even when multiple institutions used the same code. This has had limited uptake so far, being a property on only five Wikidata items.
However, there are more museums and herbaria in Wikidata. For example, if we search for herbaria, natural history museums, and zoological museums we find 387 institutions. This query is made harder than it should be because there are multiple types that can be used to describe a natural history collection, and the query only uses three of them.
Another source of entries in Wikidata is Wikispecies. There are two pages (Repositories (A–M) and Repositories (N–Z)) that list pages corresponding to different institutionCodes. I have harvested these and found 1298 of them in Wikidata. This indicates that a good fraction of the 7,097 institutions listed by GrBio already have a presence in Wikidata. At the same time, it rather complicates the task of adding institutions to Wikidata, as we need to figure out how many of these stub-like entries based on institutionCodes represent institutions already in Wikidata. There are also lists of herbaria (https://en.wikipedia.org/wiki/List_of_herbaria) and natural history museums on Wikipedia that can be harvested and cross-referenced with Wikidata.
So, there is a formidable data cleaning task ahead, but I think it's worth contemplating. One thing I find particularly interesting is the links to social media profiles, such as Twitter, Facebook, and Instagram. This gives another perspective on these institutions - in a sense this is digitisation of experiences that one can have at those institutions. These profiles are also often a good source of data (such as geographic location and address). And they give a foretaste of what I think we can do. Imagine the entire digital footprint of a museum or herbarium being linked together in one place: the social media profiles, the digitised collections, the publications for which it is a publisher, its membership in BHL, JSTOR, GBIF, and other initiatives, and so on. We could start to get a better sense of the impact of digitisation - broadly defined - on each institution.
In summary, I think the role of Wikidata in cataloguing collections is worth exploring, and there's a discussion of this idea going on at the GBIF Community Forum. It will be interesting to see where this discussion goes. Meantime, I'm messing about with some scripts to see how much of the data mapping and cleaning process can be automated, so that tools like Where is the damned collection? become more useful.
References
  • Schindel, D., Miller, S., Trizna, M., Graham, E., & Crane, A. (2016). The Global Registry of Biodiversity Repositories: A Call for Community Curation. Biodiversity Data Journal, 4, e10293. doi:10.3897/bdj.4.e10293
  • Sharma, S., Ciufo, S., Starchenko, E., Darji, D., Chlumsky, L., Karsch-Mizrachi, I., & Schoch, C. L. (2018). The NCBI BioCollections Database. Database, 2018. doi:10.1093/database/bay006
  • Taylor, M. A. (2016). “Where is the damned collection?” Charles Davies Sherborn’s listing of named natural science collections and its successors. ZooKeys, 550, 83–106. doi:10.3897/zookeys.550.10073

Wednesday, December 05, 2018

Biodiversity data v2

Glasgow University's Institute of Biodiversity, Animal Health & Comparative Medicine, where I'm based, hosts Naturally Speaking, featuring "cutting edge research and ecology banter". Apparently, what I do falls into that category, so Episode 65 features my work, specifically my entry for the 2018 GBIF Challenge (Ozymandias). The episode page has a wonderful illustration by Eleni Christoforou which captures the idea of linking things together very nicely. Making the podcast was great fun, thanks to the hosts Kirsty McWhinnie and Taya Forde. Let's face it, what academic doesn't love to talk about their own work, given half a chance? I confess I'm happy to talk about my work, but I haven't had the courage yet to listen to the podcast.

Ozymandias: A biodiversity knowledge graph available as a preprint on bioRxiv

I've written up my entry for the 2018 GBIF Challenge ("Ozymandias") and posted a preprint on bioRxiv (https://www.biorxiv.org/content/early/2018/12/04/485854). The DOI is https://doi.org/10.1101/485854 which, last time I checked, still needs to be registered.

The abstract appears below. I'll let the preprint sit there for a little while before I summon the enthusiasm to revisit it, tidy it up, and submit it for publication.

Enormous quantities of biodiversity data are being made available online, but much of this data remains isolated in its own silos. One approach to breaking these silos is to map local, often database-specific identifiers to shared global identifiers. This mapping can then be used to construct a knowledge graph, where entities such as taxa, publications, people, places, specimens, sequences, and institutions are all part of a single, shared knowledge space. Motivated by the 2018 GBIF Ebbe Nielsen Challenge I explore the feasibility of constructing a "biodiversity knowledge graph" for the Australian fauna. The steps involved in constructing the graph are described, and examples of its application are discussed. A web interface to the knowledge graph (called "Ozymandias") is available at https://ozymandias-demo.herokuapp.com.

Thursday, November 15, 2018

Geocoding genomic databases using GBIF

I've put a short note up on bioRxiv about ways to geocode nucleotide sequences in databases such as GenBank. The preprint is "Geocoding genomic databases using GBIF" https://doi.org/10.1101/469650.

It briefly discusses using GBIF as a gazetteer (see https://lyrical-money.glitch.me for a demo) to geocode sequences, as well as other approaches such as specimen matching (see also Nicky Nicolson's cool work "Specimens as Research Objects: Reconciliation across Distributed Repositories to Enable Metadata Propagation" https://doi.org/10.6084/m9.figshare.7327325.v1).
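A very rough sketch of the "GBIF as gazetteer" idea is below: take a locality string from a sequence record, run it through GBIF's occurrence full-text search, and treat the coordinates of matching, georeferenced occurrences as candidate point locations. The parameters are from the public GBIF API; a real implementation would need to score and filter the candidates, which this does not do.

```python
# Sketch: use GBIF occurrence search as a gazetteer for a locality string.
import requests

def candidate_coordinates(locality, limit=20):
    r = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"q": locality, "hasCoordinate": "true", "limit": limit},
        timeout=60,
    )
    points = []
    for occ in r.json().get("results", []):
        lat, lon = occ.get("decimalLatitude"), occ.get("decimalLongitude")
        if lat is not None and lon is not None:
            points.append((lat, lon))
    return points

print(candidate_coordinates("Kakadu National Park"))
```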

Hope to revisit this topic at some point, for now this preprint is a bit of a placeholder to remind me of what needs to be done.

Thursday, October 25, 2018

Taxonomic publications as patch files and the notion of taxonomic concepts

There's a slow-burning discussion on taxonomic concepts on Github that I am half participating in. As seems inevitable in any discussion of taxonomy, there's a lot of floundering about given that there's lots of jargon - much of it used in different ways by different people - and people are coming at the problem from different perspectives.

In one sense, taxonomy is pretty straightforward. We have taxonomic names (labels), we have taxa (sets) that we apply those labels to, and a classification (typically a set of nested sets, i.e., a tree) of those taxa. So, if we download, say, GenBank, or GBIF, or BOLD we can pretty easily model names (e.g., a list of strings), the taxonomic tree (e.g., a parent-child hierarchy), and we have a straightforward definition of the terminal taxa (leaves) of the tree: they comprise the specimens and observations (GBIF), or sequences (GenBank and BOLD) assigned to that taxon (i.e., for each specimen or sequence we have a pointer to the taxon to which it belongs).
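In code this model really is as simple as it sounds. The sketch below is a bare-bones version: names as strings, the classification as a child-to-parent map, and records that each point to a taxon. The identifiers and names are made up for illustration.

```python
# Bare-bones taxonomy data model: names, a parent-child tree, and records
# (specimens/sequences) that point to the taxon they are assigned to.
names = {"t1": "Animalia", "t2": "Insecta", "t3": "Cicadidae",
         "t4": "Pauropsalta", "t5": "Pauropsalta herveyensis"}

parent = {"t2": "t1", "t3": "t2", "t4": "t3", "t5": "t4"}  # child -> parent

records = [
    {"id": "specimen-001", "taxon": "t5"},
    {"id": "sequence-ABC123", "taxon": "t5"},
]

def lineage(taxon_id):
    """Walk up the parent-child hierarchy to the root."""
    path = [taxon_id]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return [names[t] for t in path]

print(lineage("t5"))
# ['Pauropsalta herveyensis', 'Pauropsalta', 'Cicadidae', 'Insecta', 'Animalia']
```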

Given this, one response to the taxonomic concept discussion is to simply ignore it as irrelevant, and we can demonstrably do a lot of science without it. I suspect most people dealing with GBIF and GenBank data aren't aware of the taxonomic concept issue. Which raises the question: why the ongoing discussion about concepts?

Perhaps the fundamental issue is that taxonomic classification changes over time, and hence the interpretation of a taxon can change over time. In other words, the problem is one of versioning. Once again, the simplest strategy to deal with this is simply to use the latest version. In much the same way that most of us probably just read the latest version of a Wikipedia page, and many of us are happy to have our phone apps update automatically, I suspect most are happy to just grab the latest version and do some, you know, science.

I think taxonomic concepts really become relevant when we are aggregating data from sources where the data may not be current. In other words, where data is associated with a particular taxonomic name and the interpretation of that name has changed since the last time the data was curated. If the relationships of a taxon or specimen can be computed on the fly, e.g. if the data is a DNA barcode, then this issue is less relevant because we can simply re-cluster the sequences and discover where the specimen with that sequence belongs in a new classification. But for many specimens we don't have sufficient information to do this computation (this is one reason DNA barcodes are so useful, everything needed to determine a barcode's relationship is contained in the sequence itself).

To make this concrete, consider the genus Brookesia in GBIF (GBIF:2449310).

[Screenshot: GBIF occurrence map for Brookesia]

According to Wikipedia, Brookesia is endemic to Madagascar, so why does it appear on the African mainland? There are two records from Africa: Brookesia brookesia ionidesi collected in 1957 and Brookesia temporalis collected in 1926. Both represent taxa that were in the genus Brookesia at one point, but are now in different genera. So our notion of Brookesia has changed over time, but curation of these records has yet to catch up with that.

So, what would be ideal would be if we had a timestamped series of classifications, so that we could go back in time and see what a given taxon meant at a given time, and then go forward to see the status of that taxon today. Having such a timestamped series is not a trivial task; indeed, it may only be available for well-studied groups. Birds are one such group, where each year eBird updates the current bird classification based on taxonomic activity over the previous year. As part of the GitHub discussion I posted a visual "diff" between two bird classifications:

[Figure: visual diff between two bird classifications]

You can see the complete diff here, and the blog post Visualising the difference between two taxonomic classifications for details on the method. The illustration above shows the movement of one species from Sasia to Verreauxia.

So, given two classifications we can compute the difference between them, and represent that difference as an "edit script", or operations to convert one tree into another. These edits are essentially what taxonomists do when they revise a group: they do things such as move species from one genus to another, merge some taxa, sink others into synonymy, and so on. So, taxonomy is essentially creating a series of edit files ("patches") to a classification. At a recent workshop in Ottawa Karen Cranston pointed out that the Open Tree of Life has been accumulating amendments to their classification and that these are essentially patch files.

Hence, we could have a markup language for taxonomic work that described that work in terms of edit operations that can then be automatically applied to an existing classification. We could imagine encoding all the bird taxonomy for a year in this way, applying those patches to the previous years' tree, and out pops the new classification. The classification becomes an evolving document under version control (think GitHub for trees). Of course, we'd need something to detect whether two different papers were proposing incompatible changes, but that's essentially a tree compatibility problem.
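Here is a sketch of what such a machine-readable patch might look like: a list of edit operations applied to a child-to-parent classification. The operation names, the data format, and the species names are invented for illustration, using the Sasia/Verreauxia move from the bird diff above as the example.

```python
# Sketch: a taxonomic revision expressed as a "patch" of edit operations.
classification = {            # child -> parent
    "Sasia": "Picumninae",
    "Verreauxia": "Picumninae",
    "Sasia abnormis": "Sasia",
    "Sasia africana": "Sasia",
}

patch = [
    # the example from the bird diff: one species moves from Sasia to Verreauxia
    {"op": "move", "taxon": "Sasia africana", "to": "Verreauxia",
     "new_name": "Verreauxia africana"},
]

def apply_patch(tree, ops):
    tree = dict(tree)
    for op in ops:
        if op["op"] == "move":
            tree.pop(op["taxon"])                      # detach from old genus
            tree[op.get("new_name", op["taxon"])] = op["to"]
        # other operations (merge, synonymise, ...) would be handled similarly
    return tree

print(apply_patch(classification, patch))
```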

One way to store version information would be to use time-based versioned graphs. Essentially, we start with each node in the classification tree having a start date (e.g., 2017) and an open-ended end date. A taxonomic work post 2017 that, say, moved a species from one genus to another would set the end date for the parent-child link between genus and species, and create a new timestamped node linking the species to its new genus. To generate the 2018 classification we simply extract all links in the tree whose date range includes 2018 (which means the old generic assignment for the species is not included). This approach gives us a mechanism for automating the updating of a classification, as well as time-based versioning.
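A minimal sketch of this versioning scheme: each parent-child link carries a start year and an optional end year, a revision closes the old link and opens a new one, and the classification "as of" a year is simply the set of links whose interval covers that year. The dates and names below are invented.

```python
# Sketch: time-based versioning of parent-child links in a classification.
links = [
    # (child, parent, start_year, end_year); None means still current
    ("Sasia africana", "Sasia", 2000, 2017),
    ("Verreauxia africana", "Verreauxia", 2017, None),
    ("Sasia abnormis", "Sasia", 2000, None),
]

def classification_as_of(year):
    """Return the set of parent-child links valid in a given year."""
    return {(child, parent)
            for child, parent, start, end in links
            if start <= year and (end is None or year < end)}

print(classification_as_of(2016))  # includes the old placement in Sasia
print(classification_as_of(2018))  # includes the new placement in Verreauxia
```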

I think something along these lines would create something useful, and focus the taxonomic discussion on solving a specific problem.