Thursday, July 05, 2018

GBIF at 1 billion - what's next?

GBIF has reached 1 billion occurrences which is, of course, something to celebrate:

An achievement on this scale represents a lot of work by many people over many years: years spent developing simple standards for sharing data, agreeing that sharing is a good thing in the first place, building tools to enable sharing, and creating a place to aggregate all that shared data (GBIF).

So, I asked a question:

My point is not to do this:

Rather it is to encourage a discussion about what happens when we have large amounts of biodiversity data. Is it the case that as we add data we simply enable more of the same kind of science, only better (e.g., more data for species distribution modelling), or do we reach a point where new things become possible?


To give a concrete example, consider iNaturalist. This started out as a Master's project to collect photos of organisms on Flickr. As you add more images you get better coverage of biodiversity, but you still have essentially a bunch of pictures. But once you have LOTS of pictures, and those are labelled with species names, you reach the point where it is possible to do something much more exciting: automatic species identification. To illustrate, I recently took the photos below:

Large2 Large

Note the reddish tubular growths on the leaves. I asked iNaturalist to identify these photos and within a few seconds it came back with Eriophyes tiliae, the Red Nail Gall Mite. This feels like magic. It doesn't rely on complicated analysis of the image (as many earlier efforts at automated identification have done); it simply "knows" that images that look like this are typically of the galls of this mite because it has seen many such images before. (Another example of the impact of big data is Google Translate, initially based on parsing lots of examples of the same text in multiple languages.)

The "1 billion" number is not, by itself, meaningful. Rather, I hope that while we're popping the champagne and celebrating a welcome, if somewhat arbitrary, milestone, someone, somewhere is thinking about whether biodiversity data on this scale enables something new.

Do I have answers? Not really, but here's one fairly small-scale example. One of the big challenges facing GBIF is getting georeferenced data. We spend a lot of time using a variety of tools and databases to convert text descriptions of collection localities into latitude and longitude. Many of these descriptions include phrases such as "5 mi NW of" and so we've developed parsers to attempt to make sense of these. All of these phrases and the corresponding latitude and longitude coordinates have ended up in GBIF. Now, this raises the possibility that after a point, pretty much any locality phrase will be in GBIF, so a way to georeference a locality is simply to search GBIF for that locality and use the associated latitude and longitude. GBIF itself becomes the single best tool to georeference specimen data. To explore this idea I've built a simple tool on Glitch https://lyrical-money.glitch.me that takes a locality description and geocodes it using GBIF.

Screenshot 2018 07 05 07 32

You paste in a locality string and it attempts to find that on a map based on data in GBIF. This could be automated, so you could imagine being able to georeference whole collections as part of the process of uploading the data to GBIF. Yes, the devil is in the details, and we'd need ways to flag errors or doubtful records, but the scale of GBIF starts to open up possibilities like this.
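To make the idea concrete, here is a minimal sketch of "georeference by lookup". It is not the code behind the Glitch tool. It assumes the GBIF occurrence search endpoint accepts `locality` and `hasCoordinate` parameters and that each result carries `decimalLatitude` and `decimalLongitude`; the function names and the choice of the median as the consensus point are my own assumptions.

```python
from statistics import median
from urllib.parse import urlencode

# Assumed endpoint for GBIF occurrence search.
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(locality, limit=50):
    """Build a search URL for georeferenced records matching a locality string."""
    params = {"locality": locality, "hasCoordinate": "true", "limit": limit}
    return GBIF_SEARCH + "?" + urlencode(params)


def consensus_point(records):
    """Return the median (lat, lon) of the records that have coordinates, or None.

    The median is a crude guard against a few badly georeferenced records;
    it ignores the antimeridian and says nothing about uncertainty.
    """
    points = [
        (r["decimalLatitude"], r["decimalLongitude"])
        for r in records
        if r.get("decimalLatitude") is not None
        and r.get("decimalLongitude") is not None
    ]
    if not points:
        return None
    lats, lons = zip(*points)
    return (median(lats), median(lons))
```

In use, you would fetch `build_search_url("5 mi NW of ...")`, pass the `results` list from the JSON response to `consensus_point`, and flag the locality as doubtful if the matching records are widely scattered.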

So, my question is, "what's next?".

Wednesday, June 13, 2018

Liberating links between datasets using lightweight data publishing: an example using IPNI and the taxonomic literature

I've written a short paper entitled "Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature" (phew) and put a preprint on bioRxiv (https://doi.org/10.1101/343996) while I figure out where to publish it. Here's the abstract:

Constructing a biodiversity knowledge graph will require making millions of cross links between diversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a "datasette", a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names, and identifiers for the taxonomic articles that published those names.

In some ways the paper is simply a record of me trying to figure out how to publish a project that I've been working on for several years, namely linking names from BioNames. The preprint discusses various options, before settling on "datasettes", which is a nice method developed by Simon Willison (@simonw) to wrap up simple databases with their own web server and query API and make them accessible on the web. These can run on a local machine, or be packaged up as a Docker container, which is what I've done. You can play with the database here: https://ipni.sloppy.zone. If this link is offline, then you can grab the container here https://hub.docker.com/r/rdmpage/ipni/ and run it yourself. If, like me, you're new to Docker, then I recommend grabbing a copy of Kitematic.

The datasette interface is simple but gives you lots of freedom to explore the data.

Fig1

For example, you have the ability to query the data using SQL, e.g.:

Fig2
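For readers without the screenshot, here is a rough sketch of the kind of SQL query this makes possible, run against an in-memory SQLite database. The table name, column names, and rows are placeholders I made up for illustration; they are not the actual schema or contents of the IPNI datasette.

```python
import sqlite3

# Hypothetical schema: one row per plant name, with an optional link to the
# DOI of the article that published it. All values are made-up placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id TEXT PRIMARY KEY, fullname TEXT, doi TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?, ?)",
    [
        ("urn:lsid:example.org:names:1", "Aus bus", "10.9999/example.1"),
        ("urn:lsid:example.org:names:2", "Aus cus", "10.9999/example.1"),
        ("urn:lsid:example.org:names:3", "Xus yus", None),
    ],
)


def names_per_article(connection):
    """Count how many names each article published, most productive first."""
    sql = """
        SELECT doi, COUNT(*) AS n
        FROM names
        WHERE doi IS NOT NULL
        GROUP BY doi
        ORDER BY n DESC
    """
    return connection.execute(sql).fetchall()
```

A datasette lets you type the same kind of `SELECT` into a web form instead of downloading the database first.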

One advantage of this approach is that the data is more accessible. I could just dump the database somewhere but then you'd have to download a large file and figure out how to query it. This way, you can play with it straight away. It also means people can make use of it before I make up my mind how best to package it (for example, as part of a larger database of eukaryote names). This is one of the main motivations behind the paper: how to avoid the trap of spending years cleaning and augmenting data and not making it available to others because of the overhead of building a web site around the data. I may look at liberating some other datasets using this approach.

Monday, June 04, 2018

Towards a biodiversity token: Bitcoin, FinTech, and a radical suggestion for the GBIF Challenge

First off, let me say that what follows is a lot of arm waving to try and obscure how little I understand what I'm talking about. I'm going to sketch out what I think is a "radical" idea for a GBIF Challenge entry.

TL;DR GBIF should issue its own cryptocurrency and use that to fund the development of the GBIF network by charging for downloading cleaned, processed data (original provider data remains free). People can buy subscriptions to get access to data, and/or purchase GBIF currency as a contribution or investment. Proceeds from the purchase of cleaned data are divided between GBIF (to fund the portal), the data providers (to reward them for making data available) and the GBIF nodes in countries included in the geographic coverage of the data (to help them build their biodiversity infrastructure). The challenge entry would involve modelling this idea and conducting simulations to test its efficacy.

The motivation for this idea comes from several sources:

1. GBIF is (under-)funded by direct contributions from governments, hence each year it essentially "begs" for money. Several rich countries (such as the United Kingdom) struggle to pay the fairly paltry sums involved. Part of the problem is that they are taking something of demonstrable value (money) and giving it to an organisation (GBIF) which has no demonstrable financial value. Hence the argument for funding GBIF is basically "it's the right thing to do". This is not really a tenable or sustainable model.

2. Many web sites provide information for "free" in that the visitor doesn't pay any money. Instead the visitor views ads and, whether they are aware of it or not, is handing over large amounts of data about themselves and their behaviour (think the recent scandal involving Facebook).

3. Some people are rebelling against the "free with ads" model by seeking other ways to fund the web. For example, the Brave web browser enables you to buy BATS (Basic Attention Tokens, based on Ethereum). You can choose to send BATS to web sites that you visit (and hence find valuable). Those sites don't need to harvest your data or bombard you with ads to receive an income.

4. Cryptocurrency is being widely explored as a way to raise funding for new ventures. Many of these are tech-based, but there are some interesting developments in conservation and climate change, such as Veridium which offsets carbon emissions. There are links between efforts like Veridium and carbon offset programmes such as the Rimba Raya Biodiversity Reserve, so you can go from cryptocurrency to trees.

5. The rather ugly, somewhat patronising furore that erupted when Rwanda decided that the best way to increase its foreign currency earnings (as a step towards ultimately freeing itself from dependency on development aid) was to sign a sponsorship deal with Arsenal football club.

Now, imagine a situation where GBIF has a cryptocurrency token (e.g., the "GBIF coin"). Anyone, whether a country, an organisation, or an individual, can buy GBIF coins. If you want to download GBIF data, you will need to pay in GBIF coins, either per-download or via a monthly subscription. The proceeds from each download are split in a way that supports the GBIF network as a whole. For example, imagine GBIF itself gets 30% (like Apple's App Store). The remaining 70% gets split between (a) the data providers and (b) the GBIF nodes in countries included in the data download. For example, almost all the data on a country such as Rwanda does not come from Rwanda itself, but from other countries. You want to reward anyone who makes data available, but you also want to support the development of a biodiversity data infrastructure in Rwanda (or any other country), so part of the proceeds go to the GBIF node in Rwanda.
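As a starting point for the kind of modelling I have in mind, here is a toy sketch of the split for a single download. Only the 30% share for GBIF comes from the text above. The even division of the remainder between providers and nodes, and the allocation within each group in proportion to record counts, are assumptions chosen for illustration, as are the function and parameter names.

```python
def pro_rata(pot, counts):
    """Share a pot between parties in proportion to their record counts."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {party: pot * n / total for party, n in counts.items()}


def split_proceeds(amount, provider_records, node_records,
                   gbif_share=0.30, provider_share_of_rest=0.5):
    """Divide the price of one download between GBIF, providers, and nodes.

    provider_records: records in the download per data provider.
    node_records: records in the download per country node.
    provider_share_of_rest is a made-up parameter; the post does not say
    how the remaining 70% should be divided.
    """
    gbif = amount * gbif_share
    rest = amount - gbif
    provider_pot = rest * provider_share_of_rest
    node_pot = rest - provider_pot
    return {
        "gbif": gbif,
        "providers": pro_rata(provider_pot, provider_records),
        "nodes": pro_rata(node_pot, node_records),
    }
```

For a download costing 100 coins with three records from one provider, one from another, and all four from a single country, GBIF would get 30, the providers 26.25 and 8.75, and the country's node 35. A simulation would run this over real download logs.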

Now, an immediate issue (apart from the merits or otherwise of blockchains and cryptocurrency) is that I'm advocating charging for access to data, which seems antithetical to open access. To be clear, I think open access is crucial. I'm suggesting that we distinguish between two classes of data. The first is the data as it is provided to GBIF. That is almost always open data under a CC0 license, and that remains free. But if you want it for free it is served as it is received. In other words, for free access to data GBIF is essentially a dumb repository (like, say, Dryad). The data is there, you can search the metadata for each dataset, so essentially you get something like the current dataset search.

The other thing GBIF does is that it processes the data, cleaning it, reconciling names and locations, and indexing it, so that if you want to search for a given species, GBIF summarises the data across all the datasets and (often) presents you with a better result than if you'd downloaded all the original data and simply merged it together yourself. This is a valuable service, and it's one of the reasons why GBIF costs money to run. So imagine that we do something like this:

  1. It is free to browse GBIF as a person and explore the data
  2. It is free to download the raw data provided by any data publisher.
  3. It costs to download cleaned data that corresponds to a specific query, e.g. all records for a particular taxon, geographic area, etc.
  4. Payment for access to cleaned data is via the GBIF coin.
  5. The cost is small, on the scale of buying a music track or subscribing to Spotify.

Now, I don't expect GBIF to embrace this idea anytime soon. By nature it's a conservative, risk-averse organisation. But I think something like this idea deserves serious attention, ideally from people with a much better understanding of the issues than my own "I saw this on Twitter therefore it must be cool" level. One way to move forward would be to model how such a system would work, based for example on data on web site visits and data downloads on the current GBIF portal. I suspect models could be built to give some idea of whether such an approach would be financially viable. It occurs to me that something like this would make a great GBIF Challenge entry, particularly as it gives a license for thinking the unthinkable with no risk to GBIF itself.

Wednesday, May 09, 2018

World Taxonomists and Systematists via ORCID

David Shorthouse (@dpsspiders) makes some very cool things, and his latest project World Taxonomists & Systematists is a great example of using automation to assemble a list of the world's taxonomists and systematists. The project uses ORCID. As many researchers will know, ORCID's goal is to have every researcher uniquely identified by an ORCID id (mine is https://orcid.org/0000-0002-7101-9767) that is linked to all a researcher's academic output, including papers, datasets, and more. So David has been querying ORCID for keywords such as taxonomist, taxonomy, nomenclature, or systematics to locate taxonomists and add them to his list. For more detail see his post on the ORCID blog.

Using ORCIDs to help taxonomists gain visibility is an idea that's been around for a little while. I blogged about it in Possible project: #itaxonomist, combining taxonomic names, DOIs, and ORCID to measure taxonomic impact, at which time David was already doing another cool piece of work linking collectors to ORCIDs and their collecting effort, see e.g. data for Terry A. Wheeler.

There are, of course, a bunch of obstacles to this approach. Many taxonomists lack ORCIDs, and I keep coming across "private" ORCIDs where taxonomists have an ORCID id but don't make their profile public, which makes it hard to identify them as taxonomists. Typically I discover these profiles via metadata in CrossRef, which will list the ORCID id for any authors that have them and have made them known to the publisher of their paper.
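As a sketch of that last step, here is how you might pull ORCID ids out of CrossRef metadata for a paper. It assumes the shape I understand the CrossRef REST API to return for a work (an `author` list under `message`, with an optional `ORCID` URL per author); the function name is hypothetical, and any record you try it on is up to you.

```python
def orcids_from_crossref(work):
    """Map author names to bare ORCID ids for authors that have one.

    `work` is the parsed JSON for a single CrossRef work, either the full
    response (with a "message" wrapper) or just the message itself.
    """
    message = work.get("message", work)
    found = {}
    for author in message.get("author", []):
        orcid_url = author.get("ORCID")
        if not orcid_url:
            continue  # most authors have no ORCID in the metadata
        name = " ".join(
            part for part in (author.get("given"), author.get("family")) if part
        )
        # CrossRef stores the id as a URL; keep only the final path segment.
        found[name] = orcid_url.rstrip("/").rsplit("/", 1)[-1]
    return found
```

Authors without an `ORCID` field are simply skipped, which is exactly the gap described above: the absence of an id in the metadata doesn't mean the person lacks one.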

ORCID ids are only available for people who are alive (or alive recently enough to have registered), so there will be many taxonomists who will never have an ORCID id. In this case, it may be Wikidata to the rescue:

Many taxonomists have Wikidata entries because they are either notable enough to be in Wikipedia, or they have an entry in Wikispecies, and people like Andy Mabbett (@pigsonthewing) have been diligently ensuring these people have Wikidata entries. There's huge scope for making use of these links.

Meanwhile, if you are a taxonomist or a systematist and you don't have an ORCID, get yourself one at ORCID, claim your papers, and you should appear shortly in the World Taxonomists & Systematists list.

2018 GBIF Ebbe Nielsen Challenge now open

Last year I finished my four-year stint as Chair of the GBIF Science Committee. During that time, partly as a result of my urging, GBIF launched an annual "GBIF Ebbe Nielsen Challenge", and I'm pleased that this year GBIF is continuing to run the challenge. In 2015 and 2016 the challenge received some great entries.

Last year's challenge (GBIF Challenge 2017: Liberating species records from open data repositories for scientific discovery and reuse) didn't attract quite the same degree of attention, and GBIF quietly didn't make an award. I think part of the problem was that there's a fine balance between having a wide open challenge which attracts all sorts of interesting entries, some a little off the wall (my favourite was GBIF data converted to 3D plastic prints for physical data visualisation), versus a specific topic which might yield one or more tools that could, say, be integrated into the GBIF portal. But if you make it too narrow then you run the risk of getting fewer entries, which is what happened in 2017. Ironically, since the 2017 challenge I've come across work that would have made a great entry, such as a thesis by Ivelize Rocha Bernardo, Promoting Interoperability of Biodiversity Spreadsheets via Purpose Recognition; see also Bernardo, I. R., Borges, M., Baranauskas, M. C. C., & Santanchè, A. (2015). Interpretation of Construction Patterns for Biodiversity Spreadsheets. Lecture Notes in Business Information Processing, 397–414. doi:10.1007/978-3-319-22348-3_22.

This year the topic is pretty open:

The 2018 Challenge will award €34,000 for advancements in open science that feature tools and techniques that improve the access, utility or quality of GBIF-mediated data. Under this open-ended call, challenge submissions may build on existing tools and features, such as the GBIF API, Integrated Publishing Toolkit, data validator, relative species occurrence tool, among others—or develop new applications, methods, workflows or analyses.

Lots of scope, and since I'm no longer part of the GBIF Science Committee it's tempting to think about taking part. The judging criteria are pretty tough and results-oriented:

Winning entries will demonstrably extend and increase the usefulness, openness and visibility of GBIF-mediated data for identified stakeholder groups. Each submission is expected to demonstrate advantages for at least three of the following groups: researchers, policymakers, educators, students and citizen scientists.

So, maybe less scope for off-the-wall stuff, but an incentive to clearly articulate why a submission matters.

The actual submission process is, sadly, rather more opaque than in previous years where it was run in the open on Devpost where you can still see previous submissions (e.g., those for 2015). Devpost has lots of great features but isn't cheap, so the decision is understandable. Maybe some participants will keep the rest of the community informed via, say, Twitter, or perhaps people will keep things close to their chest. In any event, I hope the 2018 challenge inspires people to think about doing something both cool and useful with biodiversity data. Oh, and did I mention that a total of €34,000 in prizes is up for grabs? Deadline for submission is 5 September 2018.

iSpecies meets Lifemap

It's been a little quiet on this blog as I've been teaching, and spending a lot of time data wrangling and trying to get my head around "data lakes" and "triple stores". So there are a few things to catch up on, and a few side projects to report on.

I continue to play with iSpecies, which is a simple mashup of biodiversity data sources. When I last blogged about iSpecies I'd added TreeBASE as a source (iSpecies meets TreeBASE). iSpecies also queries Open Tree of Life, and I've always wanted a better way of displaying the phylogenetic context of a species or genus. TreeBASE is great for a detailed, data-driven view, but doesn't put the taxon in a larger context, nor does the simple visualisation I developed for Open Tree of Life.

A nice large-scale tree visualisation is Lifemap (see De Vienne, D. M. (2016). Lifemap: Exploring the Entire Tree of Life. PLOS Biology, 14(12), e2001624. doi:10.1371/journal.pbio.2001624), and it dawned on me that since Lifemap uses the same toolkit (leaflet.js) that I use to display a map of GBIF records, I could easily add it to iSpecies. After looking at the Lifemap HTML I figured out the API call I need to pan the map to a given taxon using Open Tree of Life taxon identifiers, and voilà, I now have a global tree of life that shows where the query taxon fits in that tree.

Here's a screenshot of a search for Podocarpus showing the first 300 records from GBIF, and the position of Podocarpus in the tree of life. The tree is interactive so you can zoom and pan just like the GBIF map.

Screenshot 2018 05 09 16 58 00

Here's another one for the genus Timonius:

Screenshot 2018 05 09 17 58 32

Very much still at the "quick and dirty" stage, but I continue to marvel at how much information can be assembled "on the fly" from a few sources, and how much richer this seems than what biodiversity informatics projects offer. There's a huge amount of information that is simply being missed or under-utilised in this area.

Wednesday, January 24, 2018

Guest post: The Not problem

The following is a guest post by Bob Mesibov.

Nico Franz and Beckett Sterner created a stir last year with a preprint in bioRxiv about expert validation (or the lack of it) in the "backbone" classifications used by aggregators. The final version of the paper was published this month in the OUP journal Database (doi:10.1093/database/bax100).

To see what effect "backbone" taxonomies are having on aggregated occurrence records, I've recently been auditing datasets from GBIF and the Atlas of Living Australia. The results are remarkable, and I'll be submitting a write-up of the audits for formal publication shortly. Here I'd like to share the fascinating case of the genus Not Chan, 2016.

I found this genus in GBIF. A Darwin Core record uploaded by the New Zealand Arthropod Collection (NZAC02015964) had the string "not identified on slide" in the scientificName field, and no other taxonomic information.

GBIF processed this record and matched it to the genus Not Chan, 2016, which is noted as "doubtful" and "incertae sedis".

There are 949 other records of this genus around the world, carefully mapped by GBIF. The occurrences come from NZAC and nine other datasets. The full scientific names and their numbers of GBIF records are:

Number  Name
2       Not argostemma
14      not Buellia
1       not found, check spelling
1       Not given (see specimen note) bucculenta
1       Not given (see specimen note) ortoni
1       Not given (see specimen note) ptychophora
1       Not given (see specimen note) subpalliata
1       not identified on slide
1       not indentified
1       Not known not known
1       Not known sp.
1       not Lecania
4       Not listed
873     Not naturalised in SA sp.
18      Not payena
5       not Punctelia
18      not used
6       Not used capricornia Pleijel & Rouse, 2000
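Every one of these strings begins with the word "not", so even a very crude pre-check run before name matching would have caught them. The sketch below is a heuristic of my own for illustration, not a description of how GBIF's name matching works, and the function name is made up.

```python
import re

# Matches strings whose first word is "not" (any case). The word boundary
# means genuine names that merely start with those letters are left alone.
PLACEHOLDER = re.compile(r"^\s*not\b", re.IGNORECASE)


def looks_like_placeholder(scientific_name):
    """Return True if the scientificName value looks like a data-entry note."""
    return bool(PLACEHOLDER.match(scientific_name))
```

A record flagged this way could be routed to "unmatched" rather than matched to a genus, at the cost of also flagging any genuinely named genus "Not", should one ever be published.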

GBIF cites this article on barnacles as the source of the genus, although the name should really be Not Chan et al., 2016. A careful reading of this article left me baffled, since the authors nowhere use "not" as a scientific name.

Next I checked the Catalogue of Life. Did CoL list this genus, and did CoL attribute it to Chan? No, but "Not assigned" appears 479 times among the names of suprageneric taxa, and the December 2017 CoL checklist includes the infraspecies "Not diogenes rectmanus Lanchester,1902" as a synonym.

The Encyclopedia of Life also has "Not" pages, but these have in turn been aggregated on the "EOL pages that don't represent real taxa" page, and under the listing for the "Not assigned36" page someone has written:

This page contains a bunch of nodes from the EOL staff Scratchpad. NB someone should go in and clean up that classification.

"Someone should go in and clean up that classification" is also the GBIF approach to its "backbone" taxonomy, although they think of that as "we would like the biodiversity informatics community and expert taxonomists to point out where we've messed up". Franz and Sterner (2018) have also called for collaboration, but in the direction of allowing for multiple taxonomic schemes and differing identifications in aggregated biodiversity data. Technically, that would be tricky. Maybe the challenge of setting up taxonomic concept graphs will attract brilliant developers to GBIF and other aggregators.

Meanwhile, Not Chan, 2016 will endure and aggregated biodiversity records will retain their vast assortment of invalid data items, character encoding failures, incorrect formatting, duplications and truncated data items. In a post last November on the GitHub CoL+ pages I wrote:

Being old and cynical, I can speculate that in the time spent arguing the "politics" of aggregation in recent years, a competent digital librarian or data scientist would have fixed all the CoL issues and would be halfway through GBIF's. But neither of those aggregators employ digital librarians or data scientists, and I'm guessing that CoL+ won't employ one, either.

Monday, December 11, 2017

Towards a digital natural history museum


These notes are the result of a few events I've been involved in over the last couple of months, including TDWG 2017 in Ottawa, a thesis defence in Paris, and a meeting of the Science Advisory Board of the Natural History Museum in London. For my own benefit if no one else's, I want to sketch out some (less than coherent) ideas for how a natural history museum becomes truly digital.

Background

The digital world poses several challenges for a museum. In terms of volume of biodiversity data, museums are already well behind two major trends, observations from citizen science and genomics. The majority of records in GBIF are observations, and genomics databases are growing exponentially, through older initiatives such as barcoding, and newer methods such as environmental genomics. While natural history collections contain an estimated 10^9 specimens or "lots" [1], no more than a few percent of that has been digitised, and it is not obvious that massive progress in increasing this percentage will be made any time soon.

Furthermore, for citizen science and genomics it is not only the amount of data but the network effects that are possible with that data that make it so powerful. Network effects arise when the value of something increases as more people use it (the classic example is the telephone network). In the case of citizen science, apart from the obvious social network that can form around a particular taxon (e.g., birds), there are network effects from having a large number of identified observations. iNaturalist is using machine learning to suggest identifications of photos taken by members. The more members join and add photos and identifications, the more reliable the machine identifications become, which in turn makes it more desirable to join the network. Genomics data also shows network effects. In effect, a DNA sequence is useless without other sequences to compare it with (it is no accident that the paper describing BLAST is one of the most highly cited in biology). The more sequences a genomics database has the more useful it is.

For museums the explosion of citizen science and genomics raises the question "is there any museum data that can show similar network effects?" We should also ask whether there will be an order of magnitude increase in digitisation of specimens in the near future. If not, then one could argue that museums are going to struggle to remain digitally relevant if they remain minority biodiversity data providers. Being part of organisations such as GBIF certainly helps, but GBIF doesn't (yet) offer much in the way of network effects.

Users

We could divide the users of museums into three distinct (but overlapping) communities. These are:

  1. Scientists
  2. Visitors
  3. Staff

Scientists make use of research and data generated by the museum. If the museum doesn't support science (both inside and outside the museum) then the rationale for the collections (and associated staff) evaporates. Hence, digitisation must support scientific research.

Visitors in this sense means both physical and online visitors. Online visitors will have a purely digital experience, but in-person visitors can have both physical and digital experiences.

In many ways the most neglected category is the museum staff. Perhaps the best way to make progress towards a digital museum is having the staff committed to that vision, and this means digitisation should wherever possible make their work easier. In many organisations going digital means a difficult transition period of digitising material, dealing with crappy software that makes their lives worse, and a lack of obvious tangible benefits (digitisation for digitisation's sake). Hence outcomes that deliver benefits to people doing actual work should be prioritised. This is another way of saying that museums need to operate as "platforms": the best way to ensure that external scientists will use the museum's digital services is if the research of the museum's own staff depends on those services.

Some things to do

For each idea I sketch a "vision", some ways to get there, and what I think the current reality is (and, let's be honest, what I expect it to still be like in 10 years' time).


Vision: Anyone with an image of an organism can get an answer to the question "what is this?"

Task: Image the collection in 2D and 3D. Computers can now "see", and can accomplish tasks such as identify species and traits (such as the presence of disease [2]) from images. This ability is based on machine learning from large numbers of images. The museum could contribute to this by imaging as many specimens as possible. For example, a library of butterfly photos could significantly increase the accuracy of identifications by tools such as iNaturalist. Creating 3D models of specimens could generate vast numbers of training images [3] to further improve the accuracy of identifications. The museum could aim to provide identifications for the majority of species likely to be encountered/photographed by its users and other citizen scientists.

Reality: Imaging is unlikely to be driven by identification and machine learning; the biggest use is to provide eye-catching images for museum publicity.

Who can help: iNaturalist has experience with machine learning. More and more research is appearing on image recognition, deep learning, and species identification.


Vision: Anyone with a DNA sequence can get an answer to the question "what is this?"

Task: DNA sequence the collection, focussing first on specimens that (a) have been identified and (b) represent taxonomic groups that are dominated by "dark taxa" in GenBank. Many sequences being added to GenBank are unidentified and hence unnamed. These will only become named (and hence potentially connected to more information) if we have sequences from identified material of those species (or close relatives). Often discussions of sequences focus on doing the type specimens. While this satisfies the desire to pin a name to a sequence in the most rigorous way, it doesn't focus on what users need - an answer to "what is this?" The number of identified specimens will far exceed the number of type specimens, and many types will not be easily sequenced. Sequencing identified specimens puts the greatest amount of museum-based information into sequence space. This will become even more relevant as citizen science starts to expand to include DNA sequences (e.g., using tools like MinION).

Reality: Lack of clarity over what taxa to prioritise, emphasis on type specimens, concerns over whether DNA barcoding is out of date compared to other techniques (ignoring importance of global standardisation as a way to make data maximally useful) will all contribute to a piecemeal approach.

Who can help: Explore initiatives such as the Planetary Biodiversity Mission.


Vision: A physical visitor to the museum has a digital experience deeply informed by the museum's knowledge

Task: The physical walls of the museum are not barriers separating displays from science but rather interfaces to that knowledge. Any specimen on display is linked to what we know about it. If there is a fossil on a wall, we can instantly see the drawings made of that specimen in various publications, 3D scans to interact with, information about the species, the people who did the work (whether historical figures or current staff), and external media (e.g., BBC programs).

Reality: Piecemeal, short-lived gimmicky experiments (such as virtual reality), no clear attempt to link to knowledge that visitors can learn from or create themselves. Augmented reality is arguably more interesting, but without connections to knowledge it is a gimmick.

Who could help: Many of the links between specimens, species, and people fall into the domain of Wikipedia and Wikidata, hence lots of opportunities for working with the GLAM Wiki community.


Vision: A museum researcher can access all published information about a species, specimen, or locality via a single web site.

Task: All books and journals in the museum library that are not available online should be digitised. This should focus on materials post 1923 as pre-1923 is being done by BHL. The initial goal is to provide its researchers with the best possible access to knowledge, the secondary goal is to open that up to the rest of the world. All digitised content should be available to researchers within the museum using a model similar to the HathiTrust, which manages content scanned by Google Books. The museum aggressively pursues permission to open as much of the digitised content up as it can, starting with its own books and journals. But it scans first, sorts out permissions later. For many uses, full access isn't necessarily needed, at least for discovery. For example, by indexing text for scientific names, specimen codes, and localities, researchers could quickly discover if a text is relevant, even if ultimately direct physical access is the only possibility for reading it.

Reality: Piecemeal digitisation hampered by the chilling effects of copyright, combined with limited resources means the bulk of our scientific knowledge is hard to access. A lack of ambition means incremental digitisation, with most taxonomic research remaining inaccessible, and new research constrained by needing access to legacy works in physical form.

Who could help: Consider models such as HathiTrust, work with BHL and publishers to open up more content, and with text mining researchers to help maximise use even for content that can't be opened up straight away.


Vision: The museum as a "connection machine" to augment knowledge

Task: While a museum can't compete in terms of digital volume, it can compete for richness and depth of linking. Given a user with a specimen, an image, a name, a place, how can the museum use its extensive knowledge base to augment that user's experience? By placing the thing in a broader context (based on links derived from image -> identity tools, sequence -> identity tools, names to entities, e.g., species, people, and places, and links between those entities) the museum can enhance our experience of that thing.

Reality: The goal of having everything linked together into a knowledge graph is often talked about, but generally fails to happen, partly because things rapidly descend into discussions about technology (most of which sucks), and squabbling over identifiers and vocabularies. There is also a lack of clear drivers, other than "wouldn't it be cool?". Hence expect regular calls to link things together (e.g., Let’s rise up to unite taxonomy and technology), demos and proof of concept tools, but little concrete progress.

Who can help: The Wikidata community, and initiatives such as Big Data Europe and BBC Things (some of these are no longer alive but are still useful to investigate). The BBC's defunct Wildlife Finder is an example of what can be achieved with fairly simple technology.

Summary

The fundamental challenge the museum faces is that it is analogue in an increasingly digital world. It cannot be, nor should it be, completely digital. For one thing it can't compete, for another its physical collection, physical space, and human expertise are all aspects that make a museum unique. But it needs to engage with visitors that are digitally literate, it needs to integrate with the burgeoning digital knowledge being generated by both citizens and scientists, and it needs to provide its own researchers with the best possible access to the museum's knowledge. Above all, it needs to have a clear vision of what "being digital" means.

References

1. Ariño, A. H. (2010). Approaches to estimating the universe of natural history collections data. Biodiversity Informatics, 7(2). https://doi.org/10.17161/bi.v7i2.3991

2. Ramcharan, A., Baranowski, K., McCloskey, P., Ahmed, B., Legg, J., & Hughes, D. P. (2017). Deep Learning for Image-Based Cassava Disease Detection. Frontiers in Plant Science, 8. https://doi.org/10.3389/fpls.2017.01852

3. Peng, X., Sun, B., Ali, K., & Saenko, K. (2014). Learning Deep Object Detectors from 3D Models. https://arxiv.org/abs/1412.7122

Tuesday, December 05, 2017

Blue Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and opportunities lost

David Attenborough’s latest homage to biodiversity, Blue Planet II, is, as always, visually magnificent. Much of its impact derives from the new views of life afforded by technological advances in cameras, drones, diving gear, and submersibles. One might hope that the supporting information online reflected the equivalent technological advances made in describing and sharing information. Sadly, this is not the case. Instead the BBC offers a web site with video clips and a poster... a $%@£ poster.

This is a huge missed opportunity. Where do people go to learn more about the organisms featured in an episode? How do we discover related content on the BBC and elsewhere? How do we discover the science underpinning each episode that has been so exquisitely filmed and edited?

Perhaps the lack of an online resource reflects a lack of resources, or expertise? Yet one look at the series (and the "Into the blue" epilogues) tells us that resources are hardly limiting. Furthermore, the BBC has previously constructed rich, informative web sites to support natural history programming. The now deprecated BBC Nature Wildlife site had an extensive series of web pages for the organisms featured in BBC programmes, with links to individual clips. For each organism the corresponding web page listed key traits such as behaviours, habitats, and geographic distribution, and each of these traits had its own web page listing all organisms with those traits (see, for example, the page for Steller's Sea Eagle).

Underlying all this information was a simple vocabulary (the Wildlife Ontology), and the entire corpus is also available in RDF: in other words, the BBC used Semantic Web technologies to structure this information. To get this data you simply append ".rdf" to the URL for a web page. For example, below is the RDF for Steller's Sea Eagle. It is not pretty, but it is a great example of machine-readable data which enables all sorts of interesting things to be built.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:dctypes="http://purl.org/dc/dcmitype/"
xmlns:skos="http://www.w3.org/2004/02/skos/core#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:po="http://purl.org/ontology/po/"
xmlns:wo="http://purl.org/ontology/wo/">
<rdf:Description rdf:about="/nature/species/Steller's_Sea_Eagle">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
<rdfs:seeAlso rdf:resource="/nature/species"/>
</rdf:Description>
<wo:Species rdf:about="/nature/life/Steller's_Sea_Eagle#species">
<rdfs:label>Steller's sea eagle</rdfs:label>
<wo:name rdf:resource="http://www.bbc.co.uk/nature/species/Steller's_Sea_Eagle#name"/>
<foaf:depiction rdf:resource="http://ichef.bbci.co.uk/naturelibrary/images/ic/640x360/s/st/stellers_sea_eagle/stellers_sea_eagle_1.jpg"/>
<dc:description>Steller’s sea eagles are native to eastern Russia, inhabiting coastal cliffs and estuaries where they can easily access good fishing territories. They feed primarily on salmon, which they catch by swooping from perches located by the water's edge. Pairs are monogamous and hatch an average of two chicks each season, although crows and martens commonly take both eggs and young birds from the nest. During winter a small number of birds remain in Russia to tough it out, but the majority fly south to Japan.</dc:description>
<owl:sameAs rdf:resource="http://dbpedia.org/resource/Steller's_Sea_Eagle"/>
<wo:adaptation rdf:resource="/nature/adaptations/Altricial#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Animal_migration#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Carnivore#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Flight#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Hearing_(sense)#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Monogamous_pairing_in_animals#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Oviparity#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Parental_investment#adaptation"/>
<wo:livesIn rdf:resource="/nature/habitats/Coast#habitat"/>
<wo:livesIn rdf:resource="/nature/habitats/Estuary#habitat"/>
<wo:livesIn rdf:resource="/nature/habitats/Marsh#habitat"/>
<wo:livesIn rdf:resource="/nature/habitats/River#habitat"/>
<wo:livesIn rdf:resource="/nature/habitats/Swamp#habitat"/>
<wo:genus rdf:resource="/nature/life/Sea_eagle#genus"/>
<wo:family rdf:resource="/nature/life/Accipitridae#family"/>
<wo:order rdf:resource="/nature/life/Falconiformes#order"/>
<wo:class rdf:resource="/nature/life/Bird#class"/>
<wo:phylum rdf:resource="/nature/life/Chordate#phylum"/>
<wo:kingdom rdf:resource="/nature/life/Animal#kingdom"/>
</wo:Species>
<wo:TaxonName rdf:about="/nature/species/Steller's_Sea_Eagle#name">
<rdfs:label>Haliaeetus pelagicus</rdfs:label>
<wo:commonName>Steller's sea eagle</wo:commonName>
<wo:scientificName>pelagicuspelagicus</wo:scientificName>
<wo:kingdomName>animalia</wo:kingdomName>
<wo:phylumName>Chordata</wo:phylumName>
<wo:className>Aves</wo:className>
<wo:orderName>Falconiformes</wo:orderName>
<wo:familyName>Accipitridae</wo:familyName>
<wo:genusName>Haliaeetus</wo:genusName>
<wo:speciesName>pelagicus</wo:speciesName>
</wo:TaxonName>
<foaf:Image rdf:about="http://ichef.bbci.co.uk/naturelibrary/images/ic/640x360/s/st/stellers_sea_eagle/stellers_sea_eagle_1.jpg">
<foaf:depicts rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
<foaf:thumbnail rdf:resource="http://ichef.bbci.co.uk/naturelibrary/images/ic/83x104/s/st/stellers_sea_eagle/stellers_sea_eagle_1.jpg"/>
</foaf:Image>
<po:Clip rdf:about="http://www.bbc.co.uk/programmes/p00dhn1t#programme">
<dc:title>Lunch on the wing</dc:title>
<po:subject rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</po:Clip>
<po:Clip rdf:about="http://www.bbc.co.uk/programmes/p00382f5#programme">
<dc:title>Steller's sea eagle</dc:title>
<po:subject rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</po:Clip>
<dctypes:Sound rdf:about="http://downloads.bbc.co.uk/earth/naturelibrary/assets/s/st/stellers_sea_eagle/5015017.mp3">
<dc:title>Calls from Steller's and white-tailed sea eagles</dc:title>
<dc:subject rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</dctypes:Sound>
<foaf:Document rdf:about="http://en.wikipedia.org/wiki/Steller's_Sea_Eagle">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://animaldiversity.ummz.umich.edu/site/accounts/information/Haliaeetus_pelagicus.html">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.arkive.org/stellers-sea-eagle/haliaeetus-pelagicus/">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.birdlife.org/datazone/species/index.html?action=SpcHTMDetails.asp&sid=3366&m=0">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.flickr.com/search/show/?q=steller+sea+eagle&s=int">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.iucnredlist.org/details/144342/0">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.natural-research.org/index.php?cID=169">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<wo:ReproductionStrategy rdf:about="/nature/adaptations/Altricial#adaptation">
<rdfs:label>Helpless young</rdfs:label>
</wo:ReproductionStrategy>
<wo:SurvivalStrategy rdf:about="/nature/adaptations/Animal_migration#adaptation">
<rdfs:label>Migration</rdfs:label>
</wo:SurvivalStrategy>
<wo:FeedingHabit rdf:about="/nature/adaptations/Carnivore#adaptation">
<rdfs:label>Carnivorous</rdfs:label>
</wo:FeedingHabit>
<wo:LocomotionAdaptation rdf:about="/nature/adaptations/Flight#adaptation">
<rdfs:label>Adapted to flying</rdfs:label>
</wo:LocomotionAdaptation>
<wo:CommunicationAdaptation rdf:about="/nature/adaptations/Hearing_(sense)#adaptation">
<rdfs:label>Acoustic communication</rdfs:label>
</wo:CommunicationAdaptation>
<wo:ReproductionStrategy rdf:about="/nature/adaptations/Monogamous_pairing_in_animals#adaptation">
<rdfs:label>Monogamous</rdfs:label>
</wo:ReproductionStrategy>
<wo:ReproductionStrategy rdf:about="/nature/adaptations/Oviparity#adaptation">
<rdfs:label>Egg layer</rdfs:label>
</wo:ReproductionStrategy>
<wo:LifeCycle rdf:about="/nature/adaptations/Parental_investment#adaptation">
<rdfs:label>Parental investment</rdfs:label>
</wo:LifeCycle>
<wo:TerrestrialHabitat rdf:about="/nature/habitats/Coast#habitat">
<rdfs:label>Coastal</rdfs:label>
</wo:TerrestrialHabitat>
<wo:MarineHabitat rdf:about="/nature/habitats/Estuary#habitat">
<rdfs:label>Estuaries</rdfs:label>
</wo:MarineHabitat>
<wo:FreshwaterHabitat rdf:about="/nature/habitats/Marsh#habitat">
<rdfs:label>Marsh</rdfs:label>
</wo:FreshwaterHabitat>
<wo:FreshwaterHabitat rdf:about="/nature/habitats/River#habitat">
<rdfs:label>Rivers and streams</rdfs:label>
</wo:FreshwaterHabitat>
<wo:FreshwaterHabitat rdf:about="/nature/habitats/Swamp#habitat">
<rdfs:label>Swamp</rdfs:label>
</wo:FreshwaterHabitat>
<wo:Genus rdf:about="/nature/genus/Sea_eagle#genus">
<rdfs:label>Haliaeetus</rdfs:label>
<wo:species rdf:resource="/nature/life/Steller's_Sea_Eagle#species"/>
<wo:species rdf:resource="/nature/life/African_Fish_Eagle#species"/>
<wo:species rdf:resource="/nature/life/White-tailed_Eagle#species"/>
</wo:Genus>
<wo:Family rdf:about="/nature/family/Accipitridae#family">
<rdfs:label>Accipitridae</rdfs:label>
</wo:Family>
<wo:Order rdf:about="/nature/order/Falconiformes#order">
<rdfs:label>Falconiformes</rdfs:label>
</wo:Order>
<wo:Class rdf:about="/nature/class/Bird#class">
<rdfs:label>Aves</rdfs:label>
</wo:Class>
<wo:Phylum rdf:about="/nature/phylum/Chordate#phylum">
<rdfs:label>Chordata</rdfs:label>
</wo:Phylum>
<wo:Kingdom rdf:about="/nature/kingdom/Animal#kingdom">
<rdfs:label>animalia</rdfs:label>
</wo:Kingdom>
</rdf:RDF>

For some reason, this web site is now deprecated. As an exercise I grabbed the RDF from the web site, did a little cleaning, and merged it together resulting in a set of around 94,500 triples (statements of the form “subject”, “predicate”, “object”). For example, this triple says that Steller's Sea Eagle is monogamous.

[/nature/life/Steller's_Sea_Eagle#species,
wo:adaptation,
/nature/adaptations/Monogamous_pairing_in_animals#adaptation]
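
To show how statements like this fall out of the RDF/XML above, here is a minimal sketch in Python using only the standard library. It is not the code used for the exercise described here, and it is not a general RDF parser: it assumes the flat layout seen in the BBC listing (every top-level element has an rdf:about, and each property is either an rdf:resource link or a plain text literal). The function names are made up for illustration, and relative URIs are kept exactly as written.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def to_uri(tag):
    """Turn ElementTree's '{namespace}local' tag into a plain 'namespacelocal' URI."""
    namespace, _, local = tag[1:].partition("}")
    return namespace + local


def flat_rdfxml_to_triples(xml_text):
    """Extract (subject, predicate, object) tuples from flat RDF/XML.

    Assumes every child of <rdf:RDF> carries rdf:about and that its properties
    are either rdf:resource links or plain text literals (no nesting).
    """
    triples = []
    for node in ET.fromstring(xml_text):
        subject = node.get(f"{{{RDF}}}about")
        if subject is None:
            continue  # blank nodes and other constructs are out of scope here
        if node.tag != f"{{{RDF}}}Description":
            # a typed node such as <wo:Species> implies an rdf:type statement
            triples.append((subject, RDF + "type", to_uri(node.tag)))
        for prop in node:
            resource = prop.get(f"{{{RDF}}}resource")
            obj = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, to_uri(prop.tag), obj))
    return triples


# A three-statement excerpt of the Steller's Sea Eagle record shown above
SAMPLE = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
 xmlns:wo="http://purl.org/ontology/wo/">
<wo:Species rdf:about="/nature/life/Steller's_Sea_Eagle#species">
<rdfs:label>Steller's sea eagle</rdfs:label>
<wo:adaptation rdf:resource="/nature/adaptations/Monogamous_pairing_in_animals#adaptation"/>
</wo:Species>
</rdf:RDF>"""

triples = flat_rdfxml_to_triples(SAMPLE)
for triple in triples:
    print(triple)
```

The excerpt yields three triples: the type of the node, its label, and the "monogamous" statement. A production pipeline would more likely use a full RDF library, since real RDF/XML allows nesting, blank nodes, and base URIs that this sketch ignores.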

One reason the Semantic Web has struggled to gain widespread adoption is the long list of things you need to get to the point where it is usable. You need data consistently structured using the same vocabulary. You need identifiers that everyone agrees on (or at least can map their own identifiers to). And you need a triple store, which is essentially a graph database, a technology that is still unfamiliar to many. But in this case the BBC has done a lot of the hard work by cleverly minting identifiers based on Wikipedia URLs (“slugs”), and developing a vocabulary to express relationships between organisms, traits, and habitats. All that’s needed is a way to query this data. Rather than use a triple store (most of which are not much fun to install or maintain) I’ve used the delightfully simple approach of employing a Hexastore. Hexastores provide fast querying of graphs by indexing all six permutations of the subject, predicate, object triple (hence “hexa”). The approach is sufficiently simple that for moderately sized databases we can implement it in JavaScript and run it in a web browser.
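
To make the six-index idea concrete, here is a minimal sketch of a hexastore, written in Python for brevity rather than the JavaScript mentioned above. It is an illustration under simple assumptions (everything held in memory, terms stored as plain strings, shortened "wo:" predicate names), and the class and method names are hypothetical, not taken from any particular library.

```python
from collections import defaultdict
from itertools import permutations


class Hexastore:
    """Toy in-memory triple index keyed by all six orderings of (s, p, o)."""

    def __init__(self):
        # one nested index per ordering: 'spo', 'sop', 'pso', 'pos', 'osp', 'ops'
        self.indexes = {
            "".join(order): defaultdict(lambda: defaultdict(set))
            for order in permutations("spo")
        }

    def add(self, s, p, o):
        term = {"s": s, "p": p, "o": o}
        for order, index in self.indexes.items():
            a, b, c = (term[k] for k in order)
            index[a][b].add(c)

    def query(self, s=None, p=None, o=None):
        """Return matching triples; None acts as a wildcard."""
        term = {"s": s, "p": p, "o": o}
        # put the bound positions first so lookups start from known values
        order = sorted("spo", key=lambda k: term[k] is None)
        index = self.indexes["".join(order)]
        a, b, c = (term[k] for k in order)
        results = []
        for x in [a] if a is not None else list(index):
            level2 = index.get(x, {})
            for y in [b] if b is not None else list(level2):
                level3 = level2.get(y, set())
                for z in [c] if c is not None else list(level3):
                    if z in level3:
                        found = dict(zip(order, (x, y, z)))
                        results.append((found["s"], found["p"], found["o"]))
        return results


# Three statements taken from the Steller's Sea Eagle record above
EAGLE = "/nature/life/Steller's_Sea_Eagle#species"
FLIGHT = "/nature/adaptations/Flight#adaptation"
MONOGAMOUS = "/nature/adaptations/Monogamous_pairing_in_animals#adaptation"
COAST = "/nature/habitats/Coast#habitat"

store = Hexastore()
store.add(EAGLE, "wo:adaptation", FLIGHT)
store.add(EAGLE, "wo:adaptation", MONOGAMOUS)
store.add(EAGLE, "wo:livesIn", COAST)

# "Which organisms are adapted to flying?" uses the predicate-object-subject index
print(store.query(p="wo:adaptation", o=FLIGHT))
# "What do we know about this species?" uses the subject-predicate-object index
print(store.query(s=EAGLE))
```

The price of this design is storage: every triple is written six times. In return, any question with one or two known terms becomes a direct dictionary lookup rather than a scan, which is why the approach suits a dataset of this size.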

As a demonstration, I created a very crude hexastore-based version of the BBC pages (https://rdmpage.github.io/bbc-wildlife/www/).

Once you load the page there are no further server requests, other than fetching images. Every query is “live” but takes place in the browser. You can click on the image for a species and get some textual information, as well as images representing traits of that organism. Click on a trait and you discover what organisms share those traits. This example is trivial, but surprisingly rich. I’ve found it fascinating to simply bounce around the images discovering unexpected facts about different species. There’s lots of potential for serendipitous discovery, as well as an enhanced appreciation for just how rich the BBC’s content is. If the Encyclopedia of Life were this engaging I’d be its biggest fan.

The question, then, is why a similar approach was not taken for Blue Planet II. It can’t be a lack of resources: this series has amazing production values. And yet a wonderful opportunity has been missed. Why not build on the existing work and create an interactive resource that encourages people to explore more deeply and learn more? Much of the existing data could be used, as well as adding all the new species and behaviours we see on our TV screens. Blue Planet also highlights the impacts humans are having on the marine environment; these could be added as categories as well, to show what organisms are susceptible to different impacts (e.g., plastics).

That the BBC thinks a poster is an adequate form of engagement in the digital age speaks of a corporation that, in spite of many triumphs in the digital sphere (e.g., iPlayer), has not fully grasped the role the web can play in making its content more widely useful and relevant, beyond enthralling viewers on a Sunday evening. It also seems oblivious to the fact that it already knows how to deliver rich, informative online content (as evidenced by the now deprecated Wildlife application). So please, BBC, can we have a resource that enables us to learn more about the organisms and habitats that are the subjects of the grandeur and beauty we see on our TV screens?

Follow up

Below is some of the discussion this post generated on Twitter.