Wednesday, April 15, 2015

GBIF Ebbe Nielsen Challenge finalists announced

The six finalists for the GBIF Ebbe Nielsen Challenge have been announced by GBIF:

“The creativity and ambition displayed by the finalists is inspiring’, said Roderic Page, chair of the Challenge jury and the GBIF Science Committee, who introduced the Challenge at GBIF’s 2014 Science Symposium in October.

“My biggest hope for the Challenge was that the biodiversity community would respond with innovative—even unexpected—entries,” Page said. “My expectations have been exceeded, and the Jury is eager to see what the finalists can achieve between now and the final round of judging.”

The finalists all receive a €1,000 prize, and now have the possibility to refine their work and compete for the grand prize of €20,000 (€5000 for second place). F1F2F3 As the rather cheesy quote above suggests, I think the challenge has been a success in terms of the interest generated, and the quality of the entrants. While the finalists bask in glory, it's worth thinking about the future of the challenge. If it is regarded as a success, should it be run in the same way next year? The first challenge was very open in terms of scope (pretty much anything that used GBIF data), would it be better to target the challenge on a more focussed area? If so, which area needs the nost attention. Food for thought.

Linking specimen codes to GBIF

I've put together a working demo of some code I've been working on to discover GBIF records that correspond to museum specimen codes. The live demo is at and code is on GitHub.

To use the demo, simply paste in a specimen code (e.g., "MCZ 24351") and click Find and it will do it's best to parse the code, then go off to GBIF and see what it can find. Some examples that are fun include MCZ 24351, KU:IT:00312, MNHN 2003-1054, and AMS I33708-051


It's proof of concept at this stage, and the search is "live", I'm not (yet) storing any results. For now I simply want to explore how well if can find matches in GBIF.

By itself this isn't terribly exciting, but it's a key step towards some of the things I want to do. For example, the NCBI is interested in flagging sequences from type specimens (see ), so we could imagine taking lists of type specimens from GBIF and trying to match those to voucher codes in GenBank. I've played a little with this, unfortunately there seem to be lots of cases where GBIF doesn’t know that a specimen is, in fact, a type.

Another thing I’m interested in is cases where GBIF has a georeferenced specimen but GenBank doesn’t (or visa versa), as a stepping stone towards creating geophylogenies. For example, in order to create a geophylogeny for Agnotecous crickets in New Caledonia (see GeoJSON and geophylogenies ) I needed to combine sequence data from NCBI with locality data from GBIF.

It’s becoming increasingly clear to me that the data supplied to GBIF is often horribly out of date compared to what is in the literature. Often all GBIF gets is what has been scribbled in a collection catalogue. By linking GBIF records to specimen codes cited that are cited in the literature we could imagine giving GBIF users enhanced information on a given occurrence (and at the same time get citation counts for specimens The impact of museum collections: one collection ≈ one Nobel Prize).

Lastly, if we can link specimens to sequences and the literature, then we can populate more of the biodiversity knowledge graph

Tuesday, March 10, 2015

GBIF Ebbe Nielsen Challenge submissions: judging begins

The GBIF Ebbe Nielsen Challenge has closed and we have 23 submissions for the jury to evaluate. There's quite a range of project types (and media, including sound and physical objects), and it's going to be fascinating to evaluate all the entries (some of which are shown below). This is the first time GBIF has run this challenge, so it's gratifying to see so much creativity in response to the challenge. Submissions While judging itself is limited to the jury (of which I'm a member), I'd encourage anyone interested in biodiversity informatics to browse the submissions. Although you can't leave comments directly on the submissions within the GBIF Challenge pages, each submission also appears on the portfolio page of the person/organisation that created the entry, so you can leave comments there (follow the link at the bottom of the page for each submission to see it on the portfolio page).

Friday, February 20, 2015

More examples of data duplication and loss in GBIF: Australian bats in bits

Quick notes on another example of data duplication in GBIF. I'm in the process of building a tool to map specimen codes to GBIF records, and came across the following example. Consider the specimen code "AM M.22320", which is the voucher for the sequence KJ532444 (GenBank gives the voucher as M22320, but the associated paper doi:10.1016/j.ympev.2014.03.009 clarifies that this specimen comes from the Australian Museum). Locating this specimen in GBIF I found a series of records that were identical except for the catalogNumbers, which looked like this: M.22320.001, M.22320.002, etc. What gives?

Initially I thought this may be a simple case of data duplication (maybe the suffixes represent different versions of the same record?). Then I managed to locate the records on the Australian Museum web site:.

  • M.22320.009 - Wet Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.008 - Skull Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.001 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.005 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.006 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.007 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.003 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.004 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.002 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.010 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990

Turns out we have 10 records for "M.22320", which include various preparations and tissue samples. The holotype specimen for Pteralopex taki (originally described in doi:10.1071/AM01145, see BioNames) has generated 10 different records., all of which have ended up in GBIF. Anyone using GBIF occurrence data and interpreting the number of occurrence records as a measure of how abundant an organism is at a given locality is clearly going to be misled by data like this.

One way to tackle this problem would be if GBIF (or the data provider) could cluster the records that represent the "same" specimen, so GBIF doesn't end up duplicating the same information (in this case, 10-fold). The Australian Museum records don't seem to specify a direct link between the 10 records. I then located the same records in OZCAM, the data provider that feeds GBIF. Here is the OZCAM record for "M.22320.001": OZCAM doesn't have the information on whether the record is a skull, a wet preparation, or a tissue sample, that information has been lost, and hence doesn't make it as far as GBIF.

Note that OZCAM has resolvable identifiers for each specimen in the form of UUIDs that are appended to "". The corresponding UUIDs are included in the Darwin Core dump that OZCAM makes available to GBIF. Here they are for the parts of M.22320:


But when GBIF parses the dump it ignores these UUIDs, which means the GBIF user can't easily go to the OZCAM site (which has a bunch of other useful information, compare with ). It also means that GBIF has stripped out an identifier that we might make use of to unambiguously refer to each record (and, presumably, this UUID doesn't change between harvests of OZCAM data).

In summary, this is a bit of a mess: we have multiple records that are really just bits of the same specimen but which are not linked together by any data provider, and as the data is transmitted up the chain to GBIF clues as to what is going on are stripped out. For a user like me who is trying to link the GenBank sequence to its voucher this is frustrating, and ultimately all rather avoidable if we took just a little more care in how we represent data about specimens, and how we treat that data as it gets transmitted between data bases.

Thursday, February 19, 2015

Post publication review on PubPeer

036e889646a0ff112e5f150a625f5268PubPeer is a web site where people can discuss published articles, anonymously if they prefer. I finally got a chance to play with it a few days, it it was a fascinating experience. You simply type in the DOI or PMID for an article and see if anyone has said anything about that article. It also automatically pulls comments from PubMed Commons, for example the article Putting GenBank data on the map has a comment that was originally published as a guest post on this blog. PubPeer knows about this blog post via Altmetric, which is another nice feature. PubPeer also has browser extensions which, if you install one, automatically flags DOIs on web pages that have comments on PubPeer. Also nice.

So, I took PubPeer for a spin. While browsing GenBank and GBIF, as you do, I came across the following paper: "Conservation genetics of Australasian sailfin lizards: Flagship species threatened by coastal development and insufficient protected area coverage" doi:10.1016/j.biocon.2013.10.014. Some of the sequences from this paper, such as KF874877 are flagged as "UNVERIFIED". Puzzled by this, I raised the issue on PubPeer (see ). A little further digging led to the suggestion that they were numts. After raising the issue on Twitter, one of the authors (Cameron Siler) got in touch and reported that there had been an accidental deletion of a single nucleotide in an alignment. Cameron is updating the Dryad data ( ) and GenBank sequences.

I like the idea that there is a place we can go to discuss the contents of a paper. It's not controlled by the journal, and you can either identify yourself or remain anonymous if you prefer. Not everyone is a fan of this mode of commentary, especially it is possible for people to make all sorts of accusations while remaining anonymous. But it's a fascinating project, and well worth spending some time browsing around (what IS it with physicists?). For anyone interested in annotating data, it's also a nice example of one possible approach.

Wednesday, January 28, 2015

Annotating GBIF, from datasets to nanopublications

Below I sketch what I believe is a straightforward way GBIF could tackle the issue of annotating and cleaning its data. It continues a series of posts Annotating GBIF: some thoughts, Rethinking annotating biodiversity data, and More on annotating biodiversity data: beyond sticky notes and wikis on this topic.

Let's simplify things a little and state that GBIF at present is essentially an aggregation of Darwin Core Archive files. These are for the most part simply CSV tables (spreadsheets) with some associated administrivia (AKA metadata). GBIF consumes Darwin Core Archives, does some post-processing to clean things up a little, then indexes the contents on key fields such as catalogue number, taxon name, and geographic coordinates.

What I'm proposing is that we make use of this infrastructure, in that any annotation is itself a Darwin Core Archive file that GBIF ingests. I envisage three typical use cases:

  1. A user downloads some GBIF data, cleans it for their purposes (e.g., by updating taxonomic names, adding some georeferencing, etc.) then uploads the edited data to GBIF as a Darwin Core Archive. This edited file gets a DOI (unless the user has go one already, say by storing the data in a digital archive like Zenodo).
  2. A user takes some GBIF data and enhances it by adding links to, for example, sequences in GenBank for which the GBIF occurrences are voucher specimens, or references which cite those occurrences. The enhanced data set is uploaded to GBIF as a Darwin Core Archive and, as above, gets a DOI.
  3. A user edits an individual GBIf record, say using an interface like this. The result is stored as a Darwin Core Archive with a single row (corresponding to the edit occurrence), and gets a DOI (this is a nanopublication, of which more later)

Note that I'm ignoring the other type of annotation, which is to simply say "there is a problem with this record". This annotation doesn't add data, but instead flags an issue. GBIF has a mechanism for doing this already, albeit one that is deeply unsatisfactory and isn't integrated with the portal (you can't tell whether anyone has raised an issue for a record).

Note also that at this stage we've done nothing that GBIF doesn't already do, or isn't about to do (e.g., minting DOIs for datasets). Now, there is one inevitable consequence of this approach, namely that we will have more than one record for the same occurrence, the original one in GBIF, and the edited record. But, we are in this situation already. GBIF has duplicate records, lots of them.


As an example, consider the following two occurrences for Psilogramma menephron:

occurrencetaxonlongitudelatitudecatalogue numbersequence
887386322Psilogramma menephron Cramer, 1780145.86301-17.44BC ZSM Lep 01337
1009633027Psilogramma menephron Cramer, 1780145.86-17.44KJ168695KJ168695

These two occurrences come from the Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data and Geographically tagged INSDC sequences data sets, respectively. They are for the same occurrence (you can verify this by looking at the metadata data for the sequence KJ168695 where the specimen_voucher field is "BC ZSM Lep 01337").

What do we do about this? One approach would be to group all such occurrences into clusters that represent the same thing. We are then in a position to do some interesting things, such as compare different estimates of the same values. In the example above, there is clearly a difference in precision of geographic locality between the two datasets. There are some nice techniques available for synthesising multiple estimates of the same value (e.g., Bayesian belief networks), so we could provide for each cluster a summary of the possible values for each field. We can also use these methods to build up a picture of the reliability of different sources of annotation.

In a sense, we can regard one record (1009633027) as adding an annotation to the other (887386322), namely adding the DNA sequence KJ168695 (in Darwin Core parlance, "associatedSequences=[KJ168695]").

But the key point here is that GBIF will have to at some point address the issue of massive duplication of data, and in doing so it will create an opportunity to solve the annotation problem as well.

Github and DOIs

In terms of practicalities, it's worth noting that we could use github to manage editing GBIF data, as I've explored in GBIF and Github: fixing broken Darwin Core Archives. Although github might not be ideal (there some very cool alternatives being developed, such as dat, see also interview with Max Ogden) it has the nice feature that you can publish a release and get a DOI via its integration with Zenodo. So people can work on datasets and create citable identifiers at the same time.


If we consider that a Darwin Core Archive is basically a set of rows of data, then the minimal unit is a single row (corresponding to a single occurrence). This is the level at which some users will operate. They will see an error in GBIF and be able to edit the record (e.g., by adding georeferencing, an identification, etc.). One challenge is how to create incentives for doing this. One approach is to think in terms of nanopublications, which are:
A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
A nanopublication comprises three elements:
  1. The assertion: In this context the Darwin Core record would be the assertion. It might be a minimal record in that, say, it only listed the fields relevant to the annotation.
  2. The provenance: the evidence for the assertion. This might be the DOI of a publication that supports the annotation.
  3. The publication information: metadata for the nanopublication, including a way to cite the nanopublication (such as a DOI), and information on the author of the nanopublication. For example, the ORCID of the person annotating the GBIF record.

As an example, consider GBIF occurrence 668534424 for specimen FMNH 235034, which according to GBIF is a specimen of Rhacophorus reinwardtii. In a recent paper

Matsui, M., Shimada, T., & Sudin, A. (2013, August). A New Gliding Frog of the Genus Rhacophorus from Borneo . Current Herpetology. Herpetological Society of Japan. doi:10.5358/hsj.32.112
Matsui et al. assert that FMNH 235034 is actually Rhacophorus borneensis based on a phylogenetic analysis of a sequence (GQ204713) derived from that specimen. In which case, we could have something like this:

The nanopublication standard is evolving, and has a lot of RDF baggage that we'd need to simplify to make fit the Darwin Core model of a flat row of data, but you could imagine having a nanopublication which is a Darwin Core Archive that includes the provenance and publication information, and gets a citable identifier so that the person who created the nanopublication (in the example above I am the author of the nanopublication) can get credit for the work involved in creating the annotation. Using citable DOIs and ORCIDs to identify the nanpublication and its author embeds the nanopublication in the wider citation graph.

Note that nanopublications are not really any different from larger datasets, indeed we can think of a dataset of, say, 1000 rows as simply an aggregation of nanopublications. However, one difference is that I think GBIF would have to setup the infrastructure to manage the creation of nanopublications (which is basically collect user's input, add user id, save and mint DOI). Whereas users working with large datasets may well be happy to work with those on, say github or some other data editing environment, people willing to edit single records are unlikely to want to mess with that complexity.

What about the original providers?

Under this model, the original data provider's contribution to GBIF isn't touched. If a user adds an annotation that amounts to adding a copy of the record, with some differences (corresponding to the user's edits). Now, the data provider may chose to accept those edits, in which case they can edit their own database using whatever system they have in place, and then the next time GBIF re-harvests the data, the original record in GBIF gets updated with the new data (this assumes that data providers have stable ids for their records). Under this approach we free ourselves from thinking about complicated messaging protocols between providers and aggregators, and we also free ourselves from having to wait until an edit is "approved" by a provider. Any annotation is available instantly.


My goal here is to sketch out what I think is a straightforward way to tackle annotation that makes use of what GBIF is already doing (aggregating Darwin Core Archives) or will have to do real soon now (cluster duplicates). The annotated and cleaned data can, of course, live anywhere (and I'm suggesting that it could live on github and be archived on Zenodo), so people who clean and edit data are not simply doing it for the good of GBIF, they are creating data sets that can be used independently and be cited independently. Likewise, even if somebody goes to the trouble of fixing a single record in GBIF, they get a citable unit of work that will be linked to their academic profile (via ORCD).

Another aspect of this approach is that we don't actually need to wait for GBIF to do this. If we adopt Darwin Core Archive as the format for annotations, we can create annotations, mint DOIs, and build our own database of annotated data, with a view to being able to move that work to GBIF if and when GBIF is ready.

Thursday, January 22, 2015

GeoJSON and geophylogenies

For the last few weeks I've been working on a little project to display phylogenies on web-based maps such as OpenStreetMap and Google Maps. Below I'll sketch out the rationale, but if you're in a hurry you can see a live demo here:, and some examples below.

The first is the well-known example of Banza katydids from doi:10.1016/j.ympev.2006.04.006, which I used in 2007 when playing with Google Earth.


The second example shows DNA barcodes similar to ABFG379-10 for Proechimys guyannensis and its relatives.



People have been putting phylogenies on computer-based maps for a while, but in most cases these have required stand-alone software, such as Google Earth, or GeoJSON for encoding geographic information. Despite the obvious appeal of placing trees in maps, and calls for large-scale geophylogeny databases (e.g., do:10.1093/sysbio/syq043), computerised drawing trees on maps has remained a bit of a niche activity. I think there are several reasons for this:

  1. Drawing trees on maps needs both a tree and geographic localities for the nodes in the tree. The later are not always readily available, or may be in different databases to the source of phylogenetic data.
  2. There's no accepted standard for encoding geographic information associated with the leaves in a tree, so everyone pretty much invents their own format.
  3. To draw the tree we typically need standalone software. This means users have to download software, instead of work on the web (which is where all the data is).
  4. Geographic formats such as KML (used by Google Earth) are not particularly easy to store and index in databases.

So there are a number of obstacles to making this easy. The increasing availability of geotagged sequences in GenBank (see Guest post: response to "Putting GenBank Data on the Map"), especially DNA barcodes, helps. For the demo I created a simple pipeline to take a DNA barcode, query BOLD for similar sequences, retrieve those, align them, build a neighbour joining tree, annotate the tree with latitude and longitudes, and encode that information in a NEXUS file.

To layout the tree on a map (say OpenStreetMap using Leaflet or Google Maps) I convert the NEXUS file to GeoJSON. There are a couple of problems to solve when doing this.Typically when drawing a phylogeny we compute x and y coordinates for a device such as a computer screen or printer where these coordinates have equal units and are linear in both horizontal and vertical dimensions. In web maps coordinates are expressed in terms of latitude and longitude, and in the widely-used Web Mercator projection the vertical axis (latitude) is non-linear. Furthermore, on a web map the user can zoom in and out, so pixel-based coordinates only make sense with respect to a given zoom level.

To tackle this I compute the layout of the tree in pixels at zoom level 0, when the web map comprises a single "tile".


The tile coordinates are then converted to latitude and longitude, so that they can be placed on the map. The map applications take care of zooming in and out, so the tree scales appropriately. The actual sampling localities are simply markers on the map. Another problem is to reduce the visual clutter that results from criss-crossing lines connecting connecting the tips of the tree and the associated sampling localities. To make the diagram more comprehensible, I adopt the approach used by GenGIS to reorder the nodes in the tree to minimise the crossings (see algorithm in doi:10.7155/jgaa.00088). The tree and the lines connecting it to the localities are encoded as "LineString" objects in the GeoJSON file.

There are a couple of things which could be done with this kind of tool. The first is to add it as a visualisation to a set of phylogenies or occurrence data. For example, imagine my "million barcode map" having the ability to display a geophylogeny for any barcode you click on.

Another use would be to create a geographically indexed database of phylogenies. There are databases such as CouchDB that store JSON as a native format, and it would be fairly straightforward to consume GeoJSON for a geophylogeny, ignore the bits that draw the tree on the map, and index the localities. We could then search for trees in a given region, and render them on a map.

There's still some work to do (I need to make the orientation of the tree optional and there are some edges case that need to be handled), but it's starting to reach the point when it's fun just to explore some examples, such as these microendemic Agnotecous crickets in New Caledonia (data from doi:10.1371/journal.pone.0048047 and GBIF).


Tuesday, January 20, 2015

Bitcoin, Xanadu, Ted Nelson, and the web that wasn't

A couple of articles in the tech press got me thinking this morning about Bitcoin, Ted Nelson, Xanadu, and the web that wasn't. The articles are After The Social Web, Here Comes The Trust Web and Transforming the web into a HTTPA 'database'. There are some really interesting ideas being explored based on centralised tracking of resources (including money, think Bitcoin, and other assets, think content). I wonder whether these developments may make lead to renewed interest in some of the ideas of Ted Nelson.

I've always had a soft-spot for Ted Nelson and his Xanadu project (see my earlier posts on translcusion and Nature's ENCODE app). To get a sense of what he was after, we can compare the web we have with what Nelson envisaged.

The web we have today:

  1. Links are one-way, in that it's easy to link to a site (just use the URL), but it's hard for the target site to find out who links to it. Put another way, like writing a scientific paper, it's easy to cite another paper, but non-trivial to find out who is citing your own work.
  2. Links to another document are simply launching pads to go to that other document, whether it's still there or not.
  3. Content is typically either "free" (paid for by advertising or in exchange for personal data), or behind a paywall and hence expensive.

Nelson wanted:

  1. Links that were bidirectional (so not only did you get cited, but you knew who was citing you)
  2. "Transclusion", where documents would not simply link (=cite) to other documents but would include snippets of those documents. If you cited another document, say to support a claim, you would actually include the relevant fragment of that document in your own document.
  3. A micropayment system so that if your work was "transcluded" you could get paid for that content.

The web we have is in many ways much easier to build, so Nelson's vision lost out. One-way links are easy to create (just paste in a URL), and the 404 error (that you get when a web page is missing) makes it robust to failure. If a page vanishes, things don't collapse, you just backtrack and go somewhere else.

Nelson had a more tightly linked web. He wanted to keep track of who links to whom automatically. Doing this today is the preserve of big operations such as Google (who count links to rank search results) or the Web of Science (who count citations to rank articles and journals - note that I'm using the web and the citation network pretty much interchangeably in this post). Because citation tracking isn't built into the web, you need create this feature, and that costs money (and hence nobody provides access to citation data for free).

In the current web, stuff (content) is either given away for "free" (or simply copied and pasted as if it was free), or locked behind paywalls. Free, of course, is never free. We are either handing over data, being the targets of advertising (better targeted the more data we hand over), or we pay for freedom directly (e.g., open access publication fees in the case of scientific articles). Alternatively, we have the paywalls well know to academics, where much of the world's knowledge is held behind expensive paywalls (in part because publishers need some way to make money, and there's little middle ground between free and expensive).

Nelson's model envisaged micropayments, where content creators would get small payments every time their content was used. Under the transclusion model, only small bits of your content might be used (in the context of a scientific paper, imagine just a single fact or statement was used). You didn't get everything for free (that would destroy the incentive to create), but nor was everything locked up behind prohibitively expensive paywalls. Nelson's model never took off, in part I suspect because there was simply no way to (a) track who was using the content, and (b) collect micropayments.

What is interesting is that Bitcoin seems to deal with the micropayments issue, and the HTTPA protocol (which uses much the same idea as Bitcoin to keep an audit trail of who has accessed and used data) may provide a mechanism to track usage. How is this going to change the current web? Might there be ways to use these ideas to reimagine academic publishing, which at the moment seems caught between steep open access fees or expensive journal subscriptions?