Wednesday, January 28, 2015

Annotating GBIF, from datasets to nanopublications

Below I sketch what I believe is a straightforward way GBIF could tackle the issue of annotating and cleaning its data. It continues a series of posts Annotating GBIF: some thoughts, Rethinking annotating biodiversity data, and More on annotating biodiversity data: beyond sticky notes and wikis on this topic.

Let's simplify things a little and state that GBIF at present is essentially an aggregation of Darwin Core Archive files. These are for the most part simply CSV tables (spreadsheets) with some associated administrivia (AKA metadata). GBIF consumes Darwin Core Archives, does some post-processing to clean things up a little, then indexes the contents on key fields such as catalogue number, taxon name, and geographic coordinates.

What I'm proposing is that we make use of this infrastructure, in that any annotation is itself a Darwin Core Archive file that GBIF ingests. I envisage three typical use cases:

  1. A user downloads some GBIF data, cleans it for their purposes (e.g., by updating taxonomic names, adding some georeferencing, etc.) then uploads the edited data to GBIF as a Darwin Core Archive. This edited file gets a DOI (unless the user has got one already, say by storing the data in a digital archive like Zenodo).
  2. A user takes some GBIF data and enhances it by adding links to, for example, sequences in GenBank for which the GBIF occurrences are voucher specimens, or references which cite those occurrences. The enhanced data set is uploaded to GBIF as a Darwin Core Archive and, as above, gets a DOI.
  3. A user edits an individual GBIF record, say using an interface like this. The result is stored as a Darwin Core Archive with a single row (corresponding to the edited occurrence), and gets a DOI (this is a nanopublication, of which more later).
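As a concrete illustration of the third case, here is a minimal sketch of packaging a single edited occurrence as a Darwin Core Archive: a zip file containing an occurrence table plus a meta.xml descriptor. The Darwin Core term URIs are real, but the file names, field selection, and record values are illustrative, and a production archive would carry richer metadata.

```python
import csv
import io
import zipfile

# Minimal meta.xml descriptor mapping CSV columns to Darwin Core terms.
# "\\n" renders as a literal backslash-n, which is what meta.xml expects.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
  </core>
</archive>
"""

def build_archive(path, rows):
    """Write a one-table Darwin Core Archive (zip of CSV + meta.xml)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["occurrenceID", "scientificName",
                     "decimalLatitude", "decimalLongitude"])
    writer.writerows(rows)
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("occurrence.csv", buf.getvalue())
        zf.writestr("meta.xml", META_XML)

# A single-row archive: one edited occurrence, ready to get a DOI.
build_archive("annotation.zip", [
    ["887386322", "Psilogramma menephron", "-17.44", "145.86301"],
])
```

The point is that the unit of annotation is the same format GBIF already ingests, whether it holds one row or a million.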

Note that I'm ignoring the other type of annotation, which is to simply say "there is a problem with this record". This annotation doesn't add data, but instead flags an issue. GBIF has a mechanism for doing this already, albeit one that is deeply unsatisfactory and isn't integrated with the portal (you can't tell whether anyone has raised an issue for a record).

Note also that at this stage we've done nothing that GBIF doesn't already do, or isn't about to do (e.g., minting DOIs for datasets). Now, there is one inevitable consequence of this approach, namely that we will have more than one record for the same occurrence, the original one in GBIF, and the edited record. But, we are in this situation already. GBIF has duplicate records, lots of them.


As an example, consider the following two occurrences for Psilogramma menephron:

| occurrence | taxon | longitude | latitude | catalogue number | sequence |
|---|---|---|---|---|---|
| 887386322 | Psilogramma menephron Cramer, 1780 | 145.86301 | -17.44 | BC ZSM Lep 01337 | |
| 1009633027 | Psilogramma menephron Cramer, 1780 | 145.86 | -17.44 | KJ168695 | KJ168695 |

These two occurrences come from the Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data and Geographically tagged INSDC sequences data sets, respectively. They are for the same occurrence (you can verify this by looking at the metadata for the sequence KJ168695, where the specimen_voucher field is "BC ZSM Lep 01337").

What do we do about this? One approach would be to group all such occurrences into clusters that represent the same thing. We are then in a position to do some interesting things, such as compare different estimates of the same values. In the example above, there is clearly a difference in precision of geographic locality between the two datasets. There are some nice techniques available for synthesising multiple estimates of the same value (e.g., Bayesian belief networks), so we could provide for each cluster a summary of the possible values for each field. We can also use these methods to build up a picture of the reliability of different sources of annotation.
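The clustering step can be sketched very simply: treat any shared identifier (a catalogue number, or a specimen voucher pulled from a linked sequence record) as evidence that two records are the same occurrence, and take the connected components. The record structure below is invented for illustration; the identifiers are the ones from the examples in this post.

```python
from collections import defaultdict

def cluster(records):
    """Group records into clusters linked by shared identifier values."""
    by_key = defaultdict(list)
    for rec in records:
        for key in rec["identifiers"]:
            by_key[key].append(rec["id"])
    # Union-find over record ids: records sharing a key get merged.
    parent = {rec["id"]: rec["id"] for rec in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for ids in by_key.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])
    clusters = defaultdict(set)
    for rec in records:
        clusters[find(rec["id"])].add(rec["id"])
    return list(clusters.values())

records = [
    # The Psilogramma menephron pair: linked via the voucher code.
    {"id": "887386322", "identifiers": {"BC ZSM Lep 01337"}},
    {"id": "1009633027", "identifiers": {"KJ168695", "BC ZSM Lep 01337"}},
    # An unrelated record stays in its own cluster.
    {"id": "668534424", "identifiers": {"FMNH 235034"}},
]
print(cluster(records))
```

Once records are clustered, comparing field values within a cluster (such as the two coordinate precisions above) becomes a straightforward per-cluster summary.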

In a sense, we can regard one record (1009633027) as adding an annotation to the other (887386322), namely adding the DNA sequence KJ168695 (in Darwin Core parlance, "associatedSequences=[KJ168695]").

But the key point here is that GBIF will have to at some point address the issue of massive duplication of data, and in doing so it will create an opportunity to solve the annotation problem as well.

Github and DOIs

In terms of practicalities, it's worth noting that we could use github to manage editing GBIF data, as I've explored in GBIF and Github: fixing broken Darwin Core Archives. Although github might not be ideal (there are some very cool alternatives being developed, such as dat, see also interview with Max Ogden) it has the nice feature that you can publish a release and get a DOI via its integration with Zenodo. So people can work on datasets and create citable identifiers at the same time.


If we consider that a Darwin Core Archive is basically a set of rows of data, then the minimal unit is a single row (corresponding to a single occurrence). This is the level at which some users will operate. They will see an error in GBIF and be able to edit the record (e.g., by adding georeferencing, an identification, etc.). One challenge is how to create incentives for doing this. One approach is to think in terms of nanopublications, which are:
A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
A nanopublication comprises three elements:
  1. The assertion: In this context the Darwin Core record would be the assertion. It might be a minimal record in that, say, it only listed the fields relevant to the annotation.
  2. The provenance: the evidence for the assertion. This might be the DOI of a publication that supports the annotation.
  3. The publication information: metadata for the nanopublication, including a way to cite the nanopublication (such as a DOI), and information on the author of the nanopublication. For example, the ORCID of the person annotating the GBIF record.

As an example, consider GBIF occurrence 668534424 for specimen FMNH 235034, which according to GBIF is a specimen of Rhacophorus reinwardtii. In a recent paper

Matsui, M., Shimada, T., & Sudin, A. (2013, August). A New Gliding Frog of the Genus Rhacophorus from Borneo. Current Herpetology. Herpetological Society of Japan. doi:10.5358/hsj.32.112
Matsui et al. assert that FMNH 235034 is actually Rhacophorus borneensis based on a phylogenetic analysis of a sequence (GQ204713) derived from that specimen. In which case, we could have something like this:
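One way to picture this nanopublication, flattened into a single Darwin Core style row: the occurrence identifiers, names, sequence, and publication DOI below are the ones from this example, but the "nanopub_" column names are invented for illustration, and the ORCID and annotation DOI are explicit placeholders.

```python
# A sketch of the three nanopublication elements as one flat record.
nanopub = {
    # 1. The assertion: a minimal Darwin Core record carrying the edit.
    "occurrenceID": "668534424",
    "catalogNumber": "FMNH 235034",
    "scientificName": "Rhacophorus borneensis",
    "associatedSequences": "GQ204713",
    # 2. The provenance: the publication supporting the assertion.
    "nanopub_evidence": "doi:10.5358/hsj.32.112",
    # 3. The publication information: a citable identifier for the
    #    annotation itself, and the annotator (placeholder values).
    "nanopub_doi": "doi:10.5281/zenodo.0000000",        # placeholder
    "nanopub_author": "https://orcid.org/0000-0000-0000-0000",  # placeholder
}
print(nanopub["scientificName"])
```

Everything here fits in one row of a Darwin Core Archive, which is the point: the annotation travels in the same format as the data it annotates.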

The nanopublication standard is evolving, and has a lot of RDF baggage that we'd need to simplify to fit the Darwin Core model of a flat row of data, but you could imagine having a nanopublication which is a Darwin Core Archive that includes the provenance and publication information, and gets a citable identifier so that the person who created the nanopublication (in the example above I am the author of the nanopublication) can get credit for the work involved in creating the annotation. Using citable DOIs and ORCIDs to identify the nanopublication and its author embeds the nanopublication in the wider citation graph.

Note that nanopublications are not really any different from larger datasets; indeed, we can think of a dataset of, say, 1000 rows as simply an aggregation of nanopublications. However, one difference is that I think GBIF would have to set up the infrastructure to manage the creation of nanopublications (basically collecting the user's input, adding their user id, saving the record, and minting a DOI). Whereas users working with large datasets may well be happy to work with those on, say, github or some other data editing environment, people willing to edit single records are unlikely to want to mess with that complexity.

What about the original providers?

Under this model, the original data provider's contribution to GBIF isn't touched. If a user adds an annotation, it amounts to adding a copy of the record with some differences (corresponding to the user's edits). Now, the data provider may choose to accept those edits, in which case they can edit their own database using whatever system they have in place, and then the next time GBIF re-harvests the data, the original record in GBIF gets updated with the new data (this assumes that data providers have stable ids for their records). Under this approach we free ourselves from thinking about complicated messaging protocols between providers and aggregators, and we also free ourselves from having to wait until an edit is "approved" by a provider. Any annotation is available instantly.


My goal here is to sketch out what I think is a straightforward way to tackle annotation that makes use of what GBIF is already doing (aggregating Darwin Core Archives) or will have to do real soon now (cluster duplicates). The annotated and cleaned data can, of course, live anywhere (and I'm suggesting that it could live on github and be archived on Zenodo), so people who clean and edit data are not simply doing it for the good of GBIF, they are creating data sets that can be used independently and be cited independently. Likewise, even if somebody goes to the trouble of fixing a single record in GBIF, they get a citable unit of work that will be linked to their academic profile (via ORCID).

Another aspect of this approach is that we don't actually need to wait for GBIF to do this. If we adopt Darwin Core Archive as the format for annotations, we can create annotations, mint DOIs, and build our own database of annotated data, with a view to being able to move that work to GBIF if and when GBIF is ready.

Thursday, January 22, 2015

GeoJSON and geophylogenies

For the last few weeks I've been working on a little project to display phylogenies on web-based maps such as OpenStreetMap and Google Maps. Below I'll sketch out the rationale, but if you're in a hurry you can see a live demo here, and some examples below.

The first is the well-known example of Banza katydids from doi:10.1016/j.ympev.2006.04.006, which I used in 2007 when playing with Google Earth.


The second example shows DNA barcodes similar to ABFG379-10 for Proechimys guyannensis and its relatives.



People have been putting phylogenies on computer-based maps for a while, but in most cases these have required stand-alone software, such as Google Earth, and custom formats for encoding geographic information. Despite the obvious appeal of placing trees on maps, and calls for large-scale geophylogeny databases (e.g., doi:10.1093/sysbio/syq043), drawing trees on maps has remained a bit of a niche activity. I think there are several reasons for this:

  1. Drawing trees on maps needs both a tree and geographic localities for the nodes in the tree. The latter are not always readily available, or may be in different databases from the source of phylogenetic data.
  2. There's no accepted standard for encoding geographic information associated with the leaves in a tree, so everyone pretty much invents their own format.
  3. To draw the tree we typically need standalone software. This means users have to download software, instead of working on the web (which is where all the data is).
  4. Geographic formats such as KML (used by Google Earth) are not particularly easy to store and index in databases.

So there are a number of obstacles to making this easy. The increasing availability of geotagged sequences in GenBank (see Guest post: response to "Putting GenBank Data on the Map"), especially DNA barcodes, helps. For the demo I created a simple pipeline to take a DNA barcode, query BOLD for similar sequences, retrieve those, align them, build a neighbour joining tree, annotate the tree with latitudes and longitudes, and encode that information in a NEXUS file.

To lay out the tree on a map (say OpenStreetMap using Leaflet, or Google Maps) I convert the NEXUS file to GeoJSON. There are a couple of problems to solve when doing this. Typically when drawing a phylogeny we compute x and y coordinates for a device such as a computer screen or printer, where these coordinates have equal units and are linear in both horizontal and vertical dimensions. In web maps coordinates are expressed in terms of latitude and longitude, and in the widely-used Web Mercator projection the vertical axis (latitude) is non-linear. Furthermore, on a web map the user can zoom in and out, so pixel-based coordinates only make sense with respect to a given zoom level.

To tackle this I compute the layout of the tree in pixels at zoom level 0, when the web map comprises a single "tile".
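The zoom-0 trick can be sketched with the standard Web Mercator formulas: at zoom 0 the world is one 256-pixel tile, so the tree layout is computed in that pixel space and then inverted back to latitude/longitude for the map library to draw. This is a minimal sketch, not the demo's actual code.

```python
import math

TILE = 256  # a zoom-0 web map is a single 256x256 pixel tile

def latlng_to_pixel(lat, lng):
    """Project latitude/longitude to zoom-0 Web Mercator pixels."""
    x = (lng + 180.0) / 360.0 * TILE
    siny = math.sin(math.radians(lat))
    # Mercator: y grows downward, non-linear in latitude.
    y = (0.5 - math.log((1 + siny) / (1 - siny)) / (4 * math.pi)) * TILE
    return x, y

def pixel_to_latlng(x, y):
    """Invert the projection so tree-node pixels become map coordinates."""
    lng = x / TILE * 360.0 - 180.0
    n = math.pi * (1 - 2 * y / TILE)
    lat = math.degrees(math.atan(math.sinh(n)))
    return lat, lng
```

Because the conversion is done once at zoom 0, the map library's own tile arithmetic handles all subsequent zooming, and the tree scales with the map for free.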


The tile coordinates are then converted to latitude and longitude, so that they can be placed on the map. The map applications take care of zooming in and out, so the tree scales appropriately. The actual sampling localities are simply markers on the map. Another problem is to reduce the visual clutter that results from criss-crossing lines connecting the tips of the tree and the associated sampling localities. To make the diagram more comprehensible, I adopt the approach used by GenGIS to reorder the nodes in the tree to minimise the crossings (see algorithm in doi:10.7155/jgaa.00088). The tree and the lines connecting it to the localities are encoded as "LineString" objects in the GeoJSON file.
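The resulting GeoJSON is straightforward: tree edges and tip-to-locality connectors are LineStrings, localities are Points. A minimal sketch (the coordinates and the "role" property are illustrative; note GeoJSON orders coordinates [longitude, latitude]):

```python
import json

def edge_feature(start, end, role):
    """One tree edge (or connector line) as a GeoJSON Feature."""
    return {
        "type": "Feature",
        "properties": {"role": role},
        "geometry": {
            "type": "LineString",
            "coordinates": [list(start), list(end)],
        },
    }

geophylogeny = {
    "type": "FeatureCollection",
    "features": [
        # A vertical tree edge drawn on the map.
        edge_feature((140.0, -15.0), (140.0, -17.0), "tree-edge"),
        # The line from a tip to its sampling locality.
        edge_feature((140.0, -17.0), (145.86, -17.44), "connector"),
        # The sampling locality itself is just a point marker.
        {
            "type": "Feature",
            "properties": {"role": "locality"},
            "geometry": {"type": "Point", "coordinates": [145.86, -17.44]},
        },
    ],
}

print(json.dumps(geophylogeny, indent=2))
```

Any GeoJSON-aware map library can then render the whole geophylogeny without knowing anything about phylogenetics.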

There are a couple of things which could be done with this kind of tool. The first is to add it as a visualisation to a set of phylogenies or occurrence data. For example, imagine my "million barcode map" having the ability to display a geophylogeny for any barcode you click on.

Another use would be to create a geographically indexed database of phylogenies. There are databases such as CouchDB that store JSON as a native format, and it would be fairly straightforward to consume GeoJSON for a geophylogeny, ignore the bits that draw the tree on the map, and index the localities. We could then search for trees in a given region, and render them on a map.
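The indexing idea can be sketched without any particular database: parse the GeoJSON, skip the LineStrings that merely draw the tree, keep the Point localities, and answer a bounding-box query. The input document and the query box here are illustrative.

```python
import json

# An illustrative geophylogeny document: one drawing line, one locality.
doc = json.loads("""{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"role": "tree-edge"},
     "geometry": {"type": "LineString",
                  "coordinates": [[140.0, -15.0], [145.86, -17.44]]}},
    {"type": "Feature", "properties": {"role": "locality"},
     "geometry": {"type": "Point", "coordinates": [145.86, -17.44]}}
  ]
}""")

def localities(geojson):
    """Yield (lng, lat) for every Point feature, ignoring the tree drawing."""
    for feature in geojson["features"]:
        if feature["geometry"]["type"] == "Point":
            yield tuple(feature["geometry"]["coordinates"])

def in_bbox(points, west, south, east, north):
    """Return the localities falling inside a bounding box."""
    return [(lng, lat) for lng, lat in points
            if west <= lng <= east and south <= lat <= north]

# Search a box around northern Queensland.
print(in_bbox(localities(doc), 140, -20, 150, -10))
```

A real database would replace the linear scan with a spatial index, but the key design point is the same: the one GeoJSON document serves both rendering and search.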

There's still some work to do (I need to make the orientation of the tree optional and there are some edge cases that need to be handled), but it's starting to reach the point where it's fun just to explore some examples, such as these microendemic Agnotecous crickets in New Caledonia (data from doi:10.1371/journal.pone.0048047 and GBIF).


Tuesday, January 20, 2015

Bitcoin, Xanadu, Ted Nelson, and the web that wasn't

A couple of articles in the tech press got me thinking this morning about Bitcoin, Ted Nelson, Xanadu, and the web that wasn't. The articles are After The Social Web, Here Comes The Trust Web and Transforming the web into a HTTPA 'database'. There are some really interesting ideas being explored based on decentralised tracking of resources (including money, think Bitcoin, and other assets, think content). I wonder whether these developments may lead to renewed interest in some of the ideas of Ted Nelson.

I've always had a soft-spot for Ted Nelson and his Xanadu project (see my earlier posts on transclusion and Nature's ENCODE app). To get a sense of what he was after, we can compare the web we have with what Nelson envisaged.

The web we have today:

  1. Links are one-way, in that it's easy to link to a site (just use the URL), but it's hard for the target site to find out who links to it. Put another way, like writing a scientific paper, it's easy to cite another paper, but non-trivial to find out who is citing your own work.
  2. Links to another document are simply launching pads to go to that other document, whether it's still there or not.
  3. Content is typically either "free" (paid for by advertising or in exchange for personal data), or behind a paywall and hence expensive.

Nelson wanted:

  1. Links that were bidirectional (so not only did you get cited, but you knew who was citing you)
  2. "Transclusion", where documents would not simply link (=cite) to other documents but would include snippets of those documents. If you cited another document, say to support a claim, you would actually include the relevant fragment of that document in your own document.
  3. A micropayment system so that if your work was "transcluded" you could get paid for that content.

The web we have is in many ways much easier to build, so Nelson's vision lost out. One-way links are easy to create (just paste in a URL), and the 404 error (that you get when a web page is missing) makes it robust to failure. If a page vanishes, things don't collapse, you just backtrack and go somewhere else.

Nelson had a more tightly linked web in mind. He wanted to keep track of who links to whom automatically. Doing this today is the preserve of big operations such as Google (who count links to rank search results) or the Web of Science (who count citations to rank articles and journals - note that I'm using the web and the citation network pretty much interchangeably in this post). Because citation tracking isn't built into the web, you need to create this feature yourself, and that costs money (and hence nobody provides access to citation data for free).

In the current web, stuff (content) is either given away for "free" (or simply copied and pasted as if it was free), or locked behind paywalls. Free, of course, is never free. We are either handing over data, being the targets of advertising (better targeted the more data we hand over), or we pay for freedom directly (e.g., open access publication fees in the case of scientific articles). Alternatively, we have the paywalls well known to academics, behind which much of the world's knowledge is held (in part because publishers need some way to make money, and there's little middle ground between free and expensive).

Nelson's model envisaged micropayments, where content creators would get small payments every time their content was used. Under the transclusion model, only small bits of your content might be used (in the context of a scientific paper, imagine just a single fact or statement was used). You didn't get everything for free (that would destroy the incentive to create), but nor was everything locked up behind prohibitively expensive paywalls. Nelson's model never took off, in part I suspect because there was simply no way to (a) track who was using the content, and (b) collect micropayments.

What is interesting is that Bitcoin seems to deal with the micropayments issue, and the HTTPA protocol (which uses much the same idea as Bitcoin to keep an audit trail of who has accessed and used data) may provide a mechanism to track usage. How is this going to change the current web? Might there be ways to use these ideas to reimagine academic publishing, which at the moment seems caught between steep open access fees or expensive journal subscriptions?

Friday, January 09, 2015

GBIF, biodiversity informatics and the "platform rant"

Each year about this time, as I ponder what to devote my time to in the coming year, I get exasperated and frustrated that each year will be like the previous one, and biodiversity informatics will seem no closer to getting its act together. Sure, we are putting more and more data online, but we are no closer to linking this stuff together, or building things that people can use to do cool science with. And each year I try and figure out why we are still flailing about and not getting very far. This year, I've settled on the lack of "platforms".

In 2011 Steve Yegge (accidentally) published a widely read document known as the "Google Platforms Rant". It's become something of a classic, and I wonder if biodiversity informatics can learn from this rant (it's long but well worth a read).

One way to think about this is to look at how we build things. In the early days, people would have some data and build a web site:


In the diagram above "dev" is the web developer who builds the site, and "DBA" is the person who manages the data (for many projects this is one and the same person). The user is presented with a web site, and that's the only way they can access the data. If the web site is well designed this typically works OK, but the user will come up against limitations. Why do I have to manually search for each record? How can I combine this data with some other data? These questions lead to some users doing things like screen scraping, anything to get the data and do more than the web site permits (I spend a lot of my time doing exactly this). In contrast, the person (or team) building the site ("dev") can access the data and tools directly.

Eventually some sites realise that they could add value to their users if they added an API, so typically we get something like this:


Now we have an API (yay), but notice that it is completely separate from the web site. Now the site developers have to manage two different things, and two sets of users (web site visitors, and users programming against the API). Because the site and the API are different, and the site gets more users, typically what happens is that the API lacks much of the functionality of the site, which frustrates users of the API. For example, when Mendeley launched its API, its limited functionality and lack of documentation drove me nuts. Similarly, the Encyclopedia of Life (EOL) API is pretty sucky. If anyone from EOL is reading this, for the love of God add user authentication and the ability to create and edit collections to the API. Until you do, you'll never have an ecosystem of apps.

A solution to sucky APIs is "dogfooding":


Dogfooding is the idea that your product is so good you'd use it yourself. In the case of web development, if we build the web site on top of the same API that we expose to users, then the site developers have a strong incentive to make the API well-documented and robust, because their web site runs on the same services. As a result the interests of the web developers and users who are programmers are much more aligned. If a user finds a bug in the API, or the API lacks a feature, it's much more likely to get fixed. An example of a biodiversity informatics project that "gets" dogfooding is GBIF, which has a nice API that powers much of their web site. This is a good example of how to tell if an API is any good, namely, can you recreate the web site yourself just using the API?
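That "recreate the site from the API" test can be sketched against GBIF's public occurrence search endpoint. The endpoint URL is real, but to keep the sketch self-contained no network call is made: `sample_response` is an invented stand-in for the JSON the API would return, and `render_results` is a hypothetical stand-in for the portal's results page.

```python
import json
from urllib.parse import urlencode

API = "https://api.gbif.org/v1/occurrence/search"

def search_url(**params):
    """Build the same query the portal's search page would issue."""
    return API + "?" + urlencode(params)

# Stand-in for the API's JSON response (no network call is made here).
sample_response = json.loads("""{
  "count": 2,
  "results": [
    {"key": 887386322,
     "scientificName": "Psilogramma menephron Cramer, 1780"},
    {"key": 1009633027,
     "scientificName": "Psilogramma menephron Cramer, 1780"}
  ]
}""")

def render_results(response):
    """Rebuild the rows a result page would show, purely from API output."""
    return [f'{r["key"]}: {r["scientificName"]}'
            for r in response["results"]]

print(search_url(scientificName="Psilogramma menephron", limit=20))
print("\n".join(render_results(sample_response)))
```

If a page of the site cannot be rebuilt this way, that gap is exactly where the API falls short of the dogfooding standard.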

But the example above leaves one aspect of the whole system still intact and not accessible to users. Typically a company or organisation has data, tools, and processes that it uses to manage whatever is central to its operations. These are kept separate from users, who only get to access these indirectly through the web site or the API.

A "platform" takes things one step further. Steve Yegge summarises Jeff Bezos' memo that outlined Amazon's move to a platform:

  1. All teams will henceforth expose their data and functionality through service interfaces.
  2. Teams must communicate with each other through these interfaces.
  3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
  4. It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols -- doesn't matter. Bezos doesn't care.
  5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
  6. Anyone who doesn't do this will be fired.
  7. Thank you; have a nice day!
All the core bits of infrastructure that powered Amazon were to become services, and different bits of Amazon could only talk to each other through these services. The point of this is that it enabled Amazon to expose its infrastructure to the outside world (AKA paying customers) and now we have Amazon cloud services for storing data, running compute jobs, and so on. By exposing its infrastructure as services, Amazon now runs a big chunk of the startup economy. By insisting that Amazon itself uses these services (dogfooding at the infrastructure level), Amazon ensures that this infrastructure works (because its own business depends on it).


There are some things Google does that are like a platform (despite the complaints in the "Google Platforms Rant"). For example, you could imagine that most workers at Google use tools such as Google Docs to create and share documents. Likewise, Google Scholar is unlikely to be a simple act of altruism. If you have a team of world class researchers you need a tool that enables them to find existing research. Google Scholar does this. If you then expose it to the outside world you get more users, and an incentive for commercial publishers to open up their paywall journals to being indexed by Google's crawlers, an incentive that would be missing if Scholar was purely an internal service.

Now, giant companies like Amazon and Google might seem a world away from biodiversity informatics, but I think there are things we can learn from this. Looking around, I think there are other examples of platforms that may seem closer to home. For example, the NCBI runs GenBank and PubMed, and these are very like platforms. GenBank provides tools, such as BLAST, to the user community, but also uses them internally to cluster sequences into related sets. Consider PubMed, which has gone from a simple index to the biomedical literature to a publishing platform. PubMed has driven the standardisation of XML across biomedical publishers. It is quite possible to visit the NCBI site, explore data, then read full text for the associated publications in PubMed Central, without ever leaving the NCBI site. No wonder some commercial publishers are deeply worried about PubMed Central.

A key thing about platforms is that the people running the platform have a deep interest in many of the same things as the users of that platform (note the "users" scattered all over the platform diagram above). Instead of user being a separate category that you try and serve by figuring out what they want, developers are users too.

To try and flesh this out a little more, what would a "taxonomic" platform look like? At the moment, we have lots of taxonomic web sites that pump out lists of names and little else. This is not terribly useful. If we think about what goes into making lists of names, it requires access to the scientific literature, it requires being able to read that literature and extract statements about names (e.g., this is the original description, these two names are synonyms, etc.), and it requires some way of summarising what we know about those names and the taxa that we label with those names. Typically these are all things that happen behind the scenes, then the user simply gets a list of names. A platform would expose all of the data, tools, and processes that went into making that list. It would provide the literature in both human and computer readable forms, it would provide tools for extracting information, tools to store knowledge about those names, and tools to make inferences using that knowledge. All of these would be exposed to users. And these same services and tools would be used by the people building those services and tools.

This last point means that you also need people working on the same problems as "users". For example, consider something like GBIF. At the moment GBIF consumes output of taxonomic research (such as lists of names) and tries to make sense of these before serving them back to the community. There is little alignment between the interests of taxonomists and GBIF itself. For GBIF to become a taxonomic platform, it would need to provide the data, tools and services for people to do taxonomic research, and ideally it would actually have taxonomists working at GBIF using those tools (these taxonomists could, for example, be visiting fellows working on particular taxa, rather than permanent employees). These tools would greatly help the taxonomic community, but also help GBIF make sense of the millions of names it has to interpret.

It's important to note here that the goal of the platform is NOT to "help" users - that simply reinforces the distinction between you and the "users". Instead it is to become a user. You may have more resources, and work on a different scale (few businesses that Amazon's services support will be anything like as big as Amazon itself), but you are ultimately "just" another user.