iPhylo: What I'll be working on in 2014: knowledge graphs and Google forests

Roderic D. M. Page

Wednesday, January 15, 2014

More for my own benefit than anything else, I've decided to list some of the things I plan to work on this year. If nothing else, it may make sobering reading this time next year.

A knowledge graph for biodiversity

Google's introduction of the "knowledge graph" gives us a happy phrase to use when talking about linking stuff together. It doesn't come with all the baggage of the "semantic web", or the ambiguity of "knowledge base". The diagram below is my mental model of the biodiversity knowledge graph (this comes from http://dx.doi.org/10.7717/peerj.190, but I sketched most of this for my Elsevier Challenge entry in 2008, see http://dx.doi.org/10.1038/npre.2008.2579.1).

[Figure 1: the biodiversity knowledge graph]

Parts of this knowledge graph are familiar: articles are published in journals, and have authors. Articles cite other articles (represented by a loop in the diagram below). The topology of this graph gives us citation counts (the number of times an article has been cited), impact factor (citations for articles in a given journal), and author-based measures such as the h-index (a function of the distribution of citations across the articles you have authored). Beyond simple metrics, this graph also gives us the means to track the provenance of an idea (by following the citation trail).

[Figure: Publication]
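
To make the point about topology concrete, here is a minimal Python sketch of how citation counts, an h-index, and a citation trail all fall out of the same list of "cites" edges. Everything in it is an assumption for illustration: the article identifiers, the author names, and the citation pairs are invented toy data, and the function names are mine rather than part of BioNames or any other service.

```python
from collections import defaultdict

# Hypothetical toy data: each (citing, cited) pair is one edge in the citation graph.
citations = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("D", "C"), ("D", "B"), ("E", "A"),
]

# Hypothetical authorship: author -> articles they wrote.
authored = {"author_1": ["A", "B", "C"], "author_2": ["D", "E"]}


def citation_counts(edges):
    """Count incoming citation edges for each cited article."""
    counts = defaultdict(int)
    for _citing, cited in edges:
        counts[cited] += 1
    return dict(counts)


def h_index(article_ids, counts):
    """Largest h such that h of the articles have at least h citations each."""
    ranked = sorted((counts.get(a, 0) for a in article_ids), reverse=True)
    h = 0
    for rank, cited_by in enumerate(ranked, start=1):
        if cited_by >= rank:
            h = rank
        else:
            break
    return h


def citation_trail(start, edges):
    """Every article reachable from `start` by following citations."""
    cites = defaultdict(list)
    for citing, cited in edges:
        cites[citing].append(cited)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for cited in cites[node]:
            if cited not in seen:
                seen.add(cited)
                stack.append(cited)
    return seen


counts = citation_counts(citations)
print(counts)
print(h_index(authored["author_1"], counts))
print(sorted(citation_trail("E", citations)))
```

The same traversal would work on other edge types, which is part of the appeal of treating all of this as one graph.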

The next step is to grow this graph to include the other things we care about (e.g., taxa, taxon names, specimens, sequences, phylogenies, localities, etc.).

BioNames

I spent a good deal of last year building BioNames (for background see my blog posts or read the paper in PeerJ http://dx.doi.org/10.7717/peerj.190). BioNames represents a small corner of the biodiversity knowledge graph, namely taxonomic names and their associated publications (with the added chocolatey goodness of links to taxon concepts and phylogenies). In 2014 I'll continue to clean this data (I seem to be forever cleaning data). So far BioNames is restricted to animal names, but now that the plant folks have relaxed their previously restrictive licensing of plant data (see post on TAXACOM) I'm looking at adding the million or so plant names (once I've linked as many as possible to digital identifiers for the corresponding publications).

Spatial indexing

Now that I've become more involved in GBIF I'm spending more time thinking about spatial indexing, and our ability to find biodiversity data on a map. There's a great Google ad that appeared on UK TV late last year. In it, Julian Bayliss recounts the use of Google Earth to discover virgin rainforest (the "Google forest") on Mount Mabu in Mozambique.

It's a great story, but I keep looking at this and wondering, "how did we know that we didn't know anything about Mount Mabu?" In other words, can we go to any part of the world and see what we know about that area? GBIF goes a little way there with its specimen distribution maps, which give some idea of what is now known from Mount Mabu (although the map layers used by GBIF are terrible compared to what Google offers).

[Figure: Mabu]

But I want to be able to see all the specimens now known from this region (including the new species that have been discovered, e.g. see http://dx.doi.org/10.1007/s12225-011-9277-9 and http://dx.doi.org/10.1080/21564574.2010.516275). Why can't I have a list of publications relevant to this area (e.g., species descriptions, range extensions, ecological studies, conservation reports)? What about DNA sequences from material in this region (e.g., from organismal samples, DNA barcodes, metagenomics, etc.)? If GBIF is to truly be a "Global Biodiversity Information Facility" then I want it to be able to provide me with a lot more information than it currently does. The challenge is how to enable that to happen.
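
As a rough sketch of what "show me everything known about this area" could look like, here is a toy spatial index in Python. It buckets records of any kind (specimens, publications, sequences) into fixed-size latitude/longitude cells and answers bounding-box queries. This is an assumption-laden illustration and not how GBIF indexes its data: the `GridIndex` class, the half-degree cell size, the coordinates (only roughly in the vicinity of Mount Mabu), and all the records are hypothetical.

```python
import math
from collections import Counter, defaultdict


class GridIndex:
    """Toy spatial index: records are bucketed into square lat/lon cells."""

    def __init__(self, cell_size=0.5):
        self.cell_size = cell_size      # cell width in degrees (assumed value)
        self.cells = defaultdict(list)  # (row, col) -> list of records

    def _cell(self, lat, lon):
        return (math.floor(lat / self.cell_size),
                math.floor(lon / self.cell_size))

    def add(self, lat, lon, kind, label):
        self.cells[self._cell(lat, lon)].append((lat, lon, kind, label))

    def query(self, min_lat, min_lon, max_lat, max_lon):
        """Return all records inside the bounding box (inclusive)."""
        row0, col0 = self._cell(min_lat, min_lon)
        row1, col1 = self._cell(max_lat, max_lon)
        hits = []
        for row in range(row0, row1 + 1):
            for col in range(col0, col1 + 1):
                # Only cells overlapping the box are visited; records in
                # those cells are then checked against the exact bounds.
                for rec in self.cells.get((row, col), []):
                    lat, lon = rec[0], rec[1]
                    if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                        hits.append(rec)
        return hits


index = GridIndex()
# Hypothetical records, placed near approximate coordinates for Mount Mabu.
index.add(-16.29, 36.39, "specimen", "hypothetical specimen record")
index.add(-16.31, 36.41, "publication", "hypothetical species description")
index.add(-16.28, 36.40, "sequence", "hypothetical DNA barcode")
index.add(-25.97, 32.57, "specimen", "hypothetical record from elsewhere")

mabu_box = (-16.5, 36.2, -16.1, 36.6)  # min_lat, min_lon, max_lat, max_lon
summary = Counter(kind for _, _, kind, _ in index.query(*mabu_box))
print(dict(summary))

# An empty result is informative too: it marks a place we know nothing about.
print(index.query(-10.0, 30.0, -9.0, 31.0))
```

The point of mixing record kinds in one index is that the question is about a place rather than a data type, and an empty cell is itself an answer to "how did we know that we didn't know?"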