iPhylo: GBIC2012

Roderic D. M. Page

Showing posts with label GBIC2012. Show all posts

Thursday, April 18, 2013

Thoughts on GBIC 2012 and a vision of the future of biodiversity informatics

This seems to be the season for big, arm-wavy documents about the future of biodiversity informatics (see A decadal view of biodiversity informatics: challenges and priorities). An equivalent document is being drafted based on the Global Biodiversity Informatics Conference (GBIC 2012) conference. Writing these documents is hard work, they have to balance a set of conflicting visions, predict the future, and communicate a coherent plan to people who either could help make this happen, or feel they have a stake in the outcome.

Leaving all those constraints behind, and waving arms wildly, here's one take on the future of biodiversity informatics. I see three themes.

1. Knowing what we know

We have a limited grasp of how much we actually know, and crap tools to summarise this knowledge. I want a Google Analytics for biodiversity data where I can see at a glance the current state of our knowledge (e.g., what is the rate of sequencing of environmental samples in the Mediterranean? How much of Indonesia's amphibian fauna is in protected areas?). These are fairly trivial queries. If Google can analyse web traffic from sites being hit over a million times per day ( ~ 365 million hits per year) we can do the same thing on GBIF-scale databases. There is huge scope here for cool visualisation of the growth of our knowledge, such as this:

If biologists were explorers (Mammalia)... from Andrew W Hill on Vimeo.

Imagine the GBIF classification like this:

filesystem visualisation from wonderful websolutions on Vimeo.

2. Life stream

Terrible title, but this is where we monitor change, both "organic" and anthropogenic. This is where we use data mining to do a sentiment analysis of the biosphere, looking to detect changes such as outbreaks of disease, invasive species, etc. This builds on 1 but focusses on change. Imagine a "news service" for biology along the lines of tools available to financial markets (e.g., Silobreaker):

This is where we interface with decision makers, in the sense that Braulio Dias's statement "I am convinced that the lack of adequate biodiversity monitoring is at the heart of our difficulties to make convincing arguments" is true, this tackles that question.

3. Modelling the biosphere

Time to model all life on Earth (http://dx.doi.org/10.1038/493295a) is our equivalent of a moon shot (oh how I hate that analogy). Purves et al. have made the case, this is the task that will galvanise people outside the taxonomy/biodiversity community. This is real megascience (1. is data collection, 2. is data mining and analysis). Climate modellers and oceanographers get to do this:

Can we do the same?

Monday, July 23, 2012

Microbiome as climate, macrobiome as weather, and a global model of biodiversity

Half-baked idea time. Thinking about projects such as the Earth Microbiome Project and Genomic Observatories, the recent GBIC2012 meeting (I'm still digesting that meeting), and mulling over the book A Vast Machine I keep thinking about the possible parallels between climate science and biodiversity science.

One metaphor from "A Vast Machine" is the difference between "global data" and "making data global". Getting data from around the world ("global data") is one challenge, but then comes making that data global:

building complete, coherent, and consistent global data sets from incomplete, inconsistent, and heterogeneous datasources

The focus of GBIF's data portal is global data, bringing together specimen records and observations from all around the world. This is global data, but one could argue that it's not yet ready to be used for most applications. For example, GBIF doesn't give you the geographic distribution of a given species, merely where it's been recorded from (based on that subset of records that have been digitised). That's a very important start, but if we had for each species an estimated distribution based on museum records, observations, published maps, together with habitat modelling, then we'd be closer to a dataset that we could use to tackle key questions about the distribution of biodiversity.

EMP green small

But if we continue with the theme that microbiology is the dark matter of biology, and if we look at projects like the Earth Microbiome Project, then we could argue that focussing on eukaryote, particularly macro-eukaryote such as plants, fungi, and animals, may be a mistake. To use a crude analogy, perhaps we have been focussing on the big phenomena (equivalent to thunder storms, flash floods, tornados, etc.) rather than the underlying drivers (equivalent to climatic processes such as those captured in global climate models). Certainly, any attempt to model the biosphere is going to have to include the microbiome, and indeed perhaps the microbiome would be enough to have a working model of the biosphere?

I'm simply waving my arms around here (no, really?), but it's worth thinking about whether the macroecology that conservation and biodiversity focusses on is actually the important thing to consider if you want to model fundamental biological processes. Might macro-organisms be like the weather, and the microbiome is like the climate. As humans we notice the weather, because it is at a scale that affects us directly. But if the weather is a (not entirely predictable) consequence of the climate, what is the equivalent of global climate model for biodiversity?

Friday, July 06, 2012

Post GBIC2012 thoughts

I'm back from Copenhagen and GBIC2012. The meeting spanned three fairly intense days (with the days immediately before and after also working days for some of us), and was run by a group of facilitators lead by Natasha Walker, who were described us as "an interesting (and delightfully brainy, if sometimes scatty) group of academics, researchers, museum managers and people close to policy...". I've attempted to capture tweets about the meeting using Storify.

There will be a document (perhaps several) based on the meeting, but until then here are a few quick thoughts. Note that the comments below are my own and you shouldn't read into this anything about what directions the GBIC document(s) will actually take.

Microbiology rocks

Highlight of the first day was Robert J. Robbin's talk which urged the audience to consider that life was mostly microbial, that the the things most people in the room cared about were actually merely a few twigs on the tree of life, that the tree of life didn't actually exist anyway, and many of the concepts that made sense for multicellular organisms simply didn't apply in the microbial world. Basically it was a homage to Carl Woese (see also Pace et al. 2012 doi:10.1073/pnas.1109716109) and a wake up call to biodiversity informaticians to stop viewing the world through multicellular eyes. (You can find all the keynotes from the first day here).

F1 large

From Pace, N. R. (1997). A Molecular View of Microbial Diversity and the Biosphere. Science, 276(5313), 734–740. doi:10.1126/science.276.5313.734

Sequences rule

The future of a lot of biodiversity science belongs to sequences, from simple DNA barcoding as a tool for species discovery and identification, metabarcoding as a tool for community analysis, to comparisons of metabolic pathways and beyond. The challenge for classical biodiversity informatics is how to engage with this, and to what extent we should try and map between, say sequences and classical taxa, or whether it might make more sense (gasp) to abandon the taxonomic legacy and move on. Perhaps are more nuanced response is that the point of connection between sequences and classical biodiversity data is unlikely to be at the level of taxonomic names (which are mostly tags for collections of things that look similar) but at the level of specimens and observations.

Ontologies considered harmful

This is my own particular hobby horse. Often the call would come "we need an ontology", to which I respond read Ontology is Overrated: Categories, Links, and Tags. I have several problems with ontologies. The first is that they are too easy to make and distract from the real problem. From my perspective a big challenge is linking data together, that is going from

Let's leave aside what "A" and "B" are (I suspect it matters less than people think), once we have the link then we can can start to do stuff. From my perspective, what ontologies give us is basically this:

So now we know the "type" of the link (e.g., "is a part of", "cites approvingly", etc.). I'm not arguing that this isn't useful to have, but if you don't have the network of links then typing the links becomes an idle exercise.

To give an example, the web itself can be modelled as simply nodes connected by links, ignoring the nature of the links between the web pages. The importance of those links can be inferred later from properties of the network. To a first approximation this is how Google works, it doesn't ask what the links "mean" it simply investigates the connections to determine how important each web page is. In the same way, we build citation networks without bothering to ask the nature of the citation (yes I know there are ontologies for citations, but anyone willing to bet how widely they'll be adopted?).

My second complaint is that building ontologies is easy, "easy" in the sense that get a bunch of people together, they squabble for a long time about terminology, and out comes an ontology. Maybe, if you're lucky, someone will adopt it. The cost of making ontologies, and indeed of adopting them is relatively low (although it might not seem like it at the time). The cost of linking data is, I'd argue, higher, because it requires that you trust someone else's identifiers to the extent that you use them for things you care about deeply. Consider the citation network that is emerging from the widespread adoption of DOIs by the publishing industry. Once people trust that the endpoints of the links will survive, then the network starts to grow. But without that trust, that leap of faith, there's no network (unless you have enough resources to build the whole thing internally yourself, which is what happened with the closed citation network owned by Thomson Reuters). It's much easier to silo the data using unique identifiers than it is to link to other data (it's a variant of the "not invented here" syndrome).

Lastly, ontologies can have short lives. They reflect a certain world view that can become out of date, or supplanted if the relationships between things that the ontology cares about can be computed using other data. For example, biological taxonomy is a huge ontology that is rapidly being supplanted by phylogenetic trees computed from sequence (and other) data (compare the classification used by flagship biodiversity projects like GBIF and EOL with the Pace tree of life shown above). Who needs an ontology when you can infer the actual relationships? Likewise, once you have GPS the value of a geographic ontology (say of place names) starts to decline. I can compute if I'm on a mountain simply by knowing where I am.

I'm not saying ontologies are always bad (they're not), nor that they can't be cool (they can be), I'm just suggesting that they aren't the first thing you need. And they certainly aren't a prerequisite for linking stuff together.

Google flu trends

Perhaps the most interesting idea that emerged was the notion of intelligently detecting changes in biodiversity (which is the kind of thing a lot of people want to know) in the way analogous to Google.org's Flu Trends uses flu-related search terms to predict flu outbreaks:

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2008). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014. doi:10.1038/nature07634

Could we do something like this for biodiversity data? For various reasons this suggestion become known at GBIC2012 as the "Heidorn paradigm".

Thinking globally

One challenge for a meeting like GBIC 2012 is scope. There's so much cool stuff to think about. From my perspective, a useful filter is to ask "what will happen anyway?" In other words, there is a lot of stuff (for example the growth of metabarcoding) that will happen regardless of anything the biodiversity informatics community does. People will make taxon-specific ontologies for organismal traits, digitise collections, assess biodiversity, etc. without necessarily requiring an entity like GBIF. The key question is "what won't happen at a global scale unless GBIF (or some other entity) gets involved?"

A Vast Machine

Lastly, in one session Tom Moritz mentioned a book that he felt we could learn from (A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming). The book recounts the history of climatology and its slow transition to a truly global science. I've started to read it, and it's fascinating to see the interplay between early visions of the future, and the technology (typically driven by military or large-scale commercial interests) that made possible the realisation of those visions. This is one reason why predicting the future is such a futile activity, the things that have the biggest effect come from unexpected sources, and effect things in ways it's hard to anticipate. On a final note, it took about a minute from the time from the time Tom mentioned the book to the time I had a copy from Amazon in the Kindle app on my iPad. Oh that accessing biodiversity data were that simple.

Friday, June 29, 2012

Planet management, GBIF, and the future of biodiversity informatics

Earth russia large verge medium landscape

Next week I'm in Copenhagen for GBIC, the Global Biodiversity Informatics Conference. The goal of the conference is to:

...convene expertise in the fields of biodiversity informatics, genomics, earth observation, natural history collections, biodiversity research and policy needed to set such collaboration in motion.

The collaboration referred to is the agreement to mobilise data and informatics capability to met the Aichi Biodiversity Targets.

I confess I have mixed feelings about the upcoming meeting. There will be something like 100 people attending the conference, with backgrounds ranging from pure science to intergovernmental policy. It promises to be interesting, but whether a clear vision of the future of biodiversity informatics will emerge is another matter.

GBIC is part of the process of "planet management", a phrase that's been around for a while, but I only came across in the Bowker's essay "Biodiversity Datadiversity"¹:

Bowker, G. C. (2000). Biodiversity Datadiversity. Social Studies of Science, 30(5), 643–683. doi:10.1177/030631200030005001

Bowker's essay is well worth a read, not least for the choice quotes such as:

Each particular discipline associated with biodiversity has its own incompletely articulated series of objects. These objects each enfold an organizational history and subtend a particular temporality or spatiality. They frequently are incompletely articulated with other objects, temporalities and spatialities — often legacy versions, when drawing on non-proximate disciplines. If one wants to produce a consistent, long-term database of biodiversity-relevant information the world over, all this sounds like an unholy mess. At the very least it suggests that global panopticons are not the way to go in biodiversity data. (p. 675, emphasis added)

and

I have not, in general, questioned the mania to name which is rife in the circles whose work I have described. There is no absolutely compelling connection between the observation that many of the world’s species are dying and the attempt to catalogue the world before they do. If your house is on ﬁre, you do not necessarily stop to inventory the contents before diving out the window. However, as Jack Goody (1977) and others have observed, list-keeping is at the heart of our body politic. It is also, by extension, at the heart of our scientific strategies. Right or wrong, it is what we do. (p. 676, emphasis added)

Given that I'm a fan of the notion of a "global panopticon", and spend a lot of time fussing with lists of names, I find Bowker's views refreshing. Meantime, roll on GBIC2012.

1. Bowker cites Elichirigoity as a source of the term "planet management":

Fernando Elichirigoity (1999), Planet Management: Limits to Growth,
Computer Simulations, and the Emergence of Global Spaces (Evanston, IL: Northwestern
University Press). ISBN 0810115875 (Google Books oP3wVnKpGDkC).

From the limited Google preview, and the review by Edwards, this looks like an interesting book:

Edwards, P. (2000). Book Review:Planet Management: Limits to Growth, Computer Simulation, and the Emergence of Global Spaces Fernando Elichirigoity. Isis, 91(4), 828. doi:10.1086/385020 (PDF here)