iPhylo: July 2012

Roderic D. M. Page

Tuesday, July 24, 2012

Dear GBIF, please stop changing occurrenceIDs!

If we are ever going to link biodiversity data together we need to have some way of ensuring persistent links between digital records. This isn't going to happen unless people take persistent identifiers seriously.

I've been trying to link specimen codes in publications to GBIF, with some success, so imagine my horror when it started to fall apart. For example, I recently added this paper to BioStor:

A remarkable new asterophryine microhylid frog from the mountains of New Guinea. Memoirs of The Queensland Museum 37: 281-286 (1994) http://biostor.org/reference/105389

This paper describes a new frog (Asterophrys leucopus) from New Guinea, and BioStor has extracted the specimen code QM J58650 (where "QM" is the abbreviation for Queensland Museum), which according to the local copy of GBIF data that I have, corresponds to http://data.gbif.org/occurrences/363089399/. Unfortunately, if you click on that link GBIF denies all knowledge (you get bounced to the search page). After a bit of digging I discover that specimen is now in GBIF as http://data.gbif.org/occurrences/478001337/. At some point GBIF has updated its data and the old occurrenceID for QM J58650 (363089399) has been deleted. Noooo!

Looking at the old record I have there is an additional identifier:

urn:catalog:QM: Herpetology:J58650

This is a URN, and it's (a) unresolvable and (b) invalid as it contains a space. This is why URNs are useless. There's no expectation they will be resolvable hence there's no incentive to make sure they are correct. It's as much use as writing software code but not bothering to run it (because surely it will work, no?).
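
To make the "invalid as it contains a space" point concrete, here is a minimal Python sketch of a syntax check. It assumes a simplified reading of the URN grammar in RFC 2141 ("urn:", a namespace identifier, then a namespace-specific string); the pattern and the function name are illustrative, not part of any GBIF or OZCAM tooling, and passing the check says nothing about whether a URN can be resolved.

```python
import re

# Simplified approximation of the RFC 2141 grammar: "urn:" NID ":" NSS.
# Assumption: the characters RFC 2141 marks as reserved ("/", "?", "#")
# are deliberately left out to keep the sketch short.
URN_PATTERN = re.compile(
    r"urn:"
    r"[A-Za-z0-9][A-Za-z0-9-]{0,31}:"  # namespace identifier, at most 32 characters
    r"(?:[A-Za-z0-9()+,\-.:=@;$_!*']|%[0-9A-Fa-f]{2})+",  # namespace-specific string
    re.IGNORECASE,
)


def is_syntactically_valid_urn(candidate: str) -> bool:
    """Return True if the whole string matches the simplified URN pattern."""
    return URN_PATTERN.fullmatch(candidate) is not None


# The identifier from the old GBIF record fails because of the space.
print(is_syntactically_valid_urn("urn:catalog:QM: Herpetology:J58650"))
# The same identifier with the space removed passes the syntax check.
print(is_syntactically_valid_urn("urn:catalog:QM:Herpetology:J58650"))
```

A check like this is cheap to run when records are ingested, which is rather the point: nobody runs it, because nobody expects the URN to be used.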

The GBIF record http://data.gbif.org/occurrences/478001337/ contains a UUID as an alternative identifier:

bc58ce6b-3cc3-459a-9f5b-4a70a026afbe

If you Google this you discover a record in the Atlas of Living Australia http://biocache.ala.org.au/occurrences/bc58ce6b-3cc3-459a-9f5b-4a70a026afbe, which also lists the URN from the now deleted GBIF record http://data.gbif.org/occurrences/363089399/.

I'm guessing that at some point the OZCAM data provided to GBIF was updated and instead of updating data for existing occurrenceIDs the old ones were deleted and new ones created (possibly because OZCAM switched from URNs to UUIDs as alternative identifiers). Whatever the reason, I will now need to get a new copy of GBIF occurrence data and repeat the linking process. Sigh.

If we are ever going to deliver on the promise of linking biodiversity data together we need to take identifiers seriously. Meantime I need to think about mechanisms to handle links that disappear on a whim.

Monday, July 23, 2012

Microbiome as climate, macrobiome as weather, and a global model of biodiversity

Half-baked idea time. Thinking about projects such as the Earth Microbiome Project and Genomic Observatories, the recent GBIC2012 meeting (I'm still digesting that meeting), and mulling over the book A Vast Machine I keep thinking about the possible parallels between climate science and biodiversity science.

One metaphor from "A Vast Machine" is the difference between "global data" and "making data global". Getting data from around the world ("global data") is one challenge, but then comes making that data global:

building complete, coherent, and consistent global data sets from incomplete, inconsistent, and heterogeneous data sources

The focus of GBIF's data portal is global data, bringing together specimen records and observations from all around the world. This is global data, but one could argue that it's not yet ready to be used for most applications. For example, GBIF doesn't give you the geographic distribution of a given species, merely where it's been recorded from (based on that subset of records that have been digitised). That's a very important start, but if we had for each species an estimated distribution based on museum records, observations, published maps, together with habitat modelling, then we'd be closer to a dataset that we could use to tackle key questions about the distribution of biodiversity.


But if we continue with the theme that microbiology is the dark matter of biology, and if we look at projects like the Earth Microbiome Project, then we could argue that focussing on eukaryotes, particularly macro-eukaryotes such as plants, fungi, and animals, may be a mistake. To use a crude analogy, perhaps we have been focussing on the big phenomena (equivalent to thunder storms, flash floods, tornados, etc.) rather than the underlying drivers (equivalent to climatic processes such as those captured in global climate models). Certainly, any attempt to model the biosphere is going to have to include the microbiome, and indeed perhaps the microbiome would be enough to have a working model of the biosphere?

I'm simply waving my arms around here (no, really?), but it's worth thinking about whether the macroecology that conservation and biodiversity focus on is actually the important thing to consider if you want to model fundamental biological processes. Might macro-organisms be like the weather, and the microbiome like the climate? As humans we notice the weather, because it is at a scale that affects us directly. But if the weather is a (not entirely predictable) consequence of the climate, what is the equivalent of a global climate model for biodiversity?

Friday, July 20, 2012

Figshare and F1000 integrate data into publication: could TreeBASE do the same?

Quick thoughts on the recent announcement by figshare and F1000 about the new journals being launched on the F1000 Research site. The articles being published have data sets embedded as figshare widgets in the body of the text, instead of being, say, a static table. For example, the article:

Oliver, G. (2012). Considerations for clinical read alignment and mutational profiling using next-generation sequencing. F1000 Research. doi:10.3410/f1000research.1-2.v1

has a widget that looks like this:

[Image: figshare widget embedded in the article]

You can interact with this widget to view the data. Because the data are in figshare those data are independently citable, e.g. the dataset "Simulated Illumina BRCA1 reads in FASTQ format" has a DOI http://dx.doi.org/10.6084/m9.figshare.92338.

Now, wouldn't it be cool if TreeBASE did something similar? Imagine if uploading trees to TreeBASE were easy, and that you didn't have to have published yet, you just wanted to store the trees and make them citable. Imagine if TreeBASE had a nice tree viewer (no, not a Java applet, a nice viewer that uses SVG, for example). Imagine if you could embed that tree viewer as a widget when you published your results. It's a win all round. People have an incentive to upload trees (nice viewer, place to store them, and others can cite the trees because they'd have DOIs). TreeBASE builds its database a lot more quickly (make it dead easy to upload trees), and then as more publishers adopt this style of publishing TreeBASE is well placed to provide nice visualisations of phylogenies: pre-packaged, interactive, and citable. And let's not stop there, how about a nice alignment viewer? Perhaps this is something the currently rather moribund PLoS Currents Tree of Life could think about supporting?

Tuesday, July 17, 2012

Building a BHL Africa: BHL in a box

Was going to post this as a comment on the BHL blog but they use Blogger's native comment system, which is horrible, and it refused to accept my comment (yes, yes, I'm sure it did that on grounds of taste). I read the recent post Building a BHL Africa and couldn't believe my eyes when I read the following:

the "BHL in a Box" concept was highly desired. This would entail creating interactive CDs of BHL content for distribution in areas where internet access is unreliable or unavailable.

CDs! Really? Surely this is crazy!? You want to use an obsolete technology that requires additional obsolete technology to ship BHL around Africa? Why not ship relevant parts of BHL on iPads? Lots more storage space than CDs, built-in interactivity (obviously need to write an app, but could use HTML + Javascript as a starting point), long battery life, portable, comes with 3G support if needed. I'll be the first to admit that my knowledge of Africa is about zero, but given that mobile devices are common, mobile networks are fairly well developed, and tablets are making inroads (see iPad has become a big factor in African business) surely "BHL mobile" is the way to go to provide "BHL in a box", not CDs.

Why not develop an app that stores BHL content on a device like an iPad, then distribute those? Support updating the content over the network so the user isn't stuck with content they no longer need. In effect, something like Amazon's Kindle app or iBooks would do the trick. You'd need to compress BHL content to keep the size down (the images BHL currently displays on its web site could be made a lot smaller) but this is doable. Indeed, BHL Africa could be an ideal motivation to move BHL to platforms such as phones and tablets, where at the moment users have to struggle with a website that makes no concessions to those devices.

Postscript
Of course, it doesn't have to be the iPad as such. Imagine if BHL published books and articles on Amazon, then used Kindle to deliver content physically (i.e., ship Kindles), and anyone else could access it directly from Amazon using their Kindle (or Kindle app on iPad).

Friday, July 13, 2012

Sometimes the mess taxonomy creates drives me nuts

Playing with some sequence data I found numerous Plasmodium sequences from the following paper:

Werner, E. B., Taylor, W. R., & Holder, A. A. (1998). A Plasmodium chabaudi protein contains a repetitive region with a predicted spectrin-like structure. Molecular and Biochemical Parasitology, 94(2), 185–196. doi:10.1016/S0166-6851(98)00067-X

These sequences (e.g., U43145) give the host as Thamnomys rutilans. You'd think it would be fairly easy to learn more about this animal, given that it hosts a relative of the cause of malaria in humans, and indeed there are a number of biomedical papers that come up in Google, e.g.:

Landau, I., & Chabaud, A. (1994). Advances in Parasitology (Vol. 33, pp. 49–90). Elsevier BV. doi:10.1016/S0065-308X(08)60411-X

Killick-Kendrick, R. (1968). Malaria parasites of Thamnomys rutilans (Rodentia, Muridae) in Nigeria. Bulletin of the World Health Organization, 38(5), 822–824. PMC2554675

Google also tells me that Thamnomys rutilans is an African rodent (e.g., 6.1.6. Rodent malaria), but NCBI has no sequences for "Thamnomys rutilans", and GBIF has no data on its distribution. If I search Mammal Species of the World I get (literally) "nothing found ...".

So, this is an African rodent, host to Plasmodium, and we know nothing about it? A bit of Googling, a trip to Wikipedia and Google Books reveals that Thamnomys rutilans is a synonym of Grammomys rutilans, but it is now called Grammomys poensis because the original name (Mus rutilans Peters 1876) is a junior homonym of Mus rutilans Olfers, 1818 (simples). You can see the original description of Mus rutilans Peters 1876 in BioStor http://biostor.org/reference/105261 (this took some tracking down, but that's another story):

[Image: original description of Mus rutilans Peters 1876]

The original description of Mus rutilans Olfers, 1818 is given by The description of a new species of South American hocicudo, or long-nose mouse, genus Oxymycterus (Sigmodontinae, Muroidea), with a critical review of the generic content as:

Olfers, I. 1818. Bemerkungen zu Illiger's Ueberblick der Saugethiere nach ihrer Betheilung über die Welttheile rüchsichtlich der Südamerikanischen Arten (Species). In Eschwege, W. L., ed., Journal von Brasilien, Weimar, 15(2): 192-237.

This reference doesn't seem to be online.

The upshot of all this is that information about the host of Plasmodium chabaudi is hidden behind taxonomic name changes, and databases that one might expect to help simply don't. If names are the glue that links biodiversity data together then we need to get a lot better at making basic information about name changes accessible, otherwise we are creating islands of disconnected data.

Thursday, July 12, 2012

Dimly lit taxa - guest post by Bob Mesibov

The following is a first for iPhylo, a guest post by Bob Mesibov.

Rod Page introduced 'dark taxa' here on iPhylo in April 2011. He wrote:

The bulk of newly added taxa in GenBank are what we might term "dark taxa", that is, taxa that aren't identified to a known species. This doesn't necessarily mean that they are species new to science, we may already have encountered these species before, they may be sitting in museum collections, and have descriptions already published. We simply don't know. As the output from DNA barcoding grows, the number of dark taxa will only increase, and macroscopic biology starts to look a lot like microbiology.

Rod suggested that 'quite a lot' of biology can be done without taxonomic names. For the dark taxa in GenBank, that might well mean doing biology without organisms – a surprising thought if you're a whole-organism biologist.

Non-taxonomists may be surprised to learn that a lot of taxonomy is also done, in fact, without taxonomic names. Not only is there a 'dark taxa' gap between putative species identified genetically and Linnaean species described by specialists, there's a 'dimly lit taxa' gap between the diversity taxonomists have already discovered, and the diversity they've named.

Dimly lit taxa range from genera and species given code names by a specialist or a group of collaborators, and listed by those codes in publications and databases, to potential type specimens once seen and long remembered by a specialist who plans to work them up in future, time and workload permitting.

In that phrase 'time and workload permitting' is a large part of the explanation for dimly lit taxa. Over the past month I created 71 species of this kind myself. Each has been code-named, diagnostically imaged, databased and placed in code-labelled bottles on museum shelves. The relevant museums have been given digital copies of the images and data.

The 71 are 'species-in-waiting'. They aren't formally named and described, but specialists like myself can refer to the images and data for identifying new specimens, building morphological and biogeographical hypotheses, and widening awareness of diversity in the group to which the 71 belong.

'Time and workload permitting'. Many of the 71 are poor-quality or fragmented museum specimens from which important morphological data, let alone sequences, cannot be obtained. Fresh specimens are needed, and fieldwork is neither quick nor easy. In my special corner of zoology, as in most such corners in zoology and botany, the widespread and abundant species are all, or nearly all, named. The unnamed rump consists of a huge diversity of geographically restricted and uncommon species. There are more than 71 in that group of mine; those are just the rare species I know about, so far.

'Time and workload permitting'. A non-taxonomist might ask, 'Why don't you just name and describe the 71 briefly, so that the names are at least available, and the gap between what's known and what's named is narrowed?' The answer is simple: inadequate descriptions are the bane of taxonomy. There are hundreds of species in my special group that were named and inadequately described long ago, and which wind up on checklists of names as 'nomen dubium' and 'incertae sedis'. Clearing up the mysteries means locating the types (which hopefully still exist) and studying them. That slow and tedious study would better have been done by the first describer.

Cybertaxonomic tools can help bring dimly lit taxa into full light, but not much. The rate-limiting steps in lighting up taxa are in the minds and lives of human taxonomists coping with the huge and bewilderingly complex diversity of life. It's not the tools used after the observing and thinking is done, it's the observing and thinking.

In their article 'Ramping up biodiversity discovery via online quantum contributions' (http://dx.doi.org/10.1016/j.tree.2011.10.010), Maddison et al. argue that the pace of naming and description can be increased if information about what I've called dimly lit taxa is publicly posted, piece by piece, 'publish as you go', on the Internet. In my case, I would upload images and data for my 71 'species-in-waiting' to suitable sites and make them freely available.

Excited by these discoveries, amateurs and professionals would rush to search for fresh specimens. Specialists would drop whatever else they were doing, borrow the existing specimens of the 71 from their repositories and do careful inventories of the morphological features I haven't documented. Aroused from their humdrum phylogenetic analyses of other organisms, molecular phylogeny labs would apply for extra funding to work on my 71 dimly lit taxa. In no time at all, a proud team of amateurs and specialists would be publishing the results of their collaboration, with 71 names and descriptions.

Shortly afterwards, flocks of pigs would slowly circle the 71 type localities, flapping their wings in unison.

Memo to Maddison et al. and other would-be reformers: the rate of taxonomic discovery and documentation is very largely constrained by the supply of taxonomists. You want more names, find more namers.

Wednesday, July 11, 2012

Citations, Social Media & Science

Quick note that Morgan Jackson (@BioInFocus) has written a nice blog post, Citations, Social Media & Science, inspired by the fact that the following paper:

Kwong, S., Srivathsan, A., & Meier, R. (2012). An update on DNA barcoding: low species coverage and numerous unidentified sequences. Cladistics, no–no. doi:10.1111/j.1096-0031.2012.00408.x

cites my "Dark taxa" in the body of the text but not in the list of literature cited. This prompted some discussion of DOIs and blog posts on Twitter:

@rdmpage @mfenner @martin_eve @miketaylor Yes. Basically we are looking at variation of what I discussed back in 2007: goo.gl/WzLm8
— Geoffrey Bilder (@gbilder) July 11, 2012

Read Morgan's post for more on this topic. While I personally would prefer to see my blog posts properly cited in papers like doi:10.1111/j.1096-0031.2012.00408.x, I suspect the authors did what they could given current conventions (blogs lack DOIs, are treated differently from papers, and many publishers cite URLs in the text, not the list of references cited). If we can provide DOIs (ideally from CrossRef so we become part of the regular citation network), suitable archiving, and — most importantly — content that people consider worthy of citation then perhaps this practice will change.

Friday, July 06, 2012

Post GBIC2012 thoughts

I'm back from Copenhagen and GBIC2012. The meeting spanned three fairly intense days (with the days immediately before and after also working days for some of us), and was run by a group of facilitators led by Natasha Walker, who described us as "an interesting (and delightfully brainy, if sometimes scatty) group of academics, researchers, museum managers and people close to policy...". I've attempted to capture tweets about the meeting using Storify.

There will be a document (perhaps several) based on the meeting, but until then here are a few quick thoughts. Note that the comments below are my own and you shouldn't read into this anything about what directions the GBIC document(s) will actually take.

Microbiology rocks

Highlight of the first day was Robert J. Robbins' talk which urged the audience to consider that life was mostly microbial, that the things most people in the room cared about were actually merely a few twigs on the tree of life, that the tree of life didn't actually exist anyway, and many of the concepts that made sense for multicellular organisms simply didn't apply in the microbial world. Basically it was a homage to Carl Woese (see also Pace et al. 2012 doi:10.1073/pnas.1109716109) and a wake-up call to biodiversity informaticians to stop viewing the world through multicellular eyes. (You can find all the keynotes from the first day here).

[Figure: tree of life]

From Pace, N. R. (1997). A Molecular View of Microbial Diversity and the Biosphere. Science, 276(5313), 734–740. doi:10.1126/science.276.5313.734

Sequences rule

The future of a lot of biodiversity science belongs to sequences, from simple DNA barcoding as a tool for species discovery and identification, metabarcoding as a tool for community analysis, to comparisons of metabolic pathways and beyond. The challenge for classical biodiversity informatics is how to engage with this, and to what extent we should try and map between, say, sequences and classical taxa, or whether it might make more sense (gasp) to abandon the taxonomic legacy and move on. Perhaps a more nuanced response is that the point of connection between sequences and classical biodiversity data is unlikely to be at the level of taxonomic names (which are mostly tags for collections of things that look similar) but at the level of specimens and observations.

Ontologies considered harmful

This is my own particular hobby horse. Often the call would come "we need an ontology", to which I respond: read Ontology is Overrated: Categories, Links, and Tags. I have several problems with ontologies. The first is that they are too easy to make and distract from the real problem. From my perspective a big challenge is linking data together, that is, going from two unconnected things, A and B, to A linked to B.

Let's leave aside what "A" and "B" are (I suspect it matters less than people think), once we have the link then we can start to do stuff. From my perspective, what ontologies give us is basically this: the same link between A and B, now with a label on it.

So now we know the "type" of the link (e.g., "is a part of", "cites approvingly", etc.). I'm not arguing that this isn't useful to have, but if you don't have the network of links then typing the links becomes an idle exercise.

To give an example, the web itself can be modelled as simply nodes connected by links, ignoring the nature of the links between the web pages. The importance of those links can be inferred later from properties of the network. To a first approximation this is how Google works: it doesn't ask what the links "mean", it simply investigates the connections to determine how important each web page is. In the same way, we build citation networks without bothering to ask the nature of the citation (yes I know there are ontologies for citations, but anyone willing to bet how widely they'll be adopted?).
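
To illustrate how importance can fall out of untyped links alone, here is a minimal Python sketch of a PageRank-style calculation. It is a textbook power iteration, not Google's actual implementation; the three-page graph, the damping factor of 0.85, and the iteration count are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes using only who links to whom; links carry no type.

    links maps each node to the list of nodes it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node gets a small baseline share of the total rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node passes its rank on, split evenly among its links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank over everyone.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A and B both link to C, and C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Nothing in the input says what any link means, yet the page with the most incoming links ends up ranked highest, which is the sense in which the network has to come before the typing.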

My second complaint is that building ontologies is easy, "easy" in the sense that you get a bunch of people together, they squabble for a long time about terminology, and out comes an ontology. Maybe, if you're lucky, someone will adopt it. The cost of making ontologies, and indeed of adopting them is relatively low (although it might not seem like it at the time). The cost of linking data is, I'd argue, higher, because it requires that you trust someone else's identifiers to the extent that you use them for things you care about deeply. Consider the citation network that is emerging from the widespread adoption of DOIs by the publishing industry. Once people trust that the endpoints of the links will survive, then the network starts to grow. But without that trust, that leap of faith, there's no network (unless you have enough resources to build the whole thing internally yourself, which is what happened with the closed citation network owned by Thomson Reuters). It's much easier to silo the data using unique identifiers than it is to link to other data (it's a variant of the "not invented here" syndrome).

Lastly, ontologies can have short lives. They reflect a certain world view that can become out of date, or supplanted if the relationships between things that the ontology cares about can be computed using other data. For example, biological taxonomy is a huge ontology that is rapidly being supplanted by phylogenetic trees computed from sequence (and other) data (compare the classification used by flagship biodiversity projects like GBIF and EOL with the Pace tree of life shown above). Who needs an ontology when you can infer the actual relationships? Likewise, once you have GPS the value of a geographic ontology (say of place names) starts to decline. I can compute if I'm on a mountain simply by knowing where I am.

I'm not saying ontologies are always bad (they're not), nor that they can't be cool (they can be), I'm just suggesting that they aren't the first thing you need. And they certainly aren't a prerequisite for linking stuff together.

Google flu trends

Perhaps the most interesting idea that emerged was the notion of intelligently detecting changes in biodiversity (which is the kind of thing a lot of people want to know) in a way analogous to how Google.org's Flu Trends uses flu-related search terms to predict flu outbreaks:

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2008). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014. doi:10.1038/nature07634

Could we do something like this for biodiversity data? For various reasons this suggestion became known at GBIC2012 as the "Heidorn paradigm".

Thinking globally

One challenge for a meeting like GBIC 2012 is scope. There's so much cool stuff to think about. From my perspective, a useful filter is to ask "what will happen anyway?" In other words, there is a lot of stuff (for example the growth of metabarcoding) that will happen regardless of anything the biodiversity informatics community does. People will make taxon-specific ontologies for organismal traits, digitise collections, assess biodiversity, etc. without necessarily requiring an entity like GBIF. The key question is "what won't happen at a global scale unless GBIF (or some other entity) gets involved?"

A Vast Machine

Lastly, in one session Tom Moritz mentioned a book that he felt we could learn from (A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming). The book recounts the history of climatology and its slow transition to a truly global science. I've started to read it, and it's fascinating to see the interplay between early visions of the future, and the technology (typically driven by military or large-scale commercial interests) that made possible the realisation of those visions. This is one reason why predicting the future is such a futile activity: the things that have the biggest effect come from unexpected sources, and affect things in ways that are hard to anticipate. On a final note, it took about a minute from the time Tom mentioned the book to the time I had a copy from Amazon in the Kindle app on my iPad. Oh that accessing biodiversity data were that simple.

Sunday, July 01, 2012

Using orthographic projections to map organism distributions

For a project I'm currently working on I show organism distributions using data from GBIF, and I display that data on a map that uses the equirectangular projection. I've recently started to create a series of base maps using the GBIF colour scheme, which is simple but effective:

#666698 for the sea
#003333 for the land
#006600 for borders
yellow for localities

The distribution map is created by overlaying points on a bitmap background using SVG (see SVG specimen maps from SPARQL results for details). SVG is ideally suited to this because you can take the points, plot them in the x,y plane (where x is longitude and y is latitude) then use SVG transformations to move them to the proper place on the map.
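
As a rough illustration of the overlay idea, here is a small Python sketch that writes such an SVG. It assumes a 360 by 180 pixel equirectangular base map (one pixel per degree); the file name `gbif360x180.png`, the function names, and the circle radius are hypothetical, and the actual maps may be built differently.

```python
def to_pixel(lon, lat):
    """What the SVG transform below does to a single (lon, lat) point."""
    return (lon + 180, 90 - lat)


def specimen_map_svg(localities, background="gbif360x180.png"):
    """Overlay (longitude, latitude) points on a 360x180 bitmap base map.

    Points are written in plain lon/lat coordinates; the group transform
    moves the origin to the map centre and flips the y axis so that
    north is up.
    """
    circles = "\n".join(
        f'    <circle cx="{lon}" cy="{lat}" r="1" fill="yellow" />'
        for lon, lat in localities
    )
    return f"""<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     width="360" height="180" viewBox="0 0 360 180">
  <image xlink:href="{background}" x="0" y="0" width="360" height="180" />
  <g transform="translate(180,90) scale(1,-1)">
{circles}
  </g>
</svg>"""


# One illustrative point at 147 degrees east, 6 degrees south.
print(specimen_map_svg([(147.0, -6.0)]))
```

The appeal of doing it this way is that the point coordinates stay as recorded; only the single `transform` attribute knows anything about the layout of the map.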

For the base maps themselves I've also started to use SVG, partly because it's possible to edit them with a text editor (for example if you want to change the colours). I then use Inkscape to export the SVG to a PNG to use on the web site.


One thing that has bothered me about the equirectangular projection is that, although it is familiar and easy to work with, it gives a distorted view of the world:

This is particularly evident for organisms that have a circumpolar distribution. For example, Kerguelen's petrel Aphrodroma has a distribution that looks like this using the equirectangular projection:

This long, thin distribution looks rather different if we display it on a polar projection:

Likewise, classic Gondwanic distributions such as that of Gripopterygidae become clearer on a polar projection.

Computing the polar coordinates for a set of localities is straightforward (see for example this page) and using SVG to lay out the points also helps, because it's trivial to rotate them so that they match the orientation of the map. Ultimately it would be nice to have an embedded, rotatable 3D globe (like the Google Earth plugin, or a Javascript+SVG approach like this). But for now I think it's nice to have the option of using different projections available to help display distributions more faithfully.
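
For the coordinate calculation itself, here is a minimal Python sketch. It assumes an orthographic projection centred on a pole, as the title of this post suggests; the function name, the default radius of 90 units, and the choice to drop points in the far hemisphere are illustrative assumptions rather than a description of the code behind the maps.

```python
import math


def polar_orthographic(lon, lat, radius=90.0, south=True):
    """Project a (lon, lat) point in degrees onto a disc centred on a pole.

    Returns (x, y) with y pointing up, or None if the point lies in the
    hemisphere facing away from the chosen pole.
    """
    if (south and lat > 0) or (not south and lat < 0):
        return None  # an orthographic view shows only one hemisphere

    lam = math.radians(lon)
    phi = math.radians(lat)
    x = radius * math.cos(phi) * math.sin(lam)
    y = radius * math.cos(phi) * math.cos(lam)
    if not south:
        y = -y  # looking down on the north pole reverses the y direction
    return x, y


# The pole itself lands at the centre of the disc, the equator on its rim.
print(polar_orthographic(0, -90))
print(polar_orthographic(90, 0))
```

Because y points up here, plotting the result in SVG still needs a flip of the y axis, and a rotation can be added to the same transform to match the orientation of the base map.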

The bitmap maps and their SVG sources are available on github.