Monday, August 28, 2017

Let’s rise up to unite taxonomy and technology

Holly Bik (@hollybik) has an opinion piece in PLoS Biology entitled "Let’s rise up to unite taxonomy and technology" https://doi.org/10.1371/journal.pbio.2002231 (thanks to @sjurdur for bringing this to my attention).

[Figure 1 from Bik (2017), pbio.2002231.g001: mockup of the proposed tool]

It's a passionate plea for integrating taxonomic knowledge and "omics" data. In her article Bik includes a mockup of the kind of tool she'd like to see (based in part on Phinch), and writes:

Step 2: Clicking on a specific data point (e.g., an OTU) will pull up any online information associated with that species ID or taxonomic group, such as Wikipedia entries, photos, DNA sequences, peer-reviewed articles, and geolocated species observations displayed on a map.

This sort of plea has been made many times, and reminds me very much of PLoS's own efforts when they wanted to build a "Biodiversity Hub" and biodiversity informatics basically failed them. The hub itself later closed down. There's clearly a need for a simple way to summarise what we know about a species, but we've yet to really tackle this (on the face of it) fairly simple task.

Quickly summarising the available information about a species was the motivation behind my little tool iSpecies, which I recently reworked to use DBpedia, GBIF, CrossRef, EOL, TreeBASE and OpenTreeofLife as sources. For the nematode featured in Bik's figure (Desmoscolex) there's not a great deal of easily available information (see http://ispecies.org/?q=Desmoscolex). We can get a little more from other sources not queried by iSpecies, such as BioNames, which aggregates the primary taxonomic literature, see http://bionames.org/search/Desmoscolex.
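To give a flavour of how little glue this sort of lookup needs, here's a sketch of a name-based query against GBIF's public species match API. The endpoint is real; the wrapper function is just illustrative, not iSpecies code:

```typescript
// Sketch: match a taxonomic name against the GBIF backbone using the
// public species match endpoint. The endpoint exists; this wrapper
// function is hypothetical glue, not part of iSpecies.
async function matchName(name: string): Promise<void> {
  const url =
    "https://api.gbif.org/v1/species/match?name=" + encodeURIComponent(name);
  const response = await fetch(url);
  const match = await response.json();
  // When a match is found the response includes fields such as
  // usageKey, scientificName, and rank.
  console.log(match.usageKey, match.scientificName, match.rank);
}

matchName("Desmoscolex");
```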

Part of the problem is that taxonomy is fundamentally a "long tail" field, both in terms of the subject matter (a few very well-known species, then millions of poorly known species) and our knowledge of those species (a large, scattered taxonomic literature, much of it not yet digitised, although progress is being made). Furthermore, the names of species (and our conception of them) can change, adding an additional challenge.

But I think we can do a lot better. Simple web-based tools like iSpecies can assemble reasonable information from multiple sources (and in multiple languages) on the fly. It would be nice to expand those sources (the more primary sources the better). The current iSpecies tool searches on species name, which works well if the sources being queried mention that name (e.g., in the title of a paper that has a DOI and is indexed by CrossRef). Given that many of the "omics" datasets Bik works with are likely to have dark taxa, what we'll also need is the ability to search using, say, NCBI taxon ids, and retrieve literature linked to sequences for those taxa.
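NCBI's E-utilities already expose cross-database links, so one possible first step is to ask the elink endpoint for PubMed records linked to a taxonomy id. A sketch (the endpoint and parameters are real; the wrapper function is mine):

```typescript
// Sketch: retrieve PubMed ids linked to an NCBI taxonomy id via the
// E-utilities elink endpoint. The endpoint and parameters are real;
// the wrapper and its (absent) error handling are illustrative only.
async function literatureForTaxon(taxonId: string): Promise<string[]> {
  const url =
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi" +
    `?dbfrom=taxonomy&db=pubmed&id=${taxonId}&retmode=json`;
  const response = await fetch(url);
  const data = await response.json();
  // The JSON response groups linked ids into "linksets"; pull the
  // PubMed ids out of the first linkset, if there is one.
  return data.linksets?.[0]?.linksetdbs?.[0]?.links ?? [];
}

// Usage: pass the NCBI taxon id of interest (6231 is Nematoda).
literatureForTaxon("6231").then((pmids) => console.log(pmids));
```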

It would also be useful to package those results up in a simple API that other tools could consume. For example, if I wanted to improve the utility of iSpecies, one approach would be to package up the results in a JSON object. Perhaps even use JSON-LD (with global identifiers for taxa, documents, etc.) to make it possible for consumers to easily integrate that data with their own.
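A minimal sketch of what such a payload might look like. The context, identifiers, and property choices here are my own guesses for illustration (this is not an existing iSpecies format); only the vocabularies themselves, schema.org and Darwin Core, are real:

```typescript
// Sketch of a JSON-LD payload an iSpecies-style API might return.
// Identifiers below are hypothetical placeholders.
const result = {
  "@context": {
    "@vocab": "http://schema.org/",
    dwc: "http://rs.tdwg.org/dwc/terms/",
  },
  "@type": "dwc:Taxon",
  "@id": "https://www.gbif.org/species/0000000", // hypothetical taxon URI
  name: "Desmoscolex",
  subjectOf: [
    {
      "@type": "ScholarlyArticle",
      "@id": "https://doi.org/10.0000/example", // hypothetical DOI
      name: "A paper that mentions this taxon",
    },
  ],
};

console.log(JSON.stringify(result, null, 2));
```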

Bik writes:

Taxonomy could be on the brink of another golden age—if we play our cards right. As it is reinvented and reborn in the 21st century, taxonomy needs to retain its traditional organismal-focused approaches while simultaneously building bridges with phylogenetics, ecology, genomics, and the computational sciences.

Taxonomy is, of course, doing just this, albeit not nearly fast enough. There are some pretty serious obstacles, some of them cultural, but some of them due to the nature of the problem. Taxonomic knowledge is massively decentralised, mostly non-digital, and many of the key sources and aggregations are behind paywalls. There is also a fairly large "technical debt" to deal with. Ian Mulvany was recently interviewed by PLoS and he emphasised that because academic publishers went online early they were pioneers, but at the same time this left them with a legacy of older technologies and approaches that can sometimes get in the way of new ideas. I think taxonomy suffers from some of the same problems. Because taxonomy has long been involved with computers, sometimes we ended up betting on the "wrong" solutions. For example, at one time XML was the new hotness, and people invested a lot of effort in developing XML schemas, and then ontologies and RDF vocabularies. Meanwhile much of the web has moved to simple data formats such as JSON, many specialist vocabularies are gathering dust as schema.org takes off, and projects like Wikidata force us to rethink the need for topic-specific databases.

But these are technical details. For me the key point of "Let’s rise up to unite taxonomy and technology" is that it's a symptom of the continued failure of biodiversity informatics to actually address the needs of its users. People keep asking for fairly simple things, and we keep ignoring them (or explaining why it's MUCH harder than people think, which is another way of ignoring them).

Sunday, August 20, 2017

Notes on displaying big trees using Google Maps/Leaflet

Notes to self on web map-style tree viewers. The basic idea is to use Google Maps or Leaflet to display a tree. Hence we need to compute tiles. One approach is to use a database that supports spatial queries to store the x,y coordinates of the tree. When we draw a tile we compute the coordinates of that tile, based on position and zoom level, do a spatial query to extract all lines that intersect with the rectangle for that tile, and draw those.

A nice example of this is Lifemap (see also De Vienne, D. M. (2016). Lifemap: Exploring the Entire Tree of Life. PLOS Biology, 14(12), e2001624. doi:10.1371/journal.pbio.2001624).
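Whichever storage we use, computing the rectangle a given tile covers is simple arithmetic. A sketch in TypeScript, assuming (my choice, purely for illustration) that the tree has been laid out inside the unit square:

```typescript
// Sketch: map a web-map tile address (x, y, zoom) to the rectangle it
// covers in the tree's layout space, assuming the whole tree has been
// laid out inside the unit square [0,1] x [0,1].
interface BBox {
  minX: number;
  minY: number;
  maxX: number;
  maxY: number;
}

function tileToBBox(x: number, y: number, zoom: number): BBox {
  const n = 2 ** zoom; // number of tiles per side at this zoom level
  const size = 1 / n;  // width and height of one tile in layout space
  return {
    minX: x * size,
    minY: y * size,
    maxX: (x + 1) * size,
    maxY: (y + 1) * size,
  };
}
```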

It occurs to me that for trees that aren't too big we could do this without an external database. For example, what if we used a Javascript implementation of an R-tree, such as imbcmdth/RTree or its fork leaflet-extras/RTree. So, we could compute the coordinates of the nodes in the tree in "geographic" space, store the bounding box for each line/arc in an R-tree, then query that R-tree for lines that intersect with the bounding box of the relevant tile. We could use a clipping algorithm to only draw the bits of the lines that cross the tile itself.
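By way of a sketch, here's the same idea using rbush, another JavaScript R-tree (I'm using it here only because I know its API; the libraries above would work the same way). The toy segments stand in for a real tree layout:

```typescript
import RBush from "rbush";

// Sketch: index the bounding boxes of a laid-out tree's line segments
// in an R-tree, then pull back just the segments that intersect one
// tile's rectangle.
interface Segment {
  minX: number;
  minY: number;
  maxX: number;
  maxY: number;
  branchId: number; // which branch of the tree this segment draws
}

const segments: Segment[] = [
  { minX: 0.1, minY: 0.1, maxX: 0.4, maxY: 0.1, branchId: 0 },
  { minX: 0.4, minY: 0.1, maxX: 0.4, maxY: 0.6, branchId: 1 },
];

const index = new RBush<Segment>();
index.load(segments); // bulk insert is faster than adding one at a time

// Everything intersecting tile (1, 0) at zoom 2, via tileToBBox above;
// these are the segments we'd clip and draw onto that tile.
const hits = index.search(tileToBBox(1, 0, 2));
console.log(hits.map((s) => s.branchId));
```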

Web maps, at least in my experience, make trips to a tile server to fetch each tile; here we would instead want to call a routine within our web page, because all the data would already be loaded into that page. So we'd need to modify the tile-creating code.
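In Leaflet this looks quite tractable, because L.GridLayer lets you supply the tile-drawing routine yourself. A sketch (createTile is Leaflet's documented extension point; drawTreeTile is a hypothetical routine that would run the R-tree query above):

```typescript
import L from "leaflet";

// Sketch: a Leaflet layer whose tiles are canvases drawn in the page
// rather than images fetched from a server.
const TreeLayer = L.GridLayer.extend({
  createTile: function (coords: L.Coords): HTMLElement {
    const tile = document.createElement("canvas");
    const size = this.getTileSize();
    tile.width = size.x;
    tile.height = size.y;
    const ctx = tile.getContext("2d"); // 2-D drawing context for this tile
    // drawTreeTile(ctx, coords.x, coords.y, coords.z); // hypothetical
    return tile;
  },
});

// Usage: add an instance to a map using non-geographic coordinates.
// const map = L.map("map", { crs: L.CRS.Simple });
// map.addLayer(new TreeLayer());
```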

The ultimate goal would be to have a single page web app that accepts a Newick-style tree and converts it into a browsable, zoomable visualisation.
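The first step of that pipeline, parsing the Newick string itself, takes surprisingly little code. A sketch (it ignores quoting and leaves branch lengths attached to names, so it's a starting point rather than a full parser):

```typescript
// Sketch: parse a Newick string into nested nodes. "(" opens a clade
// (the current node becomes the clade; push it and descend into its
// first, still-empty child), "," starts a sibling in the enclosing
// clade, ")" pops back to the clade node, and any other token names
// the current node (leaf labels and internal node labels alike).
interface TreeNode {
  name: string;
  children: TreeNode[];
}

function parseNewick(newick: string): TreeNode {
  const root: TreeNode = { name: "", children: [] };
  const stack: TreeNode[] = [];
  let current = root;
  for (const token of newick.split(/([();,])/).filter((t) => t !== "")) {
    if (token === "(") {
      const child: TreeNode = { name: "", children: [] };
      current.children.push(child);
      stack.push(current);
      current = child;
    } else if (token === ",") {
      const child: TreeNode = { name: "", children: [] };
      stack[stack.length - 1].children.push(child);
      current = child;
    } else if (token === ")") {
      current = stack.pop()!;
    } else if (token !== ";") {
      current.name = token;
    }
  }
  return root;
}

console.log(JSON.stringify(parseNewick("((A,B),C);")));
```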