Wednesday, September 23, 2015

Visualising big phylogenies (yet again)

Inspired in part by the release of the draft tree of life (doi:10.1073/pnas.1423041112) by the Open Tree of Life, I've been revisiting (yet again) ways to visualise very big phylogenies (see Very large phylogeny viewer for my last attempt).

My latest experiment uses Google Maps to render a large tree. Google Maps uses "tiles" to create a zoomable interface, so we need to create tiles for the phylogeny at different zoom levels. To create this visualisation I did the following:

  1. The phylogeny is laid out in a 256 x 256 grid.
  2. The position of each line in the drawing is stored in a MySQL database as a spatial element (in this case a LINESTRING).
  3. When the Google Maps interface needs to display a tile at a given zoom level and location, a spatial SQL query retrieves the lines that intersect the bounds of that tile, which are then drawn using SVG (see the sketch below).
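
To make step 3 concrete, here is a minimal sketch of how a tile request could be turned into a spatial query. The table and column names (tree_lines, node_id, line) are hypothetical, and the SQL uses MySQL 5.x spatial functions; the real code may differ.

```python
# A minimal sketch of the tile query in step 3. Table and column names are
# hypothetical. The tree is laid out in 256 x 256 "world" coordinates, so the
# tile (x, y) at a given zoom level covers a square of side 256 / 2^zoom.

def tile_bounds(x, y, zoom, world_size=256.0):
    """Return (left, top, right, bottom) of a Google Maps tile in world coordinates."""
    side = world_size / (2 ** zoom)
    return (x * side, y * side, (x + 1) * side, (y + 1) * side)

def tile_query(x, y, zoom):
    """Build a MySQL spatial query for the branch LINESTRINGs that intersect a tile."""
    left, top, right, bottom = tile_bounds(x, y, zoom)
    polygon = "POLYGON(({l} {t},{r} {t},{r} {b},{l} {b},{l} {t}))".format(
        l=left, t=top, r=right, b=bottom)
    # MBRIntersects can use a spatial index, so only lines whose bounding
    # boxes overlap the tile are fetched.
    return ("SELECT node_id, AsText(line) FROM tree_lines "
            "WHERE MBRIntersects(line, GeomFromText('{0}'))".format(polygon))

print(tile_query(2, 1, 2))   # the tile two across, one down, at zoom level 2
```

The rows returned are then rendered as SVG paths to produce the tile.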
Hence, the tiles are drawn on the fly rather than stored as images on disk. This is a crude proof of concept so far; there are a few issues to tackle to make it usable:
  1. At the moment there are no labels. I would need a way to compute what labels to show at what zoom level ("level of detail"). In other words, at low levels of zoom we want just a few higher taxa to be picked out, whereas as we zoom in we want more and more taxa to be labelled, until at the highest zoom levels the tips are individually labelled.
  2. Ideally each tile would require roughly the same amount of effort to draw. However, at the moment the code is very crude and simply draws every line that intersects a tile. For example, for zoom level 0 the entire tree fits on a single tile, so I draw the entire tree. This is not going to scale to very large trees, so I need to be able to "cull" a lot of the lines and draw only those that will be visible.
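
As a first pass at culling (a sketch only, and assuming the line coordinates are available in the 256 x 256 layout used above), anything whose extent is smaller than a single pixel at the current zoom level can be dropped, or collapsed into a single representative stroke:

```python
# A minimal culling sketch. At zoom level z a 256-pixel tile spans 256 / 2^z
# world units, so one screen pixel corresponds to 1 / 2^z world units; a line
# smaller than that cannot be seen and need not be drawn.

def pixel_size(zoom):
    """World units per screen pixel at a given zoom level (256 x 256 world)."""
    return 1.0 / (2 ** zoom)

def visible_lines(lines, zoom):
    """Keep only the lines that span at least one pixel at this zoom level.

    Each line is ((x1, y1), (x2, y2)) in world coordinates."""
    threshold = pixel_size(zoom)
    for (x1, y1), (x2, y2) in lines:
        if abs(x2 - x1) >= threshold or abs(y2 - y1) >= threshold:
            yield (x1, y1), (x2, y2)

lines = [((0.0, 0.0), (10.0, 0.0)), ((5.0, 5.0), (5.0, 5.001))]
print(list(visible_lines(lines, zoom=0)))   # the second, sub-pixel line is culled
```

Combined with the spatial index, this kind of filter should help keep the number of lines drawn per tile roughly constant, since detail that is invisible at one zoom level only gets drawn once you zoom in far enough to see it.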
In the past I've steered away from Google Maps-style interfaces because the image is zoomed along both the x and y axes, which is not necessarily ideal. But the tradeoff is that I don't have to do any work handling user interactions; I just need to focus on efficiently rendering the image tiles.

All very crude, but I think this approach has potential, especially if the "level of detail" issue can be tackled.

Friday, September 18, 2015

Towards an interactive web-based phylogeny editor (à la MacClade)

Currently in classes where I teach the basics of tree building, we still fire up ancient iMacs, load up MacClade, and let the students have a play. Typically we give them the same data set and have a class competition to see which group can get the shortest tree by manually rearranging the branches. It’s fun, but the computers are old, and what’s nostalgic for me seems alien to the iPhone generation.

One thing I’ve always wanted to have is a simple MacClade-like tree editor for the Web, where the aim is not so much character analysis as teaching the basics of tree building. Something with the same ease of use as Phylo (basically Candy Crush for sequence alignments).


The challenge is to keep things as simple as possible. One idea is to have a column of taxa and you can drag individual taxa up and down to rearrange the tree.

Interactive tree design notes

Imagine each row has the characters and their states. Unlike the Phylo game, where the goal is to slide the amino acids until you get a good alignment, here we want to move the taxa to improve the tree (e.g., based on its parsimony score).

The problem is that we need to be able to generate all possible rearrangements for a given number of taxa. In the example above, if we move taxon C, there are five possible positions where it could attach to the remaining subtree (the remaining rooted tree of three taxa has four branches, plus the position above the root, giving five attachment points).

But if we simply shuffle the order of the taxa we can’t generate all the trees. However, if we remember that we also have the internal nodes, then there is a simple way we can generate them. When we draw a tree, each row corresponds to a node, and the gap between each pair of adjacent leaves (the taxa A, B, D) corresponds to an internal node. So we could divide the drawing up into “hit zones”: if you drag the taxon we’re adding (“C”) onto the zone centred on a leaf, we add the taxon below that leaf; if we drag it onto a zone between two leaves, we attach it below the corresponding internal node (see the sketch below). From the user’s point of view they are still simply sliding taxa up and down, but in doing so we can create each of the possible trees.
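
Here is a rough sketch of the hit-zone mapping (the row height, function names, and thresholds are my own invention, and in the browser this would be Javascript; Python is used here just to keep the sketch short):

```python
# A sketch of the "hit zone" idea. Leaves are drawn as evenly spaced rows, with
# leaf i centred at (i + 0.5) * row_height pixels. The zone around a leaf's
# centre means "attach C below that leaf"; the zone between two adjacent leaf
# centres means "attach C below the corresponding internal node".

def hit_zone(y, leaves, row_height=40.0):
    """Map a vertical drag position (in pixels) to a drop target.

    Returns ('leaf', name) or ('gap', (upper_leaf, lower_leaf))."""
    pos = y / row_height - 0.5          # 0.0 at the first leaf's centre, 1.0 at the next, ...
    nearest = max(0, min(int(round(pos)), len(leaves) - 1))
    if abs(pos - nearest) <= 0.25:      # close enough to a leaf's centre
        return ('leaf', leaves[nearest])
    upper = max(0, min(int(pos), len(leaves) - 2))
    return ('gap', (leaves[upper], leaves[upper + 1]))

leaves = ['A', 'B', 'D']
print(hit_zone(60, leaves))   # on B's centre   -> ('leaf', 'B')
print(hit_zone(80, leaves))   # between B and D -> ('gap', ('B', 'D'))
```

Dropping C on ('leaf', 'B') would attach it below that leaf; dropping it on ('gap', ('B', 'D')) would attach it below the internal node that the gap represents.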

We could implement this in a Web browser with some Javascript to handle the user moving the taxa, draw the corresponding phylogeny to the left, and quickly update the (say, parsimony) score of the tree so that the user gets immediate feedback on whether the rearrangement they’ve made improves the tree.
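
For the scoring step, a minimal sketch using Fitch's algorithm for unordered characters is shown below (the nested-tuple tree and the toy matrix are just for illustration; in the actual tool this would run as Javascript in the browser):

```python
# A minimal Fitch parsimony sketch, assuming unordered characters and a rooted
# binary tree written as nested tuples, e.g. ((('A', 'B'), 'C'), 'D'). The score
# is the number of state changes summed over characters, cheap enough to
# recompute after every drag.

def fitch(tree, states):
    """Return (state_set, changes) for one character on a rooted binary tree."""
    if isinstance(tree, str):                 # a leaf: look up its observed state
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:                                # intersection: no extra change at this node
        return shared, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_score(tree, matrix):
    """Sum the Fitch changes over every character in a taxon-by-character matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(fitch(tree, {taxon: row[i] for taxon, row in matrix.items()})[1]
               for i in range(n_chars))

matrix = {'A': '000', 'B': '011', 'C': '001', 'D': '111'}
print(parsimony_score(((('A', 'B'), 'C'), 'D'), matrix))   # 4 changes
print(parsimony_score(((('A', 'C'), 'B'), 'D'), matrix))   # 3 changes: a better tree
```

Here the second topology scores lower than the first, which is exactly the kind of instant feedback the game needs.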

I think this could be a fun teaching tool, and if it supported touch then students could use their phones and tablets to get a sense of how tree building works.

Thursday, September 17, 2015

On having multiple DOI registration agencies for the same journal

On Friday I discovered that BHL has started issuing CrossRef DOIs for articles, starting with the journal Revue Suisse de Zoologie. The metadata for these articles comes from BioStor. After a WTF and WWIC moment, I tweeted about this, and something of a Twitter storm (and email storm) ensued.

To be clear, I'm very happy that BHL is finally assigning article-level DOIs, and that it is doing this via CrossRef. Readers of this blog may recall an earlier discussion about the relative merits of different types of DOIs, especially in the context of identifiers for articles. The bulk of the academic literature has DOIs issued by CrossRef, and these come with lots of nice services that make them a joy to use if you are a data aggregator, like me. There are other DOI registration agencies minting DOIs for articles, such as Airiti Library in Taiwan (e.g., doi:10.6165/tai.1998.43(2).150) and ISTIC (中文DOI) in China (e.g., doi:10.3969/j.issn.1000-7083.2014.05.020) (pro tip, if you want to find out the registration agency for a DOI, simply append it to http://doi.crossref.org/doiRA/, e.g. http://doi.crossref.org/doiRA/10.6165/tai.1998.43(2).150). These provide stable identifiers, but not the services needed to match existing bibliographic data to the corresponding DOI (as I discovered to my cost while working with IPNI).
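
That lookup is easy to script; here is a minimal sketch (assuming the requests package, and that the doiRA service still returns a JSON array with an "RA" field per DOI, which is its behaviour at the time of writing):

```python
# Look up the DOI registration agency using the CrossRef doiRA service
# mentioned above.
import requests

def registration_agency(doi):
    """Return the registration agency (e.g. "Crossref", "DataCite", "Airiti") for a DOI."""
    response = requests.get("http://doi.crossref.org/doiRA/" + doi)
    return response.json()[0].get("RA", "unknown")

print(registration_agency("10.6165/tai.1998.43(2).150"))   # Airiti
print(registration_agency("10.5281/zenodo.30012"))         # DataCite
```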

However, now things get a little messy. From 2015, PDFs for Revue Suisse de Zoologie are being uploaded to Zenodo, and are getting DataCite DOIs there (e.g., doi:10.5281/zenodo.30012). This means that the most recent articles for this journal will not have CrossRef DOIs. From my perspective, this is a disappointing move. It removes the journal from the CrossRef ecosystem at a time when the uptake of CrossRef DOIs for taxonomic journals is at an all-time high (both ZooKeys and Zootaxa have CrossRef DOIs), and now BHL is starting to issue CrossRef DOIs for the "legacy" literature (bear in mind that "legacy" in this context can mean articles published last year).

I've rehearsed the reasons why I think CrossRef DOIs are best elsewhere, but the key points are that articles are much easier to discover (e.g., using http://search.crossref.org), and are automatically first-class citizens of the academic literature. However, not everybody buys these arguments.

Maybe a way forward is to treat the two types of DOI as identifying two different things. The CrossRef DOI identifies the article, not a particular representation. The Zenodo DOI (or any DataCite DOI) for a PDF identifies that representation (i.e., the PDF), not the article.

Having CrossRef and DataCite DOIs coexist

This would enable CrossRef and Zenodo DOIs to coexist, providing we have some way of describing the relationship between the two kinds of DOI (e.g., CrossRef DOI - hasRepresentation -> Zenodo DOI).
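
As a hypothetical sketch of what such a link could look like as data (the property name "hasRepresentation" is just the suggestion above, not a term from an established vocabulary, and the CrossRef DOI is a placeholder):

```python
# A hypothetical sketch of recording the article/representation split.
# The CrossRef DOI below is a placeholder; the Zenodo DOI is the example
# mentioned in the text.
links = [
    ("doi:10.XXXX/article-doi-placeholder", "hasRepresentation", "doi:10.5281/zenodo.30012"),
]

def representations(article_doi, links):
    """Return the representation-level DOIs (e.g. PDFs on Zenodo) for an article-level DOI."""
    return [target for source, relation, target in links
            if source == article_doi and relation == "hasRepresentation"]

print(representations("doi:10.XXXX/article-doi-placeholder", links))
```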

This would give freedom to those who want the biodiversity literature to be part of the wider CrossRef community to mint CrossRef DOIs. It gives those articles the benefits that come with CrossRef DOIs (findability, being included in lists of literature cited, citation statistics, customer support when DOIs break, altmetrics, etc.).

It would also enable those who want to ensure stable access to the contents of the biodiversity literature to use archives such as Zenodo, and have the benefits of those DOIs (stability, altmetrics, free file storage and free DOIs).

Having multiple DOIs for the same thing is, I'd argue, at the very least unhelpful. But if we tease apart the notion of what we are identifying, maybe they can coexist. Otherwise I think we are in danger of making choices that, while they seem locally optimal (e.g., free storage and minting of DOIs), may in the long run cause problems and run counter to the goal of making the taxonomic literature as findable as the wider literature.

Friday, September 11, 2015

Possible project: natural language queries, or answering "how many species are there?"

Google knows how many species there are. More significantly, it knows what I mean when I type in "how many species are there". Wouldn't it be nice to be able to do this with biodiversity databases? For example, how many species of insect are found in Fiji? How would you answer this question? I guess you'd Google it, looking for a paper. Or you'd look in vain on GBIF, and then end up hacking some API queries to process data and come up with an estimate. Why can't we just ask?

On the face of it, natural language queries are hard, but there's been a lot of work done in this area. Furthermore, there's a nice connection with the idea of knowledge graphs. One approach to natural language parsing is to convert a natural language query into a path in a knowledge graph (or, if you're Facebook, the social graph). Facebook has some nice posts describing how their graph search works (e.g., Under the Hood: Building out the infrastructure for Graph Search), and there's a paper describing some of the infrastructure (e.g., "Unicorn: a system for searching the social graph" doi:10.14778/2536222.2536239, get the PDF here).

Natural language queries can seem potentially unbounded, in the sense that the user could type in anything. But there are ways to constrain this, and ways to anticipate what the user is after. For example, Google suggests what you may be after, which gives us clues as to the sort of questions we'd need answers for. It would be a fun exercise to use Google suggest to discover what questions people are asking about biodiversity, then determine what it would take to be able to answer them.
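
As a quick sketch of that exercise (the suggest endpoint used here is unofficial and undocumented, so treat it purely as an illustration):

```python
# Mine Google's query completions for biodiversity questions. The suggest
# endpoint below is unofficial and undocumented, so this is an illustration
# rather than a supported API.
import requests

def suggestions(prefix):
    """Return Google's query completions for a given prefix."""
    response = requests.get(
        "http://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": prefix})
    return response.json()[1]   # the response is [query, [completion, completion, ...]]

for question in suggestions("how many species of"):
    print(question)
```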

All very sensible questions that existing biodiversity databases would struggle to answer.

There's a nice presentation by Kenny Bastani where he tackles the problem of bounding the set of possible questions by first generating the questions for which he has answers, then caching those so that the user can select from them (using, for example, a type-ahead interface).

Hence, we could generate species counts for all major and/or charismatic taxa for each country, habitat type (or other meaningful category), then generate the corresponding query (e.g., "how many species of birds are there in Fiji", where the highlighted terms "birds" and "Fiji" are the things we replace for each query).
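
A minimal sketch of that approach (the taxon and place lists, and the count_species placeholder, are invented for illustration; in practice the counts would come from a database or knowledge graph):

```python
# Generate the answerable questions ahead of time, so a type-ahead interface can
# offer them to the user. count_species() is a placeholder for whatever query
# (GBIF, a local database, a knowledge graph) actually produces the number.

TAXA = ["birds", "insects", "mammals"]
PLACES = ["Fiji", "New Zealand", "Madagascar"]

def count_species(taxon, place):
    """Placeholder for the real count query."""
    return 0

def canned_questions():
    """Return a {question: answer} dictionary covering every taxon/place pair."""
    return {
        "how many species of {0} are there in {1}".format(taxon, place):
            count_species(taxon, place)
        for taxon in TAXA for place in PLACES
    }

for question in sorted(canned_questions()):
    print(question)
```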

One reason this topic appeals is that it is intimately linked to the idea of a biodiversity knowledge graph, in that answers to a number of questions in biodiversity can be framed as paths in that graph. So, if we build the graph we should also be asking about ways to query it. In particular, how do we answer the most basic questions about the information we are aggregating in myriad databases?

Monday, September 07, 2015

Wikidata, Wikipedia, and #wikisci

Last week I attended the Wikipedia Science Conference (hashtag: #wikisci) at the Wellcome Trust in London. It was an interesting two days of talks and discussion. Below are a few random notes on topics that caught my eye.

What is Wikidata?

A recurring theme was the emergence of Wikidata, although it never really seemed clear what role Wikidata saw for itself. On the one hand, it seems to have a clear purpose:
Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others.
At other times there was a sense that Wikidata wanted to take any and all data, which it doesn't really seem geared up to do. The English-language Wikipedia has nearly 5 million articles, but there are lots of scientific databases that dwarf that in size (we have at least that many taxonomic names, for example). So, when Dario Taraborelli suggests building a repository of all citations with Wikidata, does he really mean ALL citations in the academic literature? CrossRef alone has 75M DOIs, whereas Wikidata currently has 14.8M pages, so we are talking about greatly expanding the size of Wikidata with just one type of data.

The sense I get is that Wikidata will have an important role in (a) structuring data in Wikipedia, and (b) providing tools for people to map their data to the equivalent topics in Wikipedia. Both are very useful goals. What I find less obvious is whether (and if so, how) Wikidata aims to be a more global database of facts.

How do you Snapchat? You just Snapchat

As a relative outsider to the Wikipedia community, and having had a sometimes troubled experience with Wikipedia, it struck me how opaque things are if you are an outsider. I suspect this is true of most communities: if you are a member then things seem obvious; if you're not, it takes time to find out how things are done. Wikipedia is a community with nobody in charge, which is a strength, but can also be frustrating. The answer to pretty much any question about how to add data to Wikidata, how to add data types, etc. was "ask the community". I'm reminded of the American complaint about the European Union: "if you pick up the phone to call Europe, who do you call?". In order to engage you have to invest time in discovering the relevant part of the community, and then learn how to engage with it. This can be time consuming, and is a different approach to either having to satisfy the requirements of gatekeepers, or a decentralised approach where you can simply upload whatever you want.

Streams

It seems that everything is becoming a stream. Once the volume of activity reaches a certain point, people don't talk about downloading static datasets, but instead about consuming a stream of data (very like the Twitter firehose). The volume of Wikipedia edits means that researchers studying the growth of Wikipedia are now consuming streams. Geoffrey Bilder of CrossRef showed some interesting visualisations of real-time streams of DOIs being added as users edit Wikipedia pages (CrossRef DOI Events for Wikimedia), and Peter Murray-Rust of ContentMine seemed to imply that ContentMine is going to generate streams of facts (rather than, say, a queryable database of facts). Once we get to the stage of having large, transient volumes of data, all sorts of issues about reanalysis and reproducibility arise.

CrossRef and evidence

One of the other striking visualisations that CrossRef have is the DOI Chronograph, which displays the number of CrossRef DOI resolutions by the domain of the hosting web site. In other words, if you are on a Wikipedia page and click on a DOI for an article, that's recorded as a DOI resolution from the domain "wikipedia.org". For the period 1 October 2010 to 1 May 2015 Wikipedia was the source of 6.8 million clicks on DOIs, see http://chronograph.labs.crossref.org/domains/wikipedia.org. One way to interpret this is that it's a measure of how many people are accessing the primary literature - the underlying evidence - for assertions made on Wikipedia pages. We can compare this with results for, say, biodiversity informatics projects. For example, EOL has 585(!) DOI clicks for the period 15 October 2010 to 30 April 2015. There are all sorts of reasons for the difference between these two sites, not least that Wikipedia has vastly more traffic than EOL. But I think it also reflects the fact that many Wikipedia articles are richly referenced with citations to the primary literature, whereas projects like EOL are very poorly linked to that literature. Indeed, most biodiversity databases are divorced from the evidence behind the data they display.

Diversity and a revolution led by greybeards

"Diversity" is one of those words that has become politicised, and attempts to promote "diversity" can get awkward ("let's hear from the women", that homogeneous category of non-men). But the aspect of diversity that struck me was age-related. In discussions that involved fellow academics, invariably they looked a lot like me - old(ish), relatively well-established and secure in their jobs (or post-job). This is a revolution led not by the young, but by the greybeards. That's a worry. Perhaps it's a reflection of the pressures on young or early-stage scientists to get papers into high-impact factor journals, get grants, and generally play the existing game, whereas exploring new modes of publishing, output, impact, and engagement have real risks and few tangible rewards if you haven't yet established yourself in academia.

Wednesday, September 02, 2015

Hypothes.is revisited: annotating articles in BioStor

Over the weekend, out of the blue, Dan Whaley commented on an earlier blog post of mine (Altmetrics, Disqus, GBIF, JSTOR, and annotating biodiversity data). Dan is the project lead for hypothes.is, a tool to annotate web pages. I was a bit dismissive, as hypothes.is falls into the "sticky note" camp of annotation tools, which I've previously been critical of.

However, I decided to take another look at hypothes.is, and it looks like a great fit for another annotation problem I have, namely augmenting and correcting OCR text in BioStor (and, by extension, BHL). For a subset of BioStor I've been able to add text to the page images, so you can select that text as you would on a web page or in a PDF with searchable text. And if you can select text, you can annotate it using hypothes.is. I then discovered that not only is hypothes.is available as a Chrome extension (which immediately limits who will use it), you can also add it to any web site that you publish. So, as an experiment, I've added it to BioStor, so that people can comment on BioStor using any modern browser.

So far, so good, but the problem is I'm relying on the "crowd" to come along and manually annotate the text. But I have code that can take text and extract geographic localities (e.g., latitude and longitude pairs), specimen codes, and taxonomic names. What I'd really like to do is be able to pre-process the text, locate these features, then programmatically add those as annotations. Who wants to do this manually when a computer can do most of it?
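
For example, here is a rough sketch of the locality-extraction step (real OCR text needs many more patterns: degrees/minutes/seconds, OCR-mangled symbols, and so on, so this regular expression is illustrative only):

```python
# Pull out latitude/longitude pairs written in decimal degrees (e.g. "17.6S, 177.4E").
# This is a deliberately simple pattern for illustration; real OCR text needs
# many more variants.
import re

COORD = re.compile(
    r"(\d{1,2}(?:\.\d+)?)\s*°?\s*([NS])[,;\s]+(\d{1,3}(?:\.\d+)?)\s*°?\s*([EW])")

def find_localities(text):
    """Yield (latitude, longitude) pairs as signed decimal degrees."""
    for lat, ns, lon, ew in COORD.findall(text):
        latitude = float(lat) * (-1 if ns == "S" else 1)
        longitude = float(lon) * (-1 if ew == "W" else 1)
        yield latitude, longitude

print(list(find_localities("collected at 17.6S, 177.4E and at 1.3 N 103.8 E")))
```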

Hypothes.is, it turns out, has an API that, while a bit *cough* skimpy on documentation, enables you to add annotations to text. So now I could pre-process the text, and just ask people to add things that have been missed, or flag errors in the automated annotations.
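
Here is a sketch of what posting such an annotation might look like. The endpoint and payload follow the hypothes.is API as I understand it; the token, quoted text, and note are placeholders, and the details should be checked against the current documentation:

```python
# Create an annotation via the hypothes.is API. The token is a placeholder,
# and the quoted text and note are invented for illustration.
import requests

API = "https://api.hypothes.is/api/annotations"
TOKEN = "YOUR-API-TOKEN"   # placeholder: a developer token from a hypothes.is account

def annotate(uri, quote, note):
    """Attach a note to a piece of quoted text on a page."""
    payload = {
        "uri": uri,
        "text": note,                              # the annotation body (Markdown)
        "target": [{
            "source": uri,
            "selector": [{"type": "TextQuoteSelector", "exact": quote}],
        }],
    }
    response = requests.post(
        API, json=payload, headers={"Authorization": "Bearer " + TOKEN})
    return response.json()

annotate("http://biostor.org/reference/147608/page/1",
         "some quoted locality text",
         "Point locality: [view on map](http://maps.google.com/?q=-17.6,177.4)")
```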

This is all still very preliminary, but as an example here's a screen shot of a page in BioStor together with geographic annotations displayed using hypothes.is (you can see this live at http://biostor.org/reference/147608/page/1; make sure you click on the widgets at the top right of the page to see the annotations).


The page shows two point localities that have been extracted from the text, together with a static Google Map showing the localities (hypothes.is supports Markdown in annotations, which enables links and images to be embedded).

Not only can we write annotations, we can also read them, so if someone adds an annotation (e.g., highlights a specimen code that was missed, or some text that OCR has messed up) we could retrieve that and, for example, index the corrected text to improve findability.

Lots still to do, but these early experiments are very encouraging.

Tuesday, September 01, 2015

Dark taxa, drones, and Dan Janzen: 6th International Barcode of Life Conference

A little over a week ago I was at the 6th International Barcode of Life Conference, held at Guelph, Canada. It was my first barcoding conference, and was quite an experience. Here are a few random thoughts.

Attendees

It was striking how diverse the conference crowd was. Apart from a few ageing systematists (including veterans of the cladistics wars), most people were young(ish), and from all over the world. There's clearly something about the simplicity and low barrier to entry of barcoding that has enabled its widespread adoption. This also helps give barcoding a cohesion: no matter what the taxonomic group or the problem you are tackling, you are doing much the same thing as everybody else (but see below). While ageing systematists (like myself) may hold their noses regarding the use of a single, short DNA sequence and a tree-building method some would dismiss as "phenetic", in many ways the conference was a celebration of global-scale phylogeography.

Standards aren't enough

And yet, standards aren't enough. I think what contributes to DNA barcoding's success is that sequences are computable. If you have a barcode, there's already a bunch of barcode sequences you can compare yours to. As others add barcodes, your sequences will be included in subsequent analyses, analyses which may help resolve the identity of what you sequenced.

To put this another way, we have standard image file formats, such as JPEG. This means you can send me a bunch of files, safe in the knowledge that because JPEG is a standard I will be able to open those files. But this doesn't mean that I can do anything useful with them. In fact, it's pretty hard to do anything with images apart from look at them. But if you send me a bunch of DNA sequences for the same region, I can build a tree, BLAST GenBank for similar sequences, etc. Standards aren't enough by themselves; to get the explosive growth that we see in barcodes, the thing you standardise on needs to be easy to work with, and have a computational infrastructure in place.

Next generation sequencing and the hacker culture

Classical DNA barcoding for animals uses a single, short mtDNA marker that people were sequencing a couple of decades ago. Technology has moved on, such that we're seeing papers such as An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. As I've argued earlier (Is DNA barcoding dead?) this misses the point about the power of standardisation on a simple, scalable method.

At the same time, it was striking to see the diversity of sequencing methods being used in conference presentations. Barcoding is a broad church, and it seemed like it was a natural home for people interested in environmental DNA. There was excitement about technologies such as the Oxford Nanopore MinION™, with people eager to share tips and techniques. There's something of a hacker culture around sequencing (see also Biohackers gear up for genome editing), just as there is for computer hardware and software.

Community

The final session of the conference started with some community bonding, complete with Paul Hebert versus Quentin Wheeler wielding light sabres. If, like me, you weren't a barcoder, things started getting a little cult-like. But there's no doubt that Paul's achievement in promoting a simple approach to identifying organisms, and then translating that into a multi-million dollar, international endeavour, is quite extraordinary.

After the community bonding, came a wonderful talk by Dan Janzen. The room was transfixed as Dan made the case for conservation, based on his own life experiences, including Area de Conservación Guanacaste where he and Winnie Hallwachs have been involved since the 1970s. I sat next to Dan at a dinner after the conference, and showed him iNaturalist, a great tool for documenting biodiversity with your phone. He was intrigued, and once we found pictures taken near his house in Costa Rica, he was able to identify the individual animals in the photos, such as a bird that has since been eaten by a snake.

Dark taxa

My own contribution to the conference was a riff on the notion of dark taxa, and mostly consisted of me trying to think through how to respond to DNA barcoding. The three responses to barcoding that I came up with are:
  1. By comparison to barcoding, classical taxonomy is digitally almost invisible, with great chunks of the literature still not scanned or accessible. So, one response is to try and get the core data of taxonomy digitised and linked as fast as possible. This is why I built BioStor and BioNames, and why I continually rant about the state of efforts to digitise taxonomy.
  2. This is essentially President Obama's "bucket" approach: maybe the sane thing to do is to see barcoding as the future and envisage what we could do in a sequence-only world. This is not to argue that we should ignore the taxonomic literature as such, but rather let's start with sequences first and see what we can do. This is the motivation for my Displaying a million DNA barcodes on Google Maps using CouchDB, and my experiments with Visualising Geophylogenies in Web Maps Using GeoJSON. These barely scratch the surface of what can be done.
  3. The third approach is to explore how we integrate taxonomy and barcoding at global scale, in which case linking at the specimen level (rather than, say, using taxonomic names) seems promising, albeit requiring a massive effort to reconcile multiple specimen identifiers.

Summary

Yes, the barcoding conference was that rare thing, a well organised (including well-fed), interesting, indeed eye-opening, conference.