Thursday, August 03, 2023

The problem with GBIF's Phylogeny Explorer

How to cite: Page, R. (2023). The problem with GBIF’s Phylogeny Explorer. https://doi.org/10.59350/v0bt3-zp114

GBIF recently released the Phylogeny Explorer, using legumes as an example dataset. The goal is to enables users to “view occurrence data from the GBIF network aligned to legume phylogeny.” The screenshot below shows the legume phylogeny side-by-side with GBIF data.

Now, I’m all in favour of integrating phylogenies and occurrence data, and I have a lot of respect for the people behind this project (Morten Høfft and Thomas Stjernegaard Jeppesen), but I think this way of displaying a phylogeny has multiple problems. Indeed, it suffers from many of the classic “mistakes” people make when trying to view big trees.

Why maps work

Tree visualisation is a challenging problem. I wrote a somwhwat out of date review on this topic a decade ago, and Googling will find many papers on the topic. There is also the amazing treevis.net.

I think the key issues can be seen once we compare the tree on the left with the map on the right. The map allows zooming in and out, and it does this equally in both the x and y dimensions. In other words, when you zoom in the map expands left to right and top to bottom. This makes sense because the map is a square. Obviously the Earth is not a square, but the projection used by web maps (such as Google Maps, OpenStreetMap, etc.) treats the world as one. Below is the world at zoom level 0, a 256 x 256 pixel square.

When you zoom in the number of tiles is doubled with each increase in zoom level, and you get a more and more detailed map. As you zoom in on a map typically you see labels appearing and disappearing. These labels are (a) always legible, and (b) they change with zoom level. Continent names appear before cities, but disappear once you’ve zoomed in to country level or below.

To summarise, the map visualisation zooms appropriately, always has legible labels, and the level of detail and labelling changes with zoom level. None of this is true for the GBIF phylogeny viewer.

The phylogeny problem

The screenshot below shows GBIF’s display of the legume tree such that the whole tree fits into the window. No labels are visible, and the tree structure is hard to see. There are no labels for major groups, so we have no obvious way to find our way around the tree.

We can zoom so that we can see the labels, but everything is zoomed, such that we can’t see all the tree structure to the left.

Indeed, if we zoom in more we rapidly lose sight of most of the tree.

This is one of the challenges presented by trees. Like space, they are mostly empty. hence simply zooming in is often not helpful.

So, the zooming doesn’t correspond to the structure of the tree, labels are often either not legible or absent, and levels of detail don’t change with zooming in and out.

What can we do differently?

I’m going to sketch an alternative approach to viewing trees like this. I have some ropey code that I’ve used to create the diagrams below. This isn’t ready for prime time, but hopefully illustrates the idea. The key concept is that we zoom NOT by simply expanding the viewing area in the x and y direction, but by collapsing and expanding the tree. Each zoom level corresponds the number of nodes we will show in the tree. We use a criterion to rank the importance of each node in the tree. One approach is how “distinctive” the nodes are, see Jin et al. 2009. We then use a priority queue to chose the nodes to display at a given zoom level (see Libin et al. 2017 and Zaslavsky et al. 2007).

Arguably this gives us a more natural way to zoom a tree, we see the main structure first, then as we zoom in more structure becomes apparent. It turns out if the tree drawing itself is constructed using a “in-order” traversal we can greatly simplify the drawing. Imagine that the tree consists of a number of nodes (both internal and external, i.e., leaves and hypothetical ancestors), and we draw each node on a single line (as if we were using a line printer). Collapsing or expanding the tree is simply a matter of removing or adding lines. If a node is not visible we don’t draw it. If a leaf node is visible we show it as if the whole tree was visible. Internal nodes are slightly different. If it is visible but collapsed we can draw it with a triangle representing the descendants, if it is not collapsed then we draw it as if the whole tree was visible. The end result is that we don’t need to recompute the tree as we zoom in or out, we simply compute which nodes to show, and in what state.

As an experiment I decided to explore the legume tree used in the GBIF website. As is sadly so typical, the original publication of the tree (Ringelberg et al. 2023) doesn’t provide the actual tree, but I found a JSON version on GitHub https://github.com/gbif/hp-legume/tree/master/assets/phylotree. I then converted that to Newick format so my tools could use it (had a few bumpy moments when I discovered that the tree has negative branch lengths!). The converted file is here: https://gist.github.com/rdmpage/ef43ea75a738e75ec303602f76bf0b2e

I then ran the tree through my code and generated views at various zoom levels.

Note that as the tree expands labels are always legible, and zooming only increased the size of the tree in the y-axis (as the expanded nodes take up more space). Note also that we see a number of isolated taxa appearing, such as Lachesiodendron viridiflorum. These taxa are often of evolutionary interest, and also of high conservation interest due to their phylogenetic isolation. Simply showing the whole tree hides these taxa.

Now, looking at these two diagrams there are two obvious limitations. The first is that the black triangles representing collapsed clades are all the same size regardless of whether they represent a few of many taxa. This could be addressed by adding numbers beside each triangle, using colour to reflect the numebr of collapsed nodes, or perhaps by breaking the “one node per row” rule by drawing particularly large nodes over two or more lines.

The other issue is that most of the triangles lack labels. This is because the tree itself lacks them (I added “Ingoid clade”, for example). There will be lots of nodes which can be labelled (e.g., by genus name), but once we start displaying phylogeny we will need to make use of informal names, or construct labels based on the descendants (e.g., “genus 1 - genus 5”). We can also think of having sets of labels that we locate on the tree by finding the least common ancestor (AKA the most recent common ancestor) of that label (hello Phylocode).

Another consideration is what to do with labels as taxa are expanded?. One approach would be to use shaded regions, for example in the last tree above we could shade the clades rooted at Mimosa, Vachellia, and the “Ingoid clade” (and others if they had labels). If we were clever we could alter which clades are shaded based on the zoom level. If we wanted these regions to not overlap (for example, if we wanted bands of colour corresponding to clades to appear on the right of the tree) then we could use something like maximum disjoint sets to choice the best combination of labels.

Summary

I don’t claim that this alternative visualisation is perfect (and my implementation of it is very far from perfect). but I think it shows that there are ways we can zoom into trees that reflects tree structure, ensures labels are always legible, and that supports levels of detail (collapsed nodes expanding as we zoom). The use of inorder traversal and three styles of node drawing mean that the diagram is simple to render. We don’t need fancy graphics, we can simply have a list of images.

To conclude, I think it’s great GBIF is moving to include phylogenies. But we can't visualise phylogeny as a static image, it's a structure that requires us to think about how to display it with the same level of creativity that makes web maps such a successful visualisation.

Reading

Jin Chen, MacEachren, A. M., & Peuquet, D. J. (2009). Constructing Overview + Detail Dendrogram-Matrix Views. IEEE Transactions on Visualization and Computer Graphics, 15(6), 889–896. https://doi.org/10.1109/tvcg.2009.130

Libin, P., Vanden Eynden, E., Incardona, F., Nowé, A., Bezenchek, A., … Sönnerborg, A. (2017). PhyloGeoTool: interactively exploring large phylogenies in an epidemiological context. Bioinformatics, 33(24), 3993–3995. doi:10.1093/bioinformatics/btx535

Page, R. D. M. (2012). Space, time, form: Viewing the Tree of Life. Trends in Ecology & Evolution, 27(2), 113–120. https://doi.org/10.1016/j.tree.2011.12.002)

Ribeiro, P. G., Luckow, M., Lewis, G. P., Simon, M. F., Cardoso, D., De Souza, É. R., Conceição Silva, A. P., Jesus, M. C., Dos Santos, F. A. R., Azevedo, V., & De Queiroz, L. P. (2018). lachesiodendron , a new monospecific genus segregated from piptadenia(Leguminosae: Caesalpinioideae: mimosoid clade): evidence from morphology and molecules. TAXON, 67(1), 37–54. https://doi.org/10.12705/671.3

Ringelberg, J. J., Koenen, E. J. M., Sauter, B., Aebli, A., Rando, J. G., Iganci, J. R., De Queiroz, L. P., Murphy, D. J., Gaudeul, M., Bruneau, A., Luckow, M., Lewis, G. P., Miller, J. T., Simon, M. F., Jordão, L. S. B., Morales, M., Bailey, C. D., Nageswara-Rao, M., Nicholls, J. A., … Hughes, C. E. (2023). Precipitation is the main axis of tropical plant phylogenetic turnover across space and time. Science Advances, 9(7), eade4954. https://doi.org/10.1126/sciadv.ade4954

Zaslavsky L., Bao Y., Tatusova T.A. (2007) An Adaptive Resolution Tree Visualization of Large Influenza Virus Sequence Datasets. In: Măndoiu I., Zelikovsky A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_18

Written with StackEdit.