Thursday, August 31, 2023

Document layout analysis

How to cite: Page, R. (2023). Document layout analysis.

Some notes to self on document layout analysis.

I’m revisiting the problem of taking a PDF or a scanned document and determining its structure (for example, where is the title, abstract, bibliography, where are the figures and their captions, etc.). There are lots of papers on this topic, and lots of tools. I want something that I can use to process both born-digital PDFs and scanned documents, such as the ABBYY, DjVu and hOCR files on the Internet Archive. PDFs remain the dominant vehicle for publishing taxonomic papers, and aren’t going away any time soon (see Pettifer et al. for a nuanced discussion of PDFs).

There are at least three approaches to document layout analysis.


The simplest approach is to come up rules, such as “if the text is large and it’s on the first page, it’s the title of the article”. Examples of more sophisticated rules are given in Klampfl et al., Ramakrishnan et al., and Lin. Rule-based methods can get you a long way, as shown by projects such as Plazi. But there are always exceptions to rules, and so the rules need constant tweaking. At some point it makes sense to consider probabilistic methods that allow for uncertainty, and which can also “learn”.

Large language models (LLMs)

At the other extreme are Large language models (LLMs), which have got a lot of publicity lately. There are a number of tools that use LLMs to help extract information from documents, such as LayoutLM (Xu et al.), Layout Parser, and VILA (Shen et al.). These approaches encode information about a document (in some case including the (x,y) coordinates of individual words on a page) and try and infer which category each word (or block of text) belongs to. These methods are typically coded in Python, and come with various tools to display regions on pages. I’ve had variable success getting these tools to work (I am new to Python, and am also working on a recent Mac which is not the most widely used hardware for machine learning). I have got other ML tools to work, such as an Inception-based model to classify images (see Adventures in machine learning: iNaturalist, DNA barcodes, and Lepidoptera), but I’ve not succeeded in training these models. There are obscure Python error messages, some involving Hugging Face, and eventually my patience wore out.

Another aspect of these methods is that they often package everything together, such that they take a PDF, use OCR or ML methods such as Detectron to locate blocks, then encode the results and feed them to a model. This is great, but I don’t necessarily want the whole package, I want just some parts of it. Nor does the prospect of lengthy training appeal (even if I could get it to work properly).

The approach that appealed the most is VILA, which doesn’t use (x,y) coordinates directly but instead encodes information about “blocks” into text extracted from a PDF, then uses an LLM to infer document structure. There is a simple demo at Hugging Face. After some experimentation with the code, I’ve ended up using the way VILA represents a document (a JSON file with a series of pages, each with lists of words, their positions, and information on lines, blocks, etc.) as the format for my experiments. If nothing else this means that if I go back to trying to train these models I will have data already prepared in an appropriate format. I’ve also decided to follow VILA’s scheme for labelling words and blocks in a document:

  • Title
  • Author
  • Abstract
  • Keywords
  • Section
  • Paragraph
  • List
  • Bibliography
  • Equation
  • Algorithm
  • Figure
  • Table
  • Caption
  • Header
  • Footer
  • Footnote

I’ve tweaked this slightly by adding two additional tags
from VILA’s Labeling Category Reference, the “semantic” tags “Affiliation” and “Venue”. This helps separate information on author names (“Author”) from their affiliations, which can appear in very different positions to the author’s names. “Venue” is useful to label things such as a banner at the top of an article where the publisher display the name of the journal, etc.

Conditional random fields

In between masses of regular expressions and large language models are approaches such as Conditional random fields (CRFs), which I’ve used before to parse citations (see Citation parsing tool released). Well known tools such as GROBID use this approach.

CRFs are fast, and somewhat comprehensible. But it does require Feature engineering, that is, you need to come up with features of the data to help train the model (for the systematists among you, this is very like coming up with characters for a bunch of taxa). This is were you can reuse the rules developed in a rules-based approach, but instead of having the rules make decisions (e.g., “big text = Title”), you just a rule that detects whether text is big or not, and the model combined with training data then figures out if and when big text means “Title”. So you end up spending time trying to figure out how to represent document structure, and what features help the model get the right answer. For example, methods such as Lin’s for detecting whether there are recurring elements in a document are great source of features to help recognise headers and footers. CRFs also make it straightforward to include dependencies (the “conditional” in the name). For example, a bibliography in a paper can be recognised not just by a line having a year in it (e.g., “2020”), but there being nearby lines that also have years in them. This helps us avoid labelling isolated lines with years as “Bibliography” when they are simply text in a paragraph that mentions a year.

Compared to LLMs this a lot of work. In principle with an LLM you “just” take a lot of training data (e.g., text and location on a page) and let the model to the hard work of figuring out which bit of the document corresponds to which category (e.g., title, abstract, paragraph, bibliography). The underlying model has already been trained on (potentially) vast amounts of text (and sometimes also word coordinates). But on the plus side, training CRFs is very quick, and hence you can experiment with adding or removing features, adding training data, etc. For example, I’ve started training with about ten (10) documents, training takes seconds, and I’ve got serviceable results.

Lots of room for improvement, but there’s a constant feedback loop of seeing improvements, and thinking about how to tweak the features. It also encourages me to think about what went wrong.

Problems with PDF parsing

To process PDFs, especially “born digital” PDFs I rely on pdf2xml, originally written by Hervé Déjean (Xerox Research Centre Europe). It works really well, but I’ve encountered a few issues. Some can be fixed by adding more fonts to my laptop (from XpdfReader), but others are more subtle.

The algorithm used to assign words to “blocks” (e.g., paragraphs) seems to struggle with superscripts (e.g., 1), which often end up being treated as separate blocks. This breaks up lines of text, and also makes it harder to accurately label parts of the document such as “Author” or “Affiliation”.

Figures can also be problematic. Many are simply bitmaps embedded in a PDF and can be easily extracted, but sometimes labelling on those bitmaps, or indeed big chunks of vector diagrams are treated as text, so we end up with story text blocks in odd positions. I need to spend a little time thinking about this as well. I also need to understand the “vet” format pdftoxml extracts from PDFs.

PDFs also have all sorts of quirks, such as publishers slapping cover pages on the front, which make feature engineering hard (the biggest text might now be not be the title but some cruff from the publisher). Sometimes there are clues in the PDF that it has been moodier.! For example, ResearchGate inserts a “rgid” tag in the PDF when it adds a cover page.

Yes but why?

So, why I am doing this? Why battle with the much maligned PDF format. It’s because a huge chunk of taxonomic and other information is locked up in PDFs, and I’d like a simpler, scalable, way to extract some of that. Plazi is obviously the leader in this are in terms of the amount of information they have extracted, but their approach is labour-intensive. I want something that is essentially automatic, that can be trained to handle the idiosyncracities of the taxonomic literature, and can be applied to both born digital PDFs and OCR from scans in the Biodiversity Heritage Library and elsewhere. Even if we could simply extract bibliographic information (to flesh out the citation graph) and the figures, that would be progress.


Déjean H, Meunier J-L (2006) A System for Converting PDF Documents into Structured XML Format. In: Bunke H, Spitz AL (eds) Document Analysis Systems VII. Springer, Berlin, Heidelberg, pp 129–140

Klampfl S, Granitzer M, Jack K, Kern R (2014) Unsupervised document structure analysis of digital scientific articles. Int J Digit Libr 14(3):83–99.

Lin X (2003) Header and footer extraction by page association. In: Document Recognition and Retrieval X. SPIE, pp 164–171

Pettifer S, McDERMOTT P, Marsh J, Thorne D, Villeger A, Attwood TK (2011) Ceci n’est pas un hamburger: modelling and representing the scholarly article. Learned Publishing 24(3):207–220.

Ramakrishnan C, Patnia A, Hovy E, Burns GA (2012) Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine 7(1):7.

Shen Z, Lo K, Wang LL, Kuehl B, Weld DS, Downey D (2022) VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups. Transactions of the Association for Computational Linguistics 10:376–392.

Xu Y, Li M, Cui L, Huang S, Wei F, Zhou M (2020) LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp 1192–1200

Written with StackEdit.

Thursday, August 03, 2023

The problem with GBIF's Phylogeny Explorer

How to cite: Page, R. (2023). The problem with GBIF’s Phylogeny Explorer.

GBIF recently released the Phylogeny Explorer, using legumes as an example dataset. The goal is to enables users to “view occurrence data from the GBIF network aligned to legume phylogeny.” The screenshot below shows the legume phylogeny side-by-side with GBIF data.

Now, I’m all in favour of integrating phylogenies and occurrence data, and I have a lot of respect for the people behind this project (Morten Høfft and Thomas Stjernegaard Jeppesen), but I think this way of displaying a phylogeny has multiple problems. Indeed, it suffers from many of the classic “mistakes” people make when trying to view big trees.

Why maps work

Tree visualisation is a challenging problem. I wrote a somwhwat out of date review on this topic a decade ago, and Googling will find many papers on the topic. There is also the amazing

I think the key issues can be seen once we compare the tree on the left with the map on the right. The map allows zooming in and out, and it does this equally in both the x and y dimensions. In other words, when you zoom in the map expands left to right and top to bottom. This makes sense because the map is a square. Obviously the Earth is not a square, but the projection used by web maps (such as Google Maps, OpenStreetMap, etc.) treats the world as one. Below is the world at zoom level 0, a 256 x 256 pixel square.

When you zoom in the number of tiles is doubled with each increase in zoom level, and you get a more and more detailed map. As you zoom in on a map typically you see labels appearing and disappearing. These labels are (a) always legible, and (b) they change with zoom level. Continent names appear before cities, but disappear once you’ve zoomed in to country level or below.

To summarise, the map visualisation zooms appropriately, always has legible labels, and the level of detail and labelling changes with zoom level. None of this is true for the GBIF phylogeny viewer.

The phylogeny problem

The screenshot below shows GBIF’s display of the legume tree such that the whole tree fits into the window. No labels are visible, and the tree structure is hard to see. There are no labels for major groups, so we have no obvious way to find our way around the tree.

We can zoom so that we can see the labels, but everything is zoomed, such that we can’t see all the tree structure to the left.

Indeed, if we zoom in more we rapidly lose sight of most of the tree.

This is one of the challenges presented by trees. Like space, they are mostly empty. hence simply zooming in is often not helpful.

So, the zooming doesn’t correspond to the structure of the tree, labels are often either not legible or absent, and levels of detail don’t change with zooming in and out.

What can we do differently?

I’m going to sketch an alternative approach to viewing trees like this. I have some ropey code that I’ve used to create the diagrams below. This isn’t ready for prime time, but hopefully illustrates the idea. The key concept is that we zoom NOT by simply expanding the viewing area in the x and y direction, but by collapsing and expanding the tree. Each zoom level corresponds the number of nodes we will show in the tree. We use a criterion to rank the importance of each node in the tree. One approach is how “distinctive” the nodes are, see Jin et al. 2009. We then use a priority queue to chose the nodes to display at a given zoom level (see Libin et al. 2017 and Zaslavsky et al. 2007).

Arguably this gives us a more natural way to zoom a tree, we see the main structure first, then as we zoom in more structure becomes apparent. It turns out if the tree drawing itself is constructed using a “in-order” traversal we can greatly simplify the drawing. Imagine that the tree consists of a number of nodes (both internal and external, i.e., leaves and hypothetical ancestors), and we draw each node on a single line (as if we were using a line printer). Collapsing or expanding the tree is simply a matter of removing or adding lines. If a node is not visible we don’t draw it. If a leaf node is visible we show it as if the whole tree was visible. Internal nodes are slightly different. If it is visible but collapsed we can draw it with a triangle representing the descendants, if it is not collapsed then we draw it as if the whole tree was visible. The end result is that we don’t need to recompute the tree as we zoom in or out, we simply compute which nodes to show, and in what state.

As an experiment I decided to explore the legume tree used in the GBIF website. As is sadly so typical, the original publication of the tree (Ringelberg et al. 2023) doesn’t provide the actual tree, but I found a JSON version on GitHub I then converted that to Newick format so my tools could use it (had a few bumpy moments when I discovered that the tree has negative branch lengths!). The converted file is here:

I then ran the tree through my code and generated views at various zoom levels.

Note that as the tree expands labels are always legible, and zooming only increased the size of the tree in the y-axis (as the expanded nodes take up more space). Note also that we see a number of isolated taxa appearing, such as Lachesiodendron viridiflorum. These taxa are often of evolutionary interest, and also of high conservation interest due to their phylogenetic isolation. Simply showing the whole tree hides these taxa.

Now, looking at these two diagrams there are two obvious limitations. The first is that the black triangles representing collapsed clades are all the same size regardless of whether they represent a few of many taxa. This could be addressed by adding numbers beside each triangle, using colour to reflect the numebr of collapsed nodes, or perhaps by breaking the “one node per row” rule by drawing particularly large nodes over two or more lines.

The other issue is that most of the triangles lack labels. This is because the tree itself lacks them (I added “Ingoid clade”, for example). There will be lots of nodes which can be labelled (e.g., by genus name), but once we start displaying phylogeny we will need to make use of informal names, or construct labels based on the descendants (e.g., “genus 1 - genus 5”). We can also think of having sets of labels that we locate on the tree by finding the least common ancestor (AKA the most recent common ancestor) of that label (hello Phylocode).

Another consideration is what to do with labels as taxa are expanded?. One approach would be to use shaded regions, for example in the last tree above we could shade the clades rooted at Mimosa, Vachellia, and the “Ingoid clade” (and others if they had labels). If we were clever we could alter which clades are shaded based on the zoom level. If we wanted these regions to not overlap (for example, if we wanted bands of colour corresponding to clades to appear on the right of the tree) then we could use something like maximum disjoint sets to choice the best combination of labels.


I don’t claim that this alternative visualisation is perfect (and my implementation of it is very far from perfect). but I think it shows that there are ways we can zoom into trees that reflects tree structure, ensures labels are always legible, and that supports levels of detail (collapsed nodes expanding as we zoom). The use of inorder traversal and three styles of node drawing mean that the diagram is simple to render. We don’t need fancy graphics, we can simply have a list of images.

To conclude, I think it’s great GBIF is moving to include phylogenies. But we can't visualise phylogeny as a static image, it's a structure that requires us to think about how to display it with the same level of creativity that makes web maps such a successful visualisation.


Jin Chen, MacEachren, A. M., & Peuquet, D. J. (2009). Constructing Overview + Detail Dendrogram-Matrix Views. IEEE Transactions on Visualization and Computer Graphics, 15(6), 889–896.

Libin, P., Vanden Eynden, E., Incardona, F., Nowé, A., Bezenchek, A., … Sönnerborg, A. (2017). PhyloGeoTool: interactively exploring large phylogenies in an epidemiological context. Bioinformatics, 33(24), 3993–3995. doi:10.1093/bioinformatics/btx535

Page, R. D. M. (2012). Space, time, form: Viewing the Tree of Life. Trends in Ecology & Evolution, 27(2), 113–120.

Ribeiro, P. G., Luckow, M., Lewis, G. P., Simon, M. F., Cardoso, D., De Souza, É. R., Conceição Silva, A. P., Jesus, M. C., Dos Santos, F. A. R., Azevedo, V., & De Queiroz, L. P. (2018). lachesiodendron , a new monospecific genus segregated from piptadenia(Leguminosae: Caesalpinioideae: mimosoid clade): evidence from morphology and molecules. TAXON, 67(1), 37–54.

Ringelberg, J. J., Koenen, E. J. M., Sauter, B., Aebli, A., Rando, J. G., Iganci, J. R., De Queiroz, L. P., Murphy, D. J., Gaudeul, M., Bruneau, A., Luckow, M., Lewis, G. P., Miller, J. T., Simon, M. F., Jordão, L. S. B., Morales, M., Bailey, C. D., Nageswara-Rao, M., Nicholls, J. A., … Hughes, C. E. (2023). Precipitation is the main axis of tropical plant phylogenetic turnover across space and time. Science Advances, 9(7), eade4954.

Zaslavsky L., Bao Y., Tatusova T.A. (2007) An Adaptive Resolution Tree Visualization of Large Influenza Virus Sequence Datasets. In: Măndoiu I., Zelikovsky A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg.

Written with StackEdit.