Tuesday, June 15, 2021

Compiling a C++ application to run on Heroku

TL;DR Use a buildpack and set "LDFLAGS=--static" --disable-shared

I use Heroku to host most of my websites, and since I mostly use PHP for web development this has worked fine. However, every so often I write an app that calls an external program written in, say, C++. Up until now I've had to host these apps on my own web servers. Today I finally bit the bullet and learned how to add a C++ program to a Heroku-hosted site.

In this case I wanted to add CRF++ to an app for parsing citations. I'd read on Stack Overflow that you could simply log into your Heroku instance using

heroku run bash
and compile the code there. I tried that for CRF++ but got a load of g++ errors, culminating in:

configure: error: Your compiler is not powerful enough to compile CRF++.

Turns out that the g++ compiler is only available at build time, that is, when the Heroku instance is being built before it is deployed. Once it is deployed g++ is no longer available (I'm assuming because Heroku tries to keep each running instance as small as possible).

So, next I tried using a buildpack, specifically felkr/heroku-buildpack-cpp. I forked this buildpack, and added it to my Heroku app (using the "Settings" tab). I put the source code for CRF++ into the root folder of the GitHub repository for the app (which makes things messy but this is where the buildpack looks for either Makefile or configure) then when the app is deployed CRF++ is compiled. Yay! Update: with a couple of tweaks I moved all the code into a folder called src and now things are a bit tidier.

Not so fast, I then did

heroku run bash
and tried running the executable:

heroku run bash -a <my app name>
./crf_learn
/app/.libs/crf_learn: error while loading shared libraries: libcrfpp.so.0: cannot open shared object file: No such file or directory

For whatever reason the executable is looking for a shared library which doesn’t exist (this brought back many painful memories of dealing with C++ compilers on Macs, Windows, and Linux back in the day). To fix this I edited the buildpack compile script to set the "LDFLAGS=--static" --disable-shared flags for configure. This tells the compiler to build static versions of the libraries and executable. Redeploying the app once again everything now worked!

The actual website itself is a mess at the moment so I won't share the link just yetUpdate: see Citation parsing tool released for details, but it's great to know that I can have both a C++ executable and a PHP script hosted together without (too much) pain. As always, Google and Stack Overflow are your friends.

Friday, June 04, 2021

Thoughts on BHL, ALA, GBIF, and Plazi

If you compare the impact that BHL and Plazi have on GBIF, then it's clear that BHL is almost invisible. Plazi has successfully in carved out a niche where they generate tens of thousands of datasets from text mining the taxonomic literature, whereas BHL is a participant in name only. It's not as if BHL lacks geographic data. I recently added back a map display in BioStor where each dot is a pair of latitude and longitude coordinates mentioned in an article derived from BHL's scans.

This data has the potential to fill in gaps in our knowledge of species distributions. For example, the Atlas of Living Australia (ALA) shows the following map for the cladoceran (water flea) Simocephalus:

Compare this to the localities mentioned in just one paper on this genus:

Franklin, D. C., Michael, B., & Mace, M. (2005). New location records for some butterflies of the Top End and Kimberley regions. Northern Territory Naturalist, 18, 1–7. Retrieved from https://biostor.org/reference/254167

There are records in this paper for species that currently have no records at all in ALA (e.g., Simocephalus serrulatus):

As it stands BioStor simply extracts localities, it doesn't extract the full "material citation" from the text (that is, the specimen code, date collected, locality, etc. for each occurrence). If it did, it would then be in a position to contribute a large amount of data to ALA and GBIF (and elsewhere). Not only that, if it followed the Plazi model this contribution would be measurable (for example, in terms of numbers of records added, and numbers of data citations). Plazi makes some of its parsing tools available as web services (e.g., http://tb.plazi.org/GgWS/wss/test and https://github.com/gsautter/goldengate-webservices), so in principle we could parse BHL content and extract data in a form usable by ALA and GBIF.

Notes on Plazi web service

The endpoint is http://tb.plazi.org/GgWS/wss/invokeFunction and it accepts POST requests, e.g. data=Namibia%3A%2058%20km%20W%20of%20Kamanjab%20Rest%20Camp%20on%20road%20to%20Grootberg%20Pass%20%2819%C2%B038%2757%22S%2C%2014%C2%B024%2733%22E%29&functionName=GeoCoordinateTaggerNormalizing.webService&dataUrl=&dataFormat=TXT and returns XML.