Friday, June 29, 2012

Planet management, GBIF, and the future of biodiversity informatics


Next week I'm in Copenhagen for GBIC, the Global Biodiversity Informatics Conference. The goal of the conference is to:
...convene expertise in the fields of biodiversity informatics, genomics, earth observation, natural history collections, biodiversity research and policy needed to set such collaboration in motion.

The collaboration referred to is the agreement to mobilise data and informatics capability to meet the Aichi Biodiversity Targets.

I confess I have mixed feelings about the upcoming meeting. There will be something like 100 people attending the conference, with backgrounds ranging from pure science to intergovernmental policy. It promises to be interesting, but whether a clear vision of the future of biodiversity informatics will emerge is another matter.

GBIC is part of the process of "planet management", a phrase that's been around for a while, but which I only came across in Bowker's essay "Biodiversity Datadiversity"1:

Bowker, G. C. (2000). Biodiversity Datadiversity. Social Studies of Science, 30(5), 643–683. doi:10.1177/030631200030005001

Bowker's essay is well worth a read, not least for the choice quotes such as:

Each particular discipline associated with biodiversity has its own incompletely articulated series of objects. These objects each enfold an organizational history and subtend a particular temporality or spatiality. They frequently are incompletely articulated with other objects, temporalities and spatialities — often legacy versions, when drawing on non-proximate disciplines. If one wants to produce a consistent, long-term database of biodiversity-relevant information the world over, all this sounds like an unholy mess. At the very least it suggests that global panopticons are not the way to go in biodiversity data. (p. 675, emphasis added)

I have not, in general, questioned the mania to name which is rife in the circles whose work I have described. There is no absolutely compelling connection between the observation that many of the world’s species are dying and the attempt to catalogue the world before they do. If your house is on fire, you do not necessarily stop to inventory the contents before diving out the window. However, as Jack Goody (1977) and others have observed, list-keeping is at the heart of our body politic. It is also, by extension, at the heart of our scientific strategies. Right or wrong, it is what we do. (p. 676, emphasis added)

Given that I'm a fan of the notion of a "global panopticon", and spend a lot of time fussing with lists of names, I find Bowker's views refreshing. Meantime, roll on GBIC2012.

1. Bowker cites Elichirigoity as a source of the term "planet management":

Fernando Elichirigoity (1999), Planet Management: Limits to Growth, Computer Simulations, and the Emergence of Global Spaces (Evanston, IL: Northwestern University Press). ISBN 0810115875 (Google Books oP3wVnKpGDkC).

From the limited Google preview, and the review by Edwards, this looks like an interesting book:

Edwards, P. (2000). Book Review: Planet Management: Limits to Growth, Computer Simulation, and the Emergence of Global Spaces by Fernando Elichirigoity. Isis, 91(4), 828. doi:10.1086/385020 (PDF here)

Thursday, June 28, 2012

Where is the "crowd" in crowdsourcing? Mapping EOL Flickr photos

In any discussion of data gathering or data cleaning the term "crowdsourcing" inevitably comes up. An example where this approach has been successful is the Encyclopedia of Life's Flickr pool, where Flickr users upload images that are harvested by EOL.

Given that many Flickr photos are taken with cameras that have built-in GPS (such as the iPhone, the most common camera on Flickr), we could potentially use Flickr photos not only as a source of images of living things, but also to supplement existing distributional data. For example, Flickr has enough data to fairly accurately construct outlines of countries, cities, and neighbourhoods (see The Shape of Alpha), so what about organismal distributions?

This question is part of a Masters project by Jonathan McLatchie here at Glasgow, comparing distributions of taxa in GBIF with those based on Flickr photos. As part of that project the question arose "where are the Flickr photos being taken?" If most of the photos are being taken in the developed world, then there are at least two problems. The first is the obvious bias against organisms that live elsewhere (i.e., typically many photos won't be taken in those regions where you'd actually like to get more data). Secondly, the presence of zoos, wildlife parks, and botanical gardens means you are likely to get images of organisms well outside their natural range.

Jonathan suggested a "heatmap" of the Flickr photos would help, so to create this I wrote a script to grab metadata for the photos from the Encyclopedia of Life's Flickr pool, extract latitude and longitude, and draw the resulting locations on a map. I aggregated the points into 1°×1° squares, and generated a GBIF-style map of the photos:


Lots of photos from North America, Europe, and Australasia, as one might expect. Coverage of the rest of the globe is somewhat patchy. I guess the key question to ask is to what extent the "crowd" (Flickr users in this case) is essentially replicating the sampling biases already present in projects like GBIF, which aggregate data from museum collections (most of which are in the developed world).
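For what it's worth, the aggregation step described above is simple; here is a minimal Python sketch (my own illustration, assuming the latitude/longitude pairs have already been extracted from the Flickr metadata):

```python
from collections import Counter
from math import floor

def bin_coordinates(points):
    """Aggregate (latitude, longitude) pairs into 1-degree squares.

    Each point is assigned to the square whose south-west corner is
    (floor(lat), floor(lon)); the count per square is what the
    heatmap displays."""
    counts = Counter()
    for lat, lon in points:
        counts[(floor(lat), floor(lon))] += 1
    return counts

# Two photos from around Glasgow land in the same square, one from
# Sydney in another
photos = [(55.87, -4.29), (55.21, -4.91), (-33.86, 151.21)]
squares = bin_coordinates(photos)
```

Each square's count can then be mapped to a colour to produce the GBIF-style heatmap.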

The PHP code to fetch the photo data and create the map is available on GitHub. You'll need a Flickr API key to run the script. The GitHub repository has an SVG version of the map (with a bitmap background), and a bitmap copy of the map is available on FigShare.

Wednesday, June 27, 2012


Just for future reference:

Monday, June 25, 2012

More fictional taxa and the myth of the expert taxonomic database

I know I'm starting to sound like a broken record, but the more I look, the more taxonomic databases seem to be full of garbage. Databases such as the Catalogue of Life, which states that it is a "quality-assured checklist", have records that are patently wrong. Here's yet another example.

If you search for the genus Raymondia in the Catalogue of Life you get multiple occurrences of the same species names, e.g.:

Both of these are listed as "provisionally accepted names", supplied by WTaxa: Electronic Catalogue of Weevil names (Curculionoidea). Clearly we can't have two species with the same name, so what's happening?

Firstly, Hustache, A., 1930 is:

Hustache A (1930) Curculionidae Gallo-Rhénans. Annales de la Société entomologique de France 99: 81-272.

On p. 246 Hustache refers to Raymondionymus fossor Aubé, 1864 (see below).


So, Raymondionymus fossor Hustache, A., 1930 is not a new species but simply the citation of a previously published one (it's a chresonym). Hustache cites the author of the name as Aubé, 1864, and you can see the original description by Aubé in BioStor (Description de six espèces nouvelles de Coléoptères d'Europe dont deux appartenant a deux genres nouveaux et aveugles). So, if the taxonomic authority should be Aubé, 1864, what about Raymondionymus fossor Ganglebauer, L., 1906? Again, if we track down the original publication (Revision der Blindrüsslergattungen Alaocyba und Raymondionymus), it's simply Ganglebauer citing (on p. 142) Aubé's paper, not describing a new species.

Note that the nomenclature of this weevil species is further complicated because Aubé originally described the species as Raymondia fossor, but Raymondia was already in use for a fly (see Über eine neue Fliegengattung: Raymondia, aus der Familie der Coriaceen, nebst Beschreibung zweier Arten derselben). To resolve this homonymy Wollaston proposed the name Raymondionymus:

Wollaston, T. V. (1873). XVIII. On the Genera of the Cossonidae. Transactions of the Royal Entomological Society of London, 21(4), 427–652. doi:10.1111/j.1365-2311.1873.tb00645.x

So, we have a bit of a mess. Unfortunately this mess percolates up through other databases, for example EOL has three different pages for Raymondionymus fossor.

For me the lesson here is that relying on acquiring data from "trusted" sources, curated by "experts", is simply not a tenable strategy for building lists of taxa. If names are essential bits of biodiversity infrastructure upon which we hang other data, then these lists need to be cleaned, which means exposing them to scrutiny, and providing an easy means for errors to be flagged and corrected. Trust is something that is earned, not asserted, and it's time taxonomic databases stopped claiming to be authoritative simply because they rely on expert sources. Expertise is no guarantee that you won't make errors.

For me this is one of the key reasons projects like BHL are so important. As more and more of the original literature becomes available, we lessen our reliance on "expertise". We can start to see for ourselves. In other words, "Nullius in verba" ("take nobody's word for it").

Tuesday, June 19, 2012

70,000 articles extracted from the Biodiversity Heritage Library

Just noticed that BioStor now has just over 70,000 articles extracted from the Biodiversity Heritage Library. This number is a little "soft" as there are some duplicates in the database that I need to clean out, but it's a nice sounding number. Each article has full text available, and in most cases reasonably complete metadata.

Most of the articles in BioStor have been added using semi-automated methods, but there's been rather more manual entry than I'd like to admit. One task that does have to be done manually is attaching plates to papers. This is largely an issue for older publications, where printing text and figures required different processes, resulting in text and figures often being widely separated in the publication. Technology evolved, and the more recent literature doesn't have this problem.

Future plans include adding the ability to download the articles as searchable PDFs, and to support OCR correction, amongst other things. BioStor also underpins some of my other projects, such as the EOL Challenge entry, which as of now has around 80,000 animal names linked to their original description in BioStor (and some 300,000 in total linked to some form of digital identifier). One day I may also manage to get the article locations into BHL itself, so that when you browse a scanned item in BHL you can quickly find individual articles. Oh, and it would be cool to have all this on the iPad...

Monday, June 18, 2012

BHL and text-mining: some ideas

Some quick notes on possibilities for text-mining BHL (in rough order of priority). Any text-mining would have to be robust to OCR errors. I've created a group of OCR-related papers on Mendeley:

OCR - Optical Character Recognition is a group in Computer and Information Science on Mendeley.

Improve finding taxonomic names in text in face of OCR errors

There is some published research on OCR errors that could be used to develop a tool to improve our ability to index OCR text. The outcome would be improved search in BHL (and other archives); I've touched on some of these issues earlier. One approach that looks interesting is anagram hashing (see Reynaert, 2008), which may be a cheap way to support approximate string matching in OCR text.

Reynaert, M. (2008). Non-interactive OCR Post-correction for Giga-Scale Digitization Projects. Lecture Notes in Computer Science, 4919:617-630. doi:10.1007/978-3-540-78135-6_53 (PDF here).
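To give a flavour of the idea, here is a toy sketch of anagram hashing in Python (my own illustration, not Reynaert's implementation): each word is reduced to a numeric key that is invariant under character order, so OCR variants can be found by arithmetic on keys rather than pairwise string comparison.

```python
def anagram_key(word, power=5):
    """Anagram hash: the sum of each character's code point raised to a
    fixed power. Anagrams share a key, and replacing one substring with
    another shifts the key by a fixed amount regardless of context."""
    return sum(ord(c) ** power for c in word.lower())

# Anagrams collide:
assert anagram_key("listen") == anagram_key("silent")

# A typical OCR confusion ("rn" misread as "m") shifts the key by the
# same delta in any word, so candidate corrections can be found by
# looking up key +/- delta in a hash table of known words:
delta = anagram_key("m") - anagram_key("rn")
assert anagram_key("modem") - anagram_key("modern") == delta
```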

Recognition and extraction of literature cited

Given an article extract all the references it cites. There's a fair amount of literature on automated citation extraction, but again we need to do this in the face of OCR errors, and enormous variability in citation styles. The outputs could help build citation indexes, and also serve as data for the "bibliography of life". The citations could also be used to help locate further articles in BHL (e.g., using BioStor's OpenURL resolver).

Improved extraction of named entities (e.g., museum specimen codes) and localities (e.g., latitude and longitudes, place names)

This would enable better geographic searches, and help start to link literature to museum specimen databases.

Automated recognition of articles within scanned volumes

My own approach to finding articles has focussed on citation metadata: given an article's title, journal, volume, and pagination, find the corresponding article in BHL:

Page, R. D. (2011). Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics, 12(1), 187. doi:10.1186/1471-2105-12-187
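As a rough illustration of metadata-based matching (my own toy sketch; the field names and scoring are illustrative, not BioStor's actual implementation), one can require the volume and pagination to be consistent, and score the journal title fuzzily to absorb OCR noise and accent differences:

```python
from difflib import SequenceMatcher

def score_candidate(citation, bhl_item):
    """Crude score for whether a scanned BHL item could contain a cited
    article: the volume must match and the item's page range must cover
    the article's pagination; if so, return the fuzzy similarity of the
    journal titles."""
    volume_ok = citation["volume"] == bhl_item["volume"]
    pages_ok = (bhl_item["first_page"] <= citation["start_page"]
                and citation["end_page"] <= bhl_item["last_page"])
    if not (volume_ok and pages_ok):
        return 0.0
    return SequenceMatcher(None,
                           citation["journal"].lower(),
                           bhl_item["journal"].lower()).ratio()

citation = {"journal": "Annales de la Société entomologique de France",
            "volume": "99", "start_page": 81, "end_page": 272}
item = {"journal": "Annales de la Societe entomologique de France",
        "volume": "99", "first_page": 1, "last_page": 300}
score = score_candidate(citation, item)  # high, despite the accents
```

In practice the hard part is the enormous variability in how journal titles, volumes, and dates are recorded.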

An alternative is to infer articles from just the scanned pages. There has been some limited work on this in the context of BHL:

Lu, X., Kahle, B., Wang, J. Z., & Giles, C. L. (2008). A metadata generation system for scanned scientific volumes. Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’08 (p. 167). Association for Computing Machinery (ACM).
doi:10.1145/1378889.1378918 (PDF here)

The NLM has some cool stuff on automatically labelling the parts of a document, see Automated Labeling in Document Images and Ground truth data for document image analysis. See also Distance Measures for Layout-Based Document Image Retrieval.

Other links
Should also note that there's a relevant question on StackOverflow about OCR correction, which has links to tools like OCRspell:

Taghva, K., & Stofsky, E. (2001). OCRSpell: an interactive spelling correction system for OCR errors in text. International Journal on Document Analysis and Recognition, 3(3), 125–137. doi:10.1007/PL00013558

Code is on github.

Fictional taxa

Anyone who works with taxonomic databases is aware of the fact that they have errors. Some taxonomic databases are restricted in scope to a particular taxon in which one or more people have expertise, these then get aggregated into larger databases, which may in turn be aggregated by databases whose scope is global. One consequence of this is that errors in one database can be propagated through many other databases.

As an example (for reasons I can't remember), I came across the name "Panisopus" (in the water mite family Thyasidae) but was struggling to find any mention of the taxonomic literature associated with this name. If you Google Panisopus the first two pages are full of search results from ITIS, EOL, GBIF, and ZipCodeZoo, all listing several species in the genus, and sometimes taxonomic authorities, but no links to the primary literature. If you search BHL for Panisopus you get nothing, nothing at all. It's as if the name didn't exist.

Turns out, that's exactly the point. The name doesn't exist, other than in the various databases that have consumed other databases and recycled this fictional taxon. After some Googling of authors' names it became clear that "Panisopus" is probably a misspelling of "Panisopsis", which according to ION was published in:

Viets, K. (1926). Eine nomenklatorische Aenderung im Hydracarinen-Genus Thyas C. L. Koch. Zool Anz Leipzig, 66: 145–148

I can't verify this because the article is not available online. But to give one example, ITIS lists the name "Panisopus pedunculata Keonike, 1895" (TSN 83185). This name should be, as far as I can tell, Panisopsis pedunculata (Koenike, 1895), based on Mitchell, 1954, who on page 36 states:


Note that Panisopsis pedunculata was originally described in a different genus (Koenike 1895 precedes the publication of the genus name by Viets in 1926). We can locate Koenike's original publication "Nordamerikanische Hydrachniden" in BHL, which I've added to BioStor, and the original description appears on p. 192 as Thyas pedunculata (note that ITIS misspells the author's name Koenike [o and e transposed], as well as omitting the parentheses around the name).

What I find a little alarming (if not surprising) is that the entirely fictional genus "Panisopus" and its accompanying species have ended up in numerous taxonomic databases, and these databases consistently appear in the top Google searches for this name. The good news is that it's becoming increasingly easy to discover these errors, in part because more and more taxonomic literature is coming online, making it possible for users to investigate matters for themselves, rather than rely on unsupported statements in taxonomic databases. I'm continually amazed by how little evidence most taxonomic databases provide for any of the assertions that they make. If a database includes a name, I want some evidence that the name is "real". Show me the publication, or at least give me a citation that I can follow up. I can't take these databases on blind faith, because demonstrably they are replete with errors. Ironically, one measure of success in the Internet age is being in the top ten hits for a Google search. Now, if the top ten hits are all taxonomic databases I get very, very nervous. It's a good sign that the name exists only in those databases.

Friday, June 15, 2012

BHL to PDF workflow

Just some random thoughts on creating searchable PDFs for articles extracted from BHL.


Thursday, June 14, 2012

Taxonomy and the nine billion names of God

In Arthur C. Clarke's short story The Nine Billion Names of God Tibetan monks hire two programmers to help them generate all the possible names of God. The monks believe that the purpose of the Universe is to generate those names; once that goal is achieved the Universe will end. As the understandably skeptical programmers leave having completed their task, they look up into the sky and notice that "overhead, without any fuss, the stars were going out."

Leaving aside the delicious irony that arises if we recast this story with the monks replaced by taxonomists, much of our work with taxonomic names seems to be enumerating endless permutations of the same names. Part of the problem is the way some databases store and provide access to names.

The simplest way to represent a taxonomic name is to just have the name (the "canonical name"), without additional bits such as the taxonomic authority. In my view, any taxonomic database that serves names should provide the canonical name. I'm not arguing that they shouldn't provide taxonomic authority information (ideally separately, but could also be as part of a canonical name + authority string), I just want them to also provide just the canonical name. For some reason this seems to upset people (e.g., this thread on the TDWG mailing lists), so let me explain why I think this matters.

Most people use taxonomic names without the authority (just Google a taxonomic name with and without its authority and compare the number of hits). So, if your goal is to be of service to your users, make sure you provide the canonical name.

Then there is the issue of integrating data from different sources. The more parts to the name, the more scope there is for ambiguity. For example, my first ever publication was a description of a new species of pea crab, Pinnotheres atrinicola, published in:

Page, R. D. M. (1983). Description of a new species of Pinnotheres, and redescription of P. novaezelandiae (Brachyura: Pinnotheridae). New Zealand Journal of Zoology, 10(2), 151–162. doi:10.1080/03014223.1983.10423904

If we look for this name in ION we discover three records:

Pinnotheres atrinicola Page

Two are duplicates of "Pinnotheres atrinicola", with and without the authority, one is a misspelling ("Pinnotheres atrinacola"). Given just the name we already see that it's easy for people to get the spelling wrong and generate lexical variants.

If we now add the authority we get more potential for variation. ION writes the authority as "Page 1983" (no comma), but other databases such as WoRMS write it as "Page, 1983" (with comma). So we have two variations of the name and two of the authority, giving four possible strings if we include both name and authority. This combinatorial explosion means that we can rapidly generate lots of strings that are fundamentally the same.

I'm not arguing that taxonomic authorities aren't useful, and I want them wherever they are known, but insisting that databases serve name + authority to the exclusion of just the canonical name is a recipe for disaster. One could argue that users can parse the string into name and authority components, but that's a headache (just take a look at taxon-name-processing for details). Why make users go through hoops to get basic information?
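To illustrate how fiddly this parsing is, here is a deliberately naive canonical-name extractor in Python (my own sketch; the real parsers listed on the taxon-name-processing page handle many more cases, such as subgenera, hybrid markers, and ranks):

```python
import re

def canonical_name(name_string):
    """Naive sketch: take the leading capitalised genus word plus any
    following lower-case epithets as the canonical name; everything
    after that (authority, year, parentheses) is dropped."""
    m = re.match(r"[A-Z][a-z]+(?: [a-z][a-z-]+)*", name_string)
    return m.group(0) if m else name_string

# The combinatorial variants collapse to one canonical string:
variants = ["Pinnotheres atrinicola Page 1983",
            "Pinnotheres atrinicola Page, 1983",
            "Pinnotheres atrinicola"]
```

If every database served this canonical form alongside whatever authority string it holds, users would need none of this guesswork.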

Another reason I'm wary of taxonomic authority strings is that people don't always understand the conventions. For example, in my previous post I used the following example for names that differed in authority string:

  • Demansia torquata Günther 1862
  • Demansia torquata (Günther, 1862)

The use of parentheses seems a small difference, but (a) it means the strings are different, and (b) the presence or absence of parentheses changes the meaning of the authority. In this example, Demansia torquata Günther 1862 means that Günther is the original author of the name Demansia torquata, and so if I search Günther's publications from 1862 for "Demansia torquata" I will find that name. Demansia torquata (Günther, 1862), on the other hand, means that Günther originally described this species in 1862, but he placed it in a different genus, so my search for "Demansia torquata" in 1862 is likely to be fruitless. So, if the authority is actually (Günther, 1862) but a database tells me it's Günther, 1862 I'd be wasting my time looking for the name in 1862.

As it turns out, this snake was originally described as Diemansia torquata (see "On new species of snakes in the collection of the British Museum"). The genus name Diemansia differs from Demansia, hence (Günther, 1862) should be correct, but it looks like Diemansia and Demansia are just some of the variations of the same snake genus name. *Sigh*

Variation in taxonomic authority extends beyond parentheses. In a post on clustering strings I used examples of taxonomic authorities for the genus Helicella:

Ferrusac 1821
Bonavita 1965
Ferussa 1821
Lamarck 1812
Ferussac 1821

There are six different strings here which correspond to three different authorities. In this example the name Helicella is a homonym (same name used for different taxa) so having the taxonomic authority can help decide which name is actually meant, but people can't seem to agree on how to spell the authority names, and in other cases they might not agree on dates of publication, hence we get variations such as those above. Even when authorities are useful, they come at a cost. And that's not even considering chresonyms where the authority isn't the original author, but instead is a form of citation of the use of a name.

All of this variation is a cause of ambiguity, and when we combine permutations of taxonomic names and taxonomic authorities, things start to get messy. Indeed, I'd argue that projects such as the Global Names Index (GNI) are essentially doing what Arthur C. Clarke's monks were doing, trying to capture near endless permutations of the same names. Given this, it seems crazy not to try and keep things as simple as possible. In the vast majority of cases I want the name, I don't want the rest of the cruff attached to it. Taxonomic authorities are really just proxies for citation, so lets focus on getting that information linked to names, and stop making life difficult for users.

Wednesday, June 13, 2012

Visualising differences between classifications using cluster maps

As part of a project to build a tool to navigate through taxonomic names and classifications I've become interested in quick ways to compare classifications. For example, EOL has multiple classifications for the same taxon, and I'd like to quickly discover what the similarities and differences are.

One promising approach is to use "cluster maps", a technique described by Fluit et al. (see Aduna Cluster Map for an implementation):

Fluit, C., Sabou, M., & Harmelen, F. (2006). Visualizing the Semantic Web. (V. Geroimenko & C. Chen, Eds.) (pp. 45–58). Springer Science + Business Media. doi:10.1007/1-84628-290-X_3

Cluster map details

Cluster maps can be thought of as fancy Venn Diagrams, in that they can be used to depict the overlap between sets of objects. The diagram is a graph with two kinds of nodes. One represents categories (in the example above, file formats and search terms), the other represents sets of objects that occur in one or more categories (in the example above, these are files that match the search terms "rdf" and "aperture").

I've cobbled together a crude version of cluster maps. For a given taxon (e.g., a genus) I list all the immediate sub-taxa (e.g., species) in each classification in EOL, and then find the sets of sub-taxa that are shared across the classification sources (e.g., ITIS, NCBI, etc.) and those that are unique to one source. I then create the cluster map using Graphviz. Inspired by the hexagonal packing used by Aduna, I've done something similar to display the taxa in each set. Adding these to the output of Graphviz required a little fussing. First I get Graphviz to output the graph in SVG, then I load the SVG into a program that locates each node in the graph and inserts SVG for the packed circles (given that SVG is XML this is fairly straightforward).
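The set computation at the heart of this is straightforward; here is a Python sketch (my own illustration, with made-up names) that partitions names by exactly which sources contain them:

```python
def venn_partition(classifications):
    """Partition names by exactly which sources contain them; these
    parts are the set nodes of a cluster map. `classifications` maps a
    source name to its set of sub-taxon names; the result maps a
    frozenset of sources to the names found in precisely those
    sources."""
    partition = {}
    all_names = set().union(*classifications.values())
    for name in all_names:
        sources = frozenset(s for s, names in classifications.items()
                            if name in names)
        partition.setdefault(sources, set()).add(name)
    return partition

demo = {"CoL":  {"a", "b", "c"},
        "ITIS": {"b", "c", "d"},
        "NCBI": {"e"}}
parts = venn_partition(demo)
```

Each non-empty part then becomes a set node in the graph, linked to the category nodes for its sources.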

As an example, consider the genus Demansia, for which EOL reports four classifications. Below is a cluster map for this genus:


This diagram shows that, for example, the Catalogue of Life (CoL) and Reptile databases share four names, and these two databases share three further names with ITIS. All the databases have names unique to themselves, and one database (NCBI) is completely disconnected from the other three.

One important caveat here is that I'm mapping the scientific names as returned by EOL, and in many cases these contain the taxonomic authority. This is a major headache, prompting this outburst:

If we clean the names by removing the taxonomic authority the clusters overlap rather more:
Now we see that only ITIS and the Reptile Database have unique names. This is one reason why I get stroppy when taxonomists start saying databases shouldn't have to supply cleaned "canonical" names. If the names have authorities then I have to clean them, because in many cases the authorities (while useful to know) are inconsistent across databases. For example:

  • Demansia olivacea GRAY 1842 versus Demansia olivacea (Gray, 1842)
  • Demansia torquata GÜNTHER 1862 versus Demansia torquata (Günther, 1862)

Taxonomic authorities are frequently misspelt, and people seem confused about when to use parentheses or not. Databases should spare the user some pain and provide clean names (and authority strings separately where they have them).

The visualisation is still incomplete (I need to make it interactive), but it shows promise. The names that are unique to one database are usually worth investigating. In some cases they are names other databases regard as synonyms, in other cases they represent spelling variations. The goal of this visualisation is to highlight the names that the user might want to investigate further.

Tuesday, June 12, 2012

Using a zoomable treemap to visualise a taxonomic classification

One visualisation method I keep coming back to is the treemap. Each time I experiment with them I learn a little bit more, but I usually end up abandoning them (with the exception of using quantum treemaps to display bibliographic data). But they keep calling me back.

My latest experiment builds on some earlier thoughts on quantum treemaps, but tackles two issues that have kept bugging me. The first is that quantum treemaps are limited to hierarchies that are only two levels deep (e.g., family → genus → species). This is because, unlike regular treemaps where you are slicing and dicing a rectangle of predetermined size, when you construct a quantum treemap you don't know how big it will be until you've made it (this is because you want to ensure that every item in the hierarchy can be displayed at the same size, and fitting them in may require you to tweak the size of the treemap). Given that taxonomic classifications have > 2 levels this is a problem. One approach is to construct quantum treemaps for the lower parts of the classification, then pack those into a larger rectangle. This is an instance of the packing problem. After Googling for a bit I came across this code for packing rectangles, which was easy to follow and gave reasonable results.
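For reference, the simplest workable approach to this kind of packing is shelf packing; here is a Python sketch (my own simplification, not the code linked above):

```python
def shelf_pack(rects, bin_width):
    """Very simple shelf packing: place rectangles left to right along
    a shelf, starting a new shelf below when the next rectangle won't
    fit. `rects` is a list of (width, height) pairs; returns the list
    of (x, y) positions and the total height used."""
    x = y = shelf_height = 0
    placed = []
    for w, h in rects:
        if x + w > bin_width and x > 0:
            # start a new shelf below the current one
            y += shelf_height
            x = shelf_height = 0
        placed.append((x, y))
        x += w
        shelf_height = max(shelf_height, h)
    return placed, y + shelf_height

positions, height = shelf_pack([(4, 2), (3, 3), (5, 1)], bin_width=8)
```

Real packers do better (for example by sorting rectangles by height first and tracking free gaps), but even this naive version is enough to tile sub-treemaps into a parent rectangle.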

The second problem is that I want the treemap to be interactive. I want to be able to zoom in and out and navigate around the treemap. After more Googling, I came across the Zoomooz.js library which makes web page elements zoom (for a pretty mind-blowing example of what can be done see impress.js), but I decided I want to work with SVG. After playing with examples from Keith Wood's jQuery SVG plugin I started to get the hang of creating zoomable visualisations in SVG.

Here's a video of what I've come up with so far: an interactive display of the Catalogue of Life 2010 classification of primates, with images from EOL. It's crude, and there are some obvious issues with redrawing images, labels, etc., but it gives a sense of what can be done. With care this could probably be scaled up to handle the entire Catalogue of Life classification. With a bit more care, it could probably be optimised for the iPad, which would be a fun way to navigate through the diversity of life.

Thursday, June 07, 2012

Discovering species descriptions in digitised newspapers: Trove and The Brisbane Courier

While exploring ways to visually compare classifications I came across the Australian snake name Demansia atra, and ended up reading a series of papers in the Bulletin of Zoological Nomenclature discussing the status of the name (more fun than it sounds, trust me). For example, Smith and Wallach ("Case 2920. Diemenia atra Macleay, 1884 (currently Demansia atra; Reptilia, Serpentes): proposed conservation of the specific name") asked the ICZN to conserve the name, whereas Shea ("On the proposed conservation of the specific name of Diemenia atra Macleay, 1884 (currently Demansia atra; Reptilia, Serpentes)") argued that Hoplocephalus vestigiatus was the correct name for the snake (OK, perhaps not that much fun).

The reason I bring this up is that the original description of Hoplocephalus vestigiatus was published in an Australian newspaper, The Brisbane Courier, on 13 September 1884 (!). This newspaper has been digitised and is available in Trove, a digital archive hosted by the National Library of Australia. The description of Hoplocephalus vestigiatus appears in an account of a meeting of the Royal Society of Queensland:

The Trove newspaper archive has both scanned images and OCR text, rather like BHL, but also enables users to correct OCR errors. The original text looked like this:

Mr De Vis then read a communication en
titled " Desciiptioua of >icw Snakes' -
This papei fe ive the descriptions of four
udditions to our Austi alian snake fauna,
and was prefaced by a synopsis and ditfii ential
characters of the genoa Huplo fjihaliui, to
which two of the anakca were referred-a genus
which Mr De Via stated w as out of all propor
tion larger than any other of anakea in Aus
traha, and, though consisting for the most
deadly reptile of Queensland-the brown
banded or "tiger snake " The new sn ikes were
Hoplocephalus snlcans-the furrow snake, so
called from the faculty the reptile has of con
verting ita ventral surface into a continuous
furrow, forwaidcd by Mr. C W de Burgh
Birch from the Mitchell district Hnplore
phalus lestigialiis, the foot punt snake, a name
of the white mai kings upon its back to tracks
of feet, Cacoplm, Wari o from Warroo station,
in the Port Curtía district, where it was
collected by Mi Llackman, one of the mern
bcis of the soeietj , and IJrarln/ioma Snther
lundi, a snakefrom Carl Creek, Norman River,
dcdicatcel to Mi J Sutherland, of normanton,

A few quick edits in Trove and it looks like this:

Mr. De Vis then read a communication en-
titled "Descriptions of New Snakes."—
This paper gave the descriptions of four
additions to our Australian snake fauna,
and was prefaced by a synopsis and differential
characters of the genus Hoplocephalus to
which two of the snakes were referred—a genus
which Mr. De Vis stated was out of all propor-
tion larger than any other of snakes in Aus-
tralia, and, though consisting for the most
deadly reptile of Queensland—the brown
banded or "tiger snake." The new snakes were
Hoplocephalus sulcans—the furrow snake, so
called from the faculty the reptile has of con-
verting its ventral surface into a continuous
furrow, forwarded by Mr. C. W. de Burgh
Birch from the Mitchell district. Hoploce-
phalus vestigiatus, the foot-print snake, a name
said to be suggested by the fancied resemblance
of the white markings upon its back to tracks
of feet; Cacophis Warro, from Warroo station,
in the Port Curtis district, where it was
collected by Mr. Blackman, one of the mem-
bers of the society; and Brachysoma Suther-
landi, a snake from Carl Creek, Norman River,
dedicated to Mr. J. Sutherland, of Normanton.
Searching Trove for Hoplocephalus I discovered a number of articles on snakes, some of which have also had their OCR text corrected, a measure of the success the project has had in engaging users. Trove has come up several times in discussions about OCR correction and BHL, but this is the first time I've taken a closer look — I didn't expect to find species descriptions in an Australian newspaper.

Saturday, June 02, 2012

Linking NCBI taxonomy to GBIF

In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that GBIF taxonomy ids are stable, unlike NCBI tax_ids, which are fairly persistent (NCBI publishes deletion/merge lists when it makes changes). Then there are the obvious problems with the GBIF taxonomy itself. But if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest route is going to be via GBIF.

The mapping is being made automatically, with some crude checks to try to avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, but uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.
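To give a flavour of the kind of crude check involved, here is a minimal sketch (all names, ids, and data structures below are invented for illustration, not the actual mapping code): match records on canonical name, but reject any match whose classifications disagree at kingdom level, a cheap guard against cross-kingdom homonyms.

```python
# Hypothetical sketch of a name-based NCBI-to-GBIF mapping with a
# crude homonym check. Names, ids, and record shapes are invented.

# NCBI and GBIF use different kingdom labels, so translate first
KINGDOM_MAP = {"Metazoa": "Animalia", "Viridiplantae": "Plantae"}

ncbi = {"Hyla japonica": {"tax_id": 109175, "kingdom": "Metazoa"}}
gbif = {"Hyla japonica": {"id": 12345, "kingdom": "Animalia"},
        "Pieris": {"id": 67890, "kingdom": "Plantae"}}

def map_taxa(ncbi_taxa, gbif_taxa):
    """Map NCBI tax_ids to GBIF ids by exact name match,
    keeping only matches where the kingdoms agree."""
    mapping = {}
    for name, record in ncbi_taxa.items():
        candidate = gbif_taxa.get(name)
        if candidate is None:
            continue  # no GBIF record with this name
        if KINGDOM_MAP.get(record["kingdom"]) == candidate["kingdom"]:
            mapping[record["tax_id"]] = candidate["id"]
    return mapping

print(map_taxa(ncbi, gbif))  # {109175: 12345}
```

A check this coarse will still miss within-kingdom homonyms, which is exactly why keeping the mapping on a wiki, where spurious matches can be corrected by hand, is useful.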

As an example, the page for the frog Hyla japonica (NCBI tax_id 109175) shows links to Wikipedia and to GBIF. There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.


So, we have a wiki page, how do we answer Rutger's original question: how to get GBIF occurrence records via web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the wiki. You can get this by clicking on the RDF icon near the bottom of the page. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:

<rdfs:seeAlso rdf:resource=""/>

So, arm yourself with XPath, a regular expression, or if you are a serious RDF geek break out the SPARQL, and you can extract the GBIF taxon id for a NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, querying that service with the GBIF taxon id will get you the squares shown in the screen shot above.
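As a sketch of the XPath/regular-expression route, here is a minimal Python example. The RDF fragment and the GBIF URL pattern embedded in it are invented for illustration; in practice you would fetch the actual RDF export of the wiki page and adjust the pattern to match the real resource URLs.

```python
import re
import xml.etree.ElementTree as ET

# A toy RDF/XML fragment in the general shape of a Semantic MediaWiki
# export. The page URL, species URL pattern, and id are all invented.
rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://example.org/wiki/Hyla_japonica">
    <rdfs:seeAlso rdf:resource="http://example.org/gbif/species/2427601"/>
  </rdf:Description>
</rdf:RDF>"""

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"

def gbif_ids(xml_text):
    """Pull numeric ids out of rdfs:seeAlso resource URLs."""
    root = ET.fromstring(xml_text)
    ids = []
    for see_also in root.iter(RDFS + "seeAlso"):
        resource = see_also.get(RDF + "resource", "")
        match = re.search(r"/species/(\d+)", resource)
        if match:
            ids.append(match.group(1))
    return ids

print(gbif_ids(rdf_xml))  # ['2427601']
```

With the id in hand, building the density-service query is just string formatting; the payoff is that one wiki page now bridges an NCBI tax_id to a GBIF distribution map.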

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Friday, June 01, 2012

Can you trust EOL?

There's a recent thread on the Encyclopedia of Life concerning erroneous images for the crab Leptograpsus. This is a crab I used to chase around rocks on stormy west-coast beaches near Auckland, so I was a little surprised to see the EOL page for Leptograpsus looks like this:


The name and classification are the crab's, but the image is of a fish (Lethrinus variegatus). Perhaps at some point in aggregating the images the two taxa, which share the abbreviated name "L. variegatus", got mixed up.

Now, errors like this are bound to happen in a project the size of EOL, and EOL has some pretty active efforts to clean up errors (e.g., the Homonym Hunters). But what bothers me about this example is the prominent label Trusted that appears below the image. If I look at all the images for Leptograpsus on EOL, I see "trusted" images for fish. All images of the crab (i.e., the real Leptograpsus) are labelled "unreviewed" and implicitly "untrusted":


If you are going to claim something is "trusted" you need to be very careful. The images of the fish may well come from a trusted source (FishBase), and FishBase's assertion that the image is of Lethrinus variegatus may well be "trusted", but I certainly can't trust the assertion made by EOL that this image depicts a crab.

In this example the error is easy to spot (if you know that crabs and fish are different), but what if the error was more subtle? Or what if you are using EOL's API and explicitly asking for only content you can trust? Then you get the fish images.

If I can't trust "trusted" then EOL has a problem. One way forward is to unpack the notion of "trust" and make sure the user knows what "trusted" means. In this case there are at least two assertions being made:
  1. This image is of a fish (made by FishBase)
  2. This image is of a crab (made by EOL)

EOL needs to make clear what assertions are being made, and which ones it is stating can be "trusted". Ideally it also needs to move away from blanket assertions of "trusted" versus not trusted, because that's far too coarse (just because FishBase knows about fish doesn't mean I'm going to place equal trust in every image it contains having been correctly identified). Trust is something that is conferred by users and acquired over time, not something to be simply asserted.
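One way to picture this "unpacked" trust is to record each assertion together with who made it, and let the user, not the aggregator, decide which sources to believe. A toy sketch (the records, source names, and URL below are all invented for illustration):

```python
# Instead of a single "trusted" flag per image, keep a list of
# (claim, asserter) pairs. A client then filters by the sources
# *it* chooses to trust. All data here is invented.

image = {
    "url": "http://example.org/images/variegatus.jpg",
    "assertions": [
        {"claim": "depicts Lethrinus variegatus",
         "asserted_by": "FishBase"},
        {"claim": "depicts Leptograpsus variegatus",
         "asserted_by": "EOL aggregation"},
    ],
}

def trusted_claims(image, trusted_sources):
    """Return only the claims made by sources the user trusts."""
    return [a["claim"] for a in image["assertions"]
            if a["asserted_by"] in trusted_sources]

# A user who trusts FishBase's fish identifications, but not the
# aggregation step that attached the image to a crab, sees only:
print(trusted_claims(image, {"FishBase"}))
# ['depicts Lethrinus variegatus']
```

The point of the design is that "trusted" stops being a property of the image and becomes a relationship between a specific assertion, its source, and the user consuming it.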