iPhylo: May 2019

Roderic D. M. Page

Tuesday, May 28, 2019

Frankenplace, geospatial search, and discrete global grid systems

Quick note on Frankenplace, a cool search tool that displays the geographic distribution of documents that match the user's query as a heatmap. Details of how the tool works are given in:

B. Adams, G. McKenzie, and M. Gahegan (2015) Frankenplace: Interactive Thematic Mapping for Ad Hoc Exploratory Searching. 24th International World Wide Web Conference (WWW 2015), http://dx.doi.org/10.1145/2736277.2741137

At the heart of the method is a discrete global grid that divides the world up into small areas of the same size. Topics are then geographically indexed, so that when a user searches for, say, "ebola", areas relevant to that query are highlighted (in this case, areas in Africa). It's a striking example of querying data geographically, and one which I hope to explore further in the context of BHL and BioStor.
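To make the idea of an equal-area grid concrete, here is a minimal Python sketch. It is not the grid used by Frankenplace (see the paper for that); it assumes a spherical Earth and a simple cylindrical equal-area binning, where rows are equal steps in sin(latitude) and columns are equal steps in longitude. The function names, grid dimensions, and sample coordinates are all made up for illustration.

```python
import math
from collections import Counter


def cell_id(lat, lon, n_rows=180, n_cols=360):
    """Map a latitude/longitude in degrees to a (row, col) grid cell.

    Rows are equal steps in sin(latitude) and columns are equal steps in
    longitude, so every cell covers the same area on a sphere.
    """
    # sin(lat) runs from -1 to 1; rescale to 0..1 and pick a row.
    y = (math.sin(math.radians(lat)) + 1.0) / 2.0
    row = min(int(y * n_rows), n_rows - 1)
    # Wrap longitude into 0..360 and pick a column.
    x = ((lon + 180.0) % 360.0) / 360.0
    col = min(int(x * n_cols), n_cols - 1)
    return row, col


def heatmap(points, n_rows=180, n_cols=360):
    """Count how many (lat, lon) points fall in each grid cell."""
    return Counter(cell_id(lat, lon, n_rows, n_cols) for lat, lon in points)


# Hypothetical locations of documents matching a query.
hits = [(8.5, -11.8), (8.6, -11.7), (51.5, -0.1)]
counts = heatmap(hits, n_rows=18, n_cols=36)
for cell, n in counts.most_common():
    print(cell, n)
```

The cell counts are what a heatmap would colour. Production discrete global grid systems typically use more elaborate tessellations (hexagons, for instance) to avoid the long thin cells this scheme produces near the poles.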

Update

I've put some notes on various discrete global grid systems in a repo on GitHub: RDF and discrete global grid systems.

Ozymandias meets Wikipedia, with notes on natural language generation

I've tweaked Ozymandias to now include short natural language summaries (snippets) for various taxa. This makes the output a little more friendly and informative. For example, here's a snippet from the page on Cephalodesmius, a dung beetle that makes its own dung.

These snippets come from Wikipedia, or more precisely from the DBpedia project. Behind the scenes I have a script that takes the GBIF taxon id for an ALA taxon (if it exists) and queries Wikidata for the corresponding taxon and any associated identifiers of interest. If there's a link to an English language Wikipedia page, I do a quick SPARQL query to DBpedia to retrieve the snippet of text. At some point all of this could be sped up by adding the relevant data to the triple store and doing the query locally, but for now it works well enough.
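The lookup chain can be sketched roughly as follows. This is not the actual Ozymandias script, just an illustration in Python under several assumptions: that the Wikidata property for the GBIF taxon id is P846, that the DBpedia resource URI can be guessed from the Wikipedia page title, and that the snippet is the English rdfs:comment (DBpedia also has a longer dbo:abstract). All function names here are hypothetical.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def wikidata_query(gbif_id):
    """SPARQL for the Wikidata taxon with this GBIF id and its English Wikipedia page."""
    if not str(gbif_id).isdigit():
        raise ValueError("GBIF taxon ids are expected to be numeric")
    # wdt: and schema: are predefined prefixes on the Wikidata query service.
    return f"""
SELECT ?taxon ?article WHERE {{
  ?taxon wdt:P846 "{gbif_id}" .
  OPTIONAL {{
    ?article schema:about ?taxon ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }}
}}"""


def dbpedia_resource(article_url):
    """Guess the DBpedia resource URI from an English Wikipedia URL (an assumption)."""
    prefix = "https://en.wikipedia.org/wiki/"
    if not article_url.startswith(prefix):
        raise ValueError("not an English Wikipedia article URL")
    return "http://dbpedia.org/resource/" + article_url[len(prefix):]


def dbpedia_query(resource):
    """SPARQL for the short English description of a DBpedia resource."""
    return f"""
SELECT ?snippet WHERE {{
  <{resource}> <http://www.w3.org/2000/01/rdf-schema#comment> ?snippet .
  FILTER (lang(?snippet) = "en")
}}"""


def run_sparql(endpoint, query):
    """Send a SPARQL query over HTTP GET and return the result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "snippet-sketch/0.1 (illustrative example)",
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def snippet_for(gbif_id):
    """Follow GBIF id -> Wikidata -> Wikipedia -> DBpedia; None if any step fails."""
    rows = run_sparql(WIKIDATA_ENDPOINT, wikidata_query(gbif_id))
    articles = [row["article"]["value"] for row in rows if "article" in row]
    if not articles:
        return None
    resource = dbpedia_resource(articles[0])
    rows = run_sparql(DBPEDIA_ENDPOINT, dbpedia_query(resource))
    return rows[0]["snippet"]["value"] if rows else None
```

Calling snippet_for with a GBIF taxon id makes two network requests, which is the cost that loading the relevant triples into the local triple store would avoid.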

Of course, many snippets are little more than stubs, e.g. the snippet for another dung beetle genus Diorygopyx doesn't tell us much more than we can get from the information already displayed.

But having a text summary still seems worthwhile, which raises the question: what do we do when Wikipedia doesn't know anything about a taxon? Obviously, we could start editing Wikipedia to flesh out its content, but that will take a while to filter into databases such as DBpedia. Another approach is to generate snippets from the triple store itself; in other words, to generate natural language summaries from structured data. For example, we could generate summaries such as "Diorygopyx is a genus of Scarabaeidae or scarab beetles in the superfamily Scarabaeoidea" fairly easily from knowing the taxonomic hierarchy and a few common names. But we could also do more. In browsing Ozymandias I'm struck at times by how much our knowledge of one taxon depends on a major piece of taxonomic work, often done some time ago. For example, The Australian Crickets (Orthoptera: Gryllidae). Academy of Natural Sciences of Philadelphia, Monograph 22 by Otte and Alexander (1983) is a monumental taxonomic monograph, and many Australian cricket genera had most (or all) of their species described in that work. Imagine having a snippet that mentioned that (e.g., "Most species in this genus were described in 1983, and no species have been discovered since."). That would give the reader some useful information, and perhaps also prompt them to ask "so, why haven't any more species been described?".
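As a rough illustration of how little machinery such summaries need, here is a Python sketch using plain string templates. The function names, the threshold parameter, and the example years are invented for illustration; in practice the inputs would come from queries against the triple store.

```python
from collections import Counter


def taxonomy_sentence(name, rank, family, common_name=None, superfamily=None):
    """Describe a taxon from its position in the taxonomic hierarchy."""
    article = "an" if rank[0].lower() in "aeiou" else "a"
    sentence = f"{name} is {article} {rank} of {family}"
    if common_name:
        sentence += f" or {common_name}"
    if superfamily:
        sentence += f" in the superfamily {superfamily}"
    return sentence + "."


def description_sentence(years, threshold=0.5):
    """Summarise when the species in a genus were described.

    `years` holds one year of description per species. A sentence is only
    produced when a single year accounts for more than `threshold` of them.
    """
    if not years:
        return ""
    year, count = Counter(years).most_common(1)[0]
    if count / len(years) <= threshold:
        return ""
    amount = "All" if count == len(years) else "Most"
    sentence = f"{amount} species in this genus were described in {year}"
    # Only claim nothing has happened since if that year is the latest one.
    if max(years) == year:
        sentence += "; no species have been described since"
    return sentence + "."


print(taxonomy_sentence("Diorygopyx", "genus", "Scarabaeidae",
                        common_name="scarab beetles",
                        superfamily="Scarabaeoidea"))
# Hypothetical years of description for the species in one genus.
print(description_sentence([1983, 1983, 1983, 1951]))
```

Returning an empty string when no year dominates means the page simply shows nothing rather than a misleading sentence.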

I think there's scope here to make the output from triple stores (and other databases) more approachable using natural language generation. This is obviously a big area, and there are some very sophisticated approaches for outputting very natural language (think chatbots), perhaps the most striking example of which is Google Duplex.

But we don't need quite this level of sophistication; something using much simpler techniques (e.g., nalgene-js) would probably be enough. Armed with some basic facts from the triple store, and some simple templates, we could probably generate some useful text snippets for many taxa in Ozymandias, and indeed for other entities. For example, David Shorthouse is outputting simple text summaries of the contribution of taxonomists to specimen collection and identification:

Arthur Loveridge identified Scolecomorphidae and collected Chamaeleonidae https://t.co/NimKzmUuOX
— Bloodhound (@BloodhoundTrack) May 24, 2019

Imagine extending this to take into account publications, geography, etc. I think there's lots of scope here for moving beyond just displaying data and trying to generate human-friendly summaries of data.