Friday, September 11, 2015

Possible project: natural language queries, or answering "how many species are there?"

Google knows how many species there are. More significantly, it knows what I mean when I type in "how many species are there". Wouldn't it be nice to be able to do this with biodiversity databases? For example, how many species of insect are found in Fiji? How would you answer this question? I guess you'd Google it, looking for a paper. Or you'd look in vain on GBIF, and then end up hacking together some API queries to process the data and come up with an estimate. Why can't we just ask?

On the face of it, natural language queries are hard, but there's been a lot of work done in this area. Furthermore, there's a nice connection with the idea of knowledge graphs. One approach to natural language parsing is to convert a natural language query to a path in a knowledge graph (or, if you're Facebook, the social graph). Facebook has some nice posts describing how their graph search works (e.g., Under the Hood: Building out the infrastructure for Graph Search), and there's a paper describing some of the infrastructure ("Unicorn: a system for searching the social graph", doi:10.14778/2536222.2536239).
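To make the "query as a path in a graph" idea concrete, here is a minimal sketch in Python. It is a toy, not a description of how Facebook's system works: the graph, the node names (`species_a` and so on), the edge labels `has_species` and `found_in`, and the single regular expression are all assumptions made up for illustration.

```python
import re

# Hypothetical toy knowledge graph: (subject, predicate) -> set of objects.
# Every node name and edge label here is invented for illustration.
EDGES = {
    ("insects", "has_species"): {"species_a", "species_b", "species_c"},
    ("species_a", "found_in"): {"Fiji"},
    ("species_b", "found_in"): {"Fiji", "Samoa"},
    ("species_c", "found_in"): {"Samoa"},
}

# One hand-written pattern for one family of questions (an assumption;
# a real parser would need to cover far more phrasings).
PATTERN = re.compile(
    r"how many species of (?P<taxon>\w+) are (?:found|there) in (?P<place>\w+)\??$",
    re.IGNORECASE,
)


def parse(question):
    """Return (taxon, place) if the question fits the pattern, else None."""
    match = PATTERN.match(question.strip())
    if match is None:
        return None
    return match.group("taxon").lower(), match.group("place")


def count_species(taxon, place):
    """Follow taxon -has_species-> species -found_in-> place and count hits."""
    species = EDGES.get((taxon, "has_species"), set())
    return sum(1 for s in species if place in EDGES.get((s, "found_in"), set()))


parsed = parse("How many species of insects are found in Fiji?")
if parsed is not None:
    print(count_species(*parsed))
```

The point of the sketch is only that the question reduces to two slots (a taxon and a place) plus a fixed two-step path through the graph; the hard parts, such as synonyms, taxonomic hierarchy, and place name resolution, are left out entirely.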

Natural language queries can seem potentially unbounded, in the sense that the user could type in anything. But there are ways to constrain this, and ways to anticipate what the user is after. For example, Google suggests what you may be after, which gives us clues as to the sort of questions we'd need answers for. It would be a fun exercise to use Google suggest to discover what questions people are asking about biodiversity, then determine what it would take to be able to answer them.

All very sensible questions that existing biodiversity databases would struggle to answer.

There's a nice presentation by Kenny Bastani where he tackles the problem of bounding the set of possible questions by first generating the questions for which he has answers, then caching those so that the user can select from them (using, for example, a type-ahead interface).

Hence, we could generate species counts for all major and/or charismatic taxa for each country, habitat type (or other meaningful category), then generate the corresponding query (e.g., "how many species of birds are there in Fiji", where the yellow and cyan terms are the things we replace for each query).
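A minimal sketch of this generate-then-cache approach might look like the following. It is not Bastani's implementation; the template string, the function names, and especially the counts (deliberately fake placeholder values, not real species numbers) are assumptions for illustration.

```python
# Template with the two slots we replace for each query.
TEMPLATE = "how many species of {taxon} are there in {place}"

# Placeholder counts keyed by (taxon, place). These numbers are NOT real
# data; in practice they would be precomputed from a biodiversity database.
PLACEHOLDER_COUNTS = {
    ("birds", "Fiji"): 111,
    ("insects", "Fiji"): 222,
    ("birds", "Samoa"): 333,
}


def build_question_cache(counts):
    """Map each generated question string to its precomputed answer."""
    return {
        TEMPLATE.format(taxon=taxon, place=place): n
        for (taxon, place), n in counts.items()
    }


def suggest(cache, prefix, limit=5):
    """Type-ahead: return cached questions that start with the typed prefix."""
    typed = prefix.strip().lower()
    return sorted(q for q in cache if q.lower().startswith(typed))[:limit]


cache = build_question_cache(PLACEHOLDER_COUNTS)
for question in suggest(cache, "how many species of b"):
    print(question, "->", cache[question])
```

Because the user can only pick from questions that were generated from known answers, every selectable question is guaranteed to have a result. A linear scan over the cache is fine for a sketch; a larger question set would probably want a trie or a search index behind the type-ahead.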

One reason this topic appeals is that it is intimately linked to the idea of a biodiversity knowledge graph, in that answers to a number of questions in biodiversity can be framed as paths in that graph. So, if we build the graph we should also be asking about ways to query it. In particular, how do we answer the most basic questions about the information we are aggregating in myriad databases?