Saturday, June 17, 2023

A taxonomic search engine

How to cite: Page, R. (2023). A taxonomic search engine. https://doi.org/10.59350/r3g44-d5s15

Tony Rees commented on my recent post Ten years and a million links. I’ve responded to some of his comments, but I think the bigger question deserves more space, hence this blog post.

Tony’s comment

Hi Rod, I like what you’re doing. Still struggling (a little) to find the exact point where it answers the questions that are my “entry points” so to speak, which (paraphrasing a post of yours from some years back) start with:

  • Is this a name that “we” (the human race I suppose) recognise as having been used for a taxon (think Global Names Resolver, Taxamatch, etc.) - preferably an automatable query and response (i.e., a machine can ask it and incorporate the result into a workflow)
  • Does it refer to a currently accepted taxon or if not, what is the accepted equivalent
  • What is its taxonomic placement (according to one or a range of expert systems)
  • Also, for reporting/comparison/analysis purposes…
    - How many accepted taxa (at whatever rank) are currently known in group X (or the whole world)
    - How many new names (accepted or unaccepted) were published in year A (or date range A-C)
    - How many new names were published (or co-authored) by author Z
  • (and probably more)

Having access to more of the primary literature is great, and necessary, but does not help me in those respects (since the published works must still be parsed by a human, not a machine). But maybe it does answer some other questions like how many original works were published by author Z, in a particular time frame.

Of course as you will be aware, using ORCIDs for authors is only a small portion of the puzzle, since ORCIDs are not issued for deceased authors, or those who never request one, so far as I am aware.

None of the above is a criticism of what you are doing! Just trying to see if I can establish any new linkages to what you are doing which will enable me to automate portions of my own efforts to a greater degree (let machines do things that currently still require a human). So far (as evidenced by the most recent ION data dump you were able to supply) it is giving me a DOI in many cases as a supplement to the title of the original work (per ION/Zoological Record) which is something of a time saver in my quest to read the original work (from which I could extract the DOI as well once reached) but does not really automate anything since I still have to try and find it in order to peruse the content.

Mostly random thoughts above, possibly of no use, but I do ruminate on the universe of connected “things not strings” in the hope that one day life will get easier for biodiversity informatics workers, or maybe that the “book of life” will be self-assembling…

My response

I think there are several ways to approach this. I’ll walk through them below, but TL;DR

  • Define the questions we have and how we would get the answers. For example, what combination of database and SQL queries, or web site and API calls, or knowledge graph and SPARQL do we need to answer each question? (A sketch of one question-to-query mapping follows this list.)
  • Decide what sort of interface(s) we want. Do we want a web site with a search box, a bunch of API calls, or a natural language interface?
  • If we want natural language, how do we do that? Do we want a ChatBot?
  • And as an aside, how can we speed up reading the taxonomic literature?
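
To make the first of those bullets concrete, here is a minimal sketch of one question-to-query mapping, “how many new names were published in year A”, run against the public Wikidata SPARQL endpoint. It assumes the Wikidata properties P225 (taxon name) and P574 (year of publication of scientific name for taxon); Wikidata’s coverage of taxonomic names is far from complete, so this illustrates the mapping rather than giving an authoritative answer.

```python
import requests

# Sketch: answer "How many new names were published in year A?" against
# Wikidata. Assumes P225 (taxon name) and P574 (year of publication of
# scientific name for taxon) are populated for the taxa of interest.

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT (COUNT(DISTINCT ?taxon) AS ?count) WHERE {
  ?taxon wdt:P225 ?name ;
         wdt:P574 ?published .
  FILTER(YEAR(?published) = 1920)
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "taxonomic-search-sketch/0.1 (example only)"},
)
bindings = response.json()["results"]["bindings"]
print(bindings[0]["count"]["value"])  # number of names Wikidata dates to 1920
```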

The following are more notes than a reasoned essay. I wanted to record a bunch of things to help me think about these topics.

Build a natural language search engine

One of the first things I read that opened my eyes to the potential of OpenAI-powered tools and how to build them was Paul Graham GPT, which I talked about here. This is a simple question and answer tool that takes a question and returns an answer, based on Paul Graham’s blog posts. We could do something similar for taxonomic names (or indeed, anything where we have some text and want to query it). At its core we have a bunch of blocks of text and embeddings for those blocks; we then get an embedding for the question and find the blocks of text whose embeddings best match it.
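
A minimal sketch of that pattern, assuming the (pre-1.0) openai Python package, the text-embedding-ada-002 model, and a couple of placeholder sentences standing in for real blocks of text:

```python
import numpy as np
import openai  # assumes the pre-1.0 openai package, with OPENAI_API_KEY set

# Embed blocks of text once, embed each question as it arrives, and return
# the best-matching blocks. The two "blocks" below are placeholders.

blocks = [
    "Name x was published by author y in 1920.",
    "Name x is treated as a junior synonym of name z.",
]

def embed(texts):
    """Return one embedding vector per input string."""
    result = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in result["data"]])

block_vectors = embed(blocks)

def best_blocks(question, top_k=1):
    """Return the block(s) whose embeddings best match the question."""
    q = embed([question])[0]
    scores = block_vectors @ q  # ada-002 vectors are unit length, so this is cosine similarity
    return [blocks[i] for i in np.argsort(scores)[::-1][:top_k]]

print(best_blocks("Who published name x, and when?"))
```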

Generating queries

One approach is to use ChatGPT to formulate a database query based on a natural language question. There have been a bunch of people exploring generating SPARQL queries from natural language, e.g. ChatGPT for Information Retrieval from Knowledge Graph, ChatGPT Exercises — Generating a Course Description Knowledge Graph using RDF, and Wikidata and ChatGPT, and this could be explored for other query languages.

So in this approach we take natural language questions and get back the queries we need to answer those questions. We then go away and run those queries.
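
As a sketch of what that could look like, the snippet below asks gpt-3.5-turbo to translate a question into SPARQL, given a short description of the graph. The schema hint reuses the Wikidata properties from the earlier sketch; a local knowledge graph would need its own predicates described instead, and the generated query still needs to be checked before it is run.

```python
import openai  # assumes the pre-1.0 openai package, with OPENAI_API_KEY set

# Ask the model to translate a question into SPARQL, given a short
# description of the graph. The schema hint reuses the Wikidata properties
# from the earlier sketch; a local knowledge graph would need its own.

SCHEMA_HINT = """
Taxa are Wikidata items with:
  wdt:P225  taxon name (string)
  wdt:P171  parent taxon (item)
  wdt:P574  year of publication of the scientific name (date)
"""

def question_to_sparql(question):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You translate questions into SPARQL for the Wikidata "
                        "query service. Reply with the query only.\n" + SCHEMA_HINT},
            {"role": "user", "content": question},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(question_to_sparql("How many species are there in genus Aus?"))
# The generated query should be checked, then run against the endpoint as
# in the earlier requests example.
```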

Generating answers

This still leaves us with what to do with the answers. Given, say, a SPARQL response, we could have code that generates a bunch of simple sentences from that response, e.g. “name x is a synonym of name y”, “there are ten species in genus x”, “name x was published in the paper zzz”, etc. We then pass those sentences to an AI to summarise into nicer natural language. We should aim for something like the Wikipedia-derived snippets from DBpedia (see Ozymandias meets Wikipedia, with notes on natural language generation). Indeed, we could help make more engaging answers by adding DBpedia snippets for the relevant taxa, abstracts from relevant papers, etc. to the SPARQL results and ask the AI to summarise all of that.
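
A sketch of that “verbalise then summarise” step, with made-up bindings standing in for a real SPARQL response:

```python
import openai  # assumes the pre-1.0 openai package, with OPENAI_API_KEY set

# Turn SPARQL result rows into plain sentences, then ask the model to write
# them up as readable prose. The bindings below are made up for illustration.

bindings = [
    {"species": "Aus bus", "author": "Smith", "year": "1920"},
    {"species": "Aus cus", "author": "Smith", "year": "1923"},
]

sentences = [
    f"{row['species']} was described by {row['author']} in {row['year']}."
    for row in bindings
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system",
         "content": "Summarise the user's facts as one short, readable "
                    "paragraph. Do not add information that is not in the facts."},
        {"role": "user", "content": "\n".join(sentences)},
    ],
)
print(response["choices"][0]["message"]["content"])
```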

Skipping queries altogether

Another approach is to generate all the answers ahead of time. Essentially, we take our database or knowledge graph and generate simplified sentences summarising everything we know: “species x was described by author y in 1920”, “species x was synonymised with species y in 1967”, etc. We then get embeddings for these answers, store them in a vector database, and query them using a chatbot-style interface.
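
The verbalisation step can be as simple as a template per predicate, as in the sketch below (the predicates and triples are illustrative, not a real schema); the resulting sentences are then embedded and matched against questions exactly as in the earlier sketch.

```python
# Pre-generate "answers" from a knowledge graph: a simple template per
# predicate turns triples into sentences. The predicates and triples are
# illustrative, not a real schema.

TEMPLATES = {
    "describedBy": "{s} was described by {o}.",
    "synonymOf":   "{s} is a synonym of {o}.",
    "parentTaxon": "{s} belongs to {o}.",
}

triples = [
    ("Aus bus", "describedBy", "Smith, 1920"),
    ("Aus bus", "synonymOf", "Aus cus"),
]

sentences = [TEMPLATES[p].format(s=s, o=o) for (s, p, o) in triples if p in TEMPLATES]
print(sentences)
# ['Aus bus was described by Smith, 1920.', 'Aus bus is a synonym of Aus cus.']
# Next step: embed these sentences, load them into a vector database, and
# match incoming questions against them, as in the earlier sketch.
```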

There is a big literature on embedding RDF (see RDF2vec.org), and also converting RDF to sentences. These “RDF verbalisers” are further discussed on the WebNLG Challenge pages, and an example is here: jsRealB - A JavaScript Bilingual Text Realizer for Web Development.

This approach is like the game Jeopardy!: we generate all the answers and the goal is to match the user’s question to one or more of those answers.

Machine readability

Having access to more of the primary literature is great, and necessary, but does not help me in those respects (since the published works must still be parsed by a human, not a machine).

This is a good point, but help is at hand. There are a bunch of AI tools to “read” the literature for you, such as SciSpace’s Copilot. I think there’s a lot we could do to explore these tools. We could also couple them with the name–publication links in the ten year library. For example, if we know that there is a link between a name and a DOI, and we have the text for the article with that DOI, we could then ask targeted questions regarding what the paper says about that name. One way to implement this is to do something similar to the Paul Graham GPT demo described above. We take the text of the paper, chunk it into smaller blocks (e.g., paragraphs), get embeddings for each block, add those to a vector database, and we can then search that paper (and others) using natural language. We could imagine an API that takes a paper and splits out the core “facts” or assertions that the paper makes. This also speaks to Notes on collections, knowledge graphs, and Semantic Web browsers, where I bemoaned the lack of a semantic web browser.
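
As a sketch of that, the snippet below chunks a paper into paragraphs, retrieves the paragraphs most relevant to a question about a name, and asks the model to answer using only those excerpts. The file name, the taxon name being asked about, and the models used are placeholders/assumptions, not part of any existing tool.

```python
import numpy as np
import openai  # assumes the pre-1.0 openai package, with OPENAI_API_KEY set

# Ask a targeted question about a name, given the full text of the paper its
# DOI points to. The file, the name, and the models are placeholders.

paper_text = open("paper.txt").read()                         # text fetched via the DOI
chunks = [p for p in paper_text.split("\n\n") if p.strip()]   # crude paragraph chunking

def embed(texts):
    result = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in result["data"]])

chunk_vectors = embed(chunks)

def ask(question, top_k=3):
    """Answer a question using only the most relevant paragraphs."""
    q = embed([question])[0]
    best = np.argsort(chunk_vectors @ q)[::-1][:top_k]
    context = "\n\n".join(chunks[i] for i in best)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer using only the excerpts below; say so if they "
                        "do not contain the answer.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(ask("What does this paper say about the name Aus bus?"))
```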

Summary

I think the questions being asked are all relatively straightforward to answer; we just need to think a little about the best way to answer them. Much of what I’ve written above is focussed on making such a system more broadly useful and engaging, with richer answers than a simple database query would give. But a first step is to define the questions and the queries that would answer them, then figure out what interface to wrap this up in.

Written with StackEdit.