Monday, April 03, 2023

ChatGPT, semantic search, and knowledge graphs

One thing about ChatGPT is it has opened my eyes to some concepts I was dimly aware of but am only now beginning to fully appreciate. ChatGPT enables you ask it questions, but the answers depend on what ChatGPT “knows”. As several people have noted, what would be even better is to be able to run ChatGPT on your own content. Indeed, ChatGPT itself now supports this using plugins.

Paul Graham GPT

However, it’s still useful to see how to add ChatGPT functionality to your own content from scratch. A nice example of this is Paul Graham GPT by Mckay Wrigley. Mckay Wrigley took essays by Paul Graham (a well known venture capitalist) and built a question and answer tool very like ChatGPT.

Because you can send a block of text to ChatGPT (as part of the prompt) you can get ChatGPT to summarise or transform that information, or answer questions based on that information. But there is a limit to how much information you can pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt for example. So a solution is to do some preprocessing. For example, given a question such as “How do I start a startup?” we could first find the essays that are most relevant to this question, then use them to create a prompt for ChatGPT. A quick and dirty way to do this is simply do a text search over the essays and take the top hits. But we aren’t searching for words, we are searching for answers to a question. The essay with the best answer might not include the phrase “How do I start a startup?”.

Enter Semantic search. The key concept behind semantic search is that we are looking for documents with similar meaning, not just similarity of text. One approach to this is to represent documents by “embeddings”, that is, a vector of numbers that encapsulate features of the document. Documents with similar vectors are potentially related. In semantic search we take the query (e.g., “How do I start a startup?”), compute its embedding, then search among the documents for those with similar embeddings.

To create Paul Graham GPT Mckay Wrigley did the following. First he sent each essay to the OpenAI API underlying ChatGPT, and in return he got the embedding for that essay (a vector of 1536 numbers). Each embedding was stored in a database (Mckay uses Postgres with pgvector). When a user enters a query such as “How do I start a startup?” that query is also sent to the OpenAI API to retrieve its embedding vector. Then we query the database of embeddings for Paul Graham’s essays and take the top five hits. These hits are, one hopes, the most likely to contain relevant answers. The original question and the most similar essays are then bundled up and sent to ChatGPT which then synthesises an answer. See his GitHub repo for more details. Note that we are still using ChatGPT, but on a set of documents it doesn’t already have.

Knowledge graphs

I’m a fan of knowledge graphs, but they are not terribly easy to use. For example, I built a knowledge graph of Australian animals Ozymandias that contains a wealth of information on taxa, publications, and people, wrapped up in a web site. If you want to learn more you need to figure out how to write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write the SPARQL queries for us, but it would be much more fun to be simply ask natural language queries (e.g., “who are the experts on Australian ants?”). I made some naïve notes on these ideas Possible project: natural language queries, or answering “how many species are there?” and Ozymandias meets Wikipedia, with notes on natural language generation.

Of course, this is a well known problem. Tools such as RDF2vec can take RDF from a knowledge graph and create embeddings which could in tern be used to support semantic search. But it seems to me that we could simply this process a bit by making use of ChatGPT.

Firstly we would generate natural language statements from the knowledge graph (e.g., “species x belongs to genus y and was described in z”, “this paper on ants was authored by x”, etc.) that cover the basic questions we expect people to ask. We then get embeddings for these (e.g., using OpenAI). We then have an interface where people can ask a question (“is species x a valid species?”, “who has published on ants”, etc.), we get the embedding for that question, retrieve natural language statements that the closest in embedding “space”, package everything up and ask ChatGPT to summarise the answer.

The trick, of course, is to figure out how t generate natural language statements from the knowledge graph (which amounts to deciding what paths to traverse in the knowledge graph, and how to write those paths is something approximating English). We also want to know something about the sorts of questions people are likely to ask so that we have a reasonable chance of having the answers (for example, are people going to ask about individual species, or questions about summary statistics such as numbers of species in a genus, etc.).

What makes this attractive is that it seems a straightforward way to go from a largely academic exercise (build a knowledge graph) to something potentially useful (a question and answer machine). Imagine if something like the defunct BBC wildlife site (see Blue Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and opportunities lost) revived here had a question and answer interface where we could ask questions rather than passively browse.

Summary

I have so much more to learn, and need to think about ways to incorporate semantic search and ChatGPT-like tools into knowledge graphs.

Written with StackEdit.