All you’re doing here is “front loading” AI: instead of running slow and expensive LLMs at query time, you run them at index time.
It’s a method for data augmentation or, in database lingo, index building. You use LLMs to add context to chunks that doesn’t exist on either the word level (searchable by BM25) or the semantic level (searchable by embeddings).
A simple version of this would be to ask an LLM:
“List all questions this chunk is answering.” [0]
But you can do the same thing for time frames, objects, styles, emotions — whatever you need a “handle” for to later retrieve via BM25 or semantic similarity. A rough sketch of that index-time pass is below.
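To make that concrete, here is what the index-time pass could look like. This is a minimal sketch, assuming a generic llm callable (prompt in, completion out); the prompt wording and record fields are placeholders, not any particular library’s API:

    from typing import Callable, List

    AUGMENT_PROMPT = """Read the text below and list every question it answers.
    Output one question per line, prefixed with "Q: ". Nothing else.

    Text:
    {chunk}
    """

    def questions_for_chunk(chunk: str, llm: Callable[[str], str]) -> List[str]:
        """Ask the LLM which questions this chunk answers; parse the fixed format."""
        raw = llm(AUGMENT_PROMPT.format(chunk=chunk))
        return [line[3:].strip() for line in raw.splitlines() if line.startswith("Q: ")]

    def build_augmented_records(chunks: List[str], llm: Callable[[str], str]):
        """Pair each chunk with its synthetic 'handles' for indexing.

        Both the original chunk and its generated questions go into BM25 and
        the embedding index, but every record points back to the same chunk.
        """
        records = []
        for i, chunk in enumerate(chunks):
            for q in questions_for_chunk(chunk, llm):
                records.append({"chunk_id": i, "text": q, "kind": "synthetic_question"})
            records.append({"chunk_id": i, "text": chunk, "kind": "original"})
        return records

At query time you search over all records (original and synthetic) but return the underlying chunks, deduplicated by chunk_id.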
I dreamed of doing that back in 2020, but it would’ve been prohibitively expensive, because it requires passing your whole corpus through an LLM, possibly multiple times, once for each “angle”.
That being said, I recommend running any “Graph RAG” system you see here on HN over 1% or so of your data, and then looking inside the database at all the text chunks, original and synthetic, that are now in your index.
I’ve done this for a consulting client who absolutely wanted “Graph RAG”. I found the result to be an absolute mess. That is because these systems are built to cover a broad range of applications and are not adapted at all to your problem domain.
So I prefer working backwards:
What kinds of queries do I need to handle? What does the prompt to my query-time LLM need to look like? What context will the LLM need? How can I have this context for each of my chunks, and be able to search it by keyword match or semantic similarity? And how can I make an LLM return exactly that kind of context, with as few hallucinations and as little filler as possible, for each of my chunks?
This gives you a very lean, very efficient index that can do everything you want.
[0] For a prompt, you’d add context and give the model “space to think”, especially when using a smaller model. Also, you’d instruct it to use a particular format, so you can parse out the part that you need. This “unfancy” approach lets you switch out models easily and compare them against each other without having to care about different APIs for “structured output”.
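As an illustration of that footnote, here is one way the “space to think” plus fixed output format could be wired up. The REASONING:/FINAL: markers and the models dict are illustrative choices of mine, not tied to any provider’s API:

    import re
    from typing import Callable, Dict, List

    PROMPT = """You will extract the time frames mentioned in the text.

    First, think step by step under a "REASONING:" heading.
    Then give your answer under a "FINAL:" heading as a comma-separated list.

    Text:
    {chunk}
    """

    def parse_final(completion: str) -> List[str]:
        """Keep only the part after FINAL:, discarding the thinking space."""
        match = re.search(r"FINAL:\s*(.+)", completion, re.DOTALL)
        if not match:
            return []  # a malformed completion is dropped, not guessed at
        return [item.strip() for item in match.group(1).split(",") if item.strip()]

    def compare_models(chunk: str, models: Dict[str, Callable[[str], str]]):
        """Run the same plain-text prompt through several models and compare
        the parsed outputs -- no per-provider 'structured output' API needed."""
        return {name: parse_final(llm(PROMPT.format(chunk=chunk)))
                for name, llm in models.items()}

Because the prompt and parser are plain text, swapping one model for another is just swapping the callable.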
Prompts are a great place to look for these, but the part you linked to isn't very important for knowledge graph generation. It is doing an initial semantic breakdown into more manageable chunks. The entity and fact extraction that actually turns this into a knowledge graph is this one:
GraphRAG and a lot of these semantic indexes are simply vector databases with pre-computed similarity edges, which you can't perform any reasoning over (the whole definition and intention of a knowledge graph).
This is probably worth looking at; it's the first open-source project I've seen that is actually using LLMs to generate knowledge graphs. It does look pretty primitive for that task, but it might be a useful reference for others going down this road.
To my knowledge most graph RAG implementations, including the Microsoft research project, rely on LLM entity extraction (subject-predicate-object triplets) to build the graph.
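For anyone unfamiliar, a bare-bones sketch of that triplet-extraction step might look like the following. The prompt wording and the pipe-separated format are my assumptions, not the Microsoft GraphRAG prompts:

    from typing import Callable, Dict, List, Tuple

    TRIPLET_PROMPT = """Extract factual (subject | predicate | object) triplets
    from the text. One triplet per line, fields separated by " | ".

    Text:
    {chunk}
    """

    def extract_triplets(chunk: str, llm: Callable[[str], str]) -> List[Tuple[str, str, str]]:
        """Parse pipe-separated triplets out of the LLM completion."""
        triplets = []
        for line in llm(TRIPLET_PROMPT.format(chunk=chunk)).splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3 and all(parts):
                triplets.append((parts[0], parts[1], parts[2]))
        return triplets

    def build_graph(chunks: List[str], llm: Callable[[str], str]) -> Dict[str, List[Tuple[str, str]]]:
        """Accumulate triplets into a simple adjacency map keyed by subject."""
        graph: Dict[str, List[Tuple[str, str]]] = {}
        for chunk in chunks:
            for subj, pred, obj in extract_triplets(chunk, llm):
                graph.setdefault(subj, []).append((pred, obj))
        return graph

Real implementations add entity deduplication and link edges back to source chunks, but the core extraction loop is roughly this shape.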
>I found the result to be an absolute mess. That is because these systems are built to cover a broad range of applications and are not adapted at all to your problem domain.
Same findings here, re: legal text. Basic hybrid search performs better. In this use case the user knows what to look for, so the queries are specific. The advantage of graph RAG is when you need to integrate disparate sources for a holistic overview.
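For reference, the "basic hybrid search" I mean is nothing fancier than fusing a BM25 ranking with a vector ranking, e.g. via reciprocal rank fusion; the k=60 constant and function names below are illustrative:

    from typing import Dict, List

    def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
        """Each ranking is a list of doc ids, best first. Fuse by summing
        1 / (k + rank) across rankings and sorting by the fused score."""
        scores: Dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # usage: fuse the top-n ids from a BM25 index and from a vector index
    # fused = reciprocal_rank_fusion([bm25_top_ids, vector_top_ids])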
If you have to deal with domain-specific data, this would not work as well. It will get you an incremental improvement (from what I see, it's just creating explicit relationships at index time instead of letting the model infer them at runtime before generating an output; effective, but only incrementally, and it depends on the type of data), though not enough to justify redoing your own pipeline. You are likely better off keeping your current approach and developing robust evals.
If you want a transformational shift in terms of accuracy and reasoning, the answer is different. Often RAG accuracy suffers because the text is out of distribution and ICL does not work well. You get away with it if all your data is in the public domain in some form (ergo, the LLM was trained on it); otherwise you keep seeing the gaps with no way to bridge them. I published a paper about this and how to efficiently solve it, if interested. Here is a simplified blog post on the same: https://medium.com/@ankit_94177/expanding-knowledge-in-large...
Edit: Please reach out here or by email if you would like further details. I might have skipped over too much in the comment above.