Interesting! Could you give an example with a bit more specific detail here? I take it there's some kind of work output, like a report, in a semi-structured format, and the goal is to automate creation of these. And you would provide a UX that lets them explain what they want the system to create?
There are multiple long-form text inputs: one set provided by User A and another set by User B. User A's inputs act as a prompt for User B, and then User A analyzes User B's inputs according to the original User A inputs, producing an output.
My system takes the User A and B inputs and produces the output with more accuracy and precision than User As do, by a wide margin.
Instead of trying to train a model on all the history of these inputs and outputs, the solution was a combination of a goal -> job -> task breakdown (like a fixed agentic process) and lots of context and prompt engineering. I then test against customer legacy samples and inspect any variances by hand. At first the variances were usually system errors, which informed improvements to the context and prompt engineering. After working through about a thousand of these iterations (test -> inspect variance -> if system mistake, improve system -> repeat), and benefiting from a couple of base-model upgrades, the variances are now about 99.9% user error (bad historical data or user inputs) and 0.1% system error. Overall it took about 9 months to build, and this one niche is worth ~$30m a year in revenue easily, and everywhere I look there are market niches like this... it's ridiculous. (And a basic chat interface like ChatGPT doesn't work for these types of problems, no matter how smart it gets, for a variety of reasons.)
So to summarize:
Instead of training a model on the historical inputs and outputs, the solution was to use the best base model LLMs, a pre-determined agentic flow, thoughtful system prompt and context engineering, and an iterative testing process with a human in the loop (me) to refine the overall system by carefully comparing the variances between system outputs and historical customer input/output samples.
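A rough sketch of that regression loop, with run_agentic_flow and the sample format as hypothetical stand-ins rather than the actual system:

    # Hypothetical regression harness: run the agentic flow over historical
    # customer samples and collect any variances for hand inspection.
    import difflib

    def run_regression(samples, run_agentic_flow):
        """samples: dicts with 'id', 'user_a_inputs', 'user_b_inputs', 'expected_output'."""
        variances = []
        for sample in samples:
            produced = run_agentic_flow(sample["user_a_inputs"], sample["user_b_inputs"])
            if produced.strip() != sample["expected_output"].strip():
                diff = "\n".join(difflib.unified_diff(
                    sample["expected_output"].splitlines(),
                    produced.splitlines(),
                    fromfile="historical", tofile="system", lineterm=""))
                variances.append({"id": sample["id"], "diff": diff})
        # Each variance gets inspected by hand: if it's a system error, the
        # prompts and context engineering get improved and the loop runs again.
        return variances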
Since the Claude Code docs suggest installing Ripgrep, my guess is that Claude Code often runs searches to find the snippets it needs to work on and pull them into the context.
I would argue that this is still RAG. There's a common misconception (or at least I think it's a misconception) that RAG only counts if you used vector search - I like to expand the definition of RAG to include non-vector search (like Ripgrep in this case), or any other technique where you use Retrieval techniques to Augment the Generation phase.
I agree that retrieval can take many forms besides vector search, but do we really want to call it RAG if the model is directing the search using a tool call? That seems like an important distinction to me, and the name "agentic search" makes a lot more sense IMHO.
Yes, I think that's RAG. It's Retrieval Augmented Generation - you're retrieving content to augment the generation.
Who cares if you used vector search for the retrieval?
The best vector retrieval implementations are already switching to a hybrid between vector and FTS, because it turns out BM25 etc. is still a better algorithm for a lot of use-cases.
"Agentic search" makes much less sense to me because the term "agentic" is so incredibly vague.
I think it depends on who "you" is. In classic RAG the search mechanism is preordained: the search is done up front and the results handed to the model pre-baked. I'd interpret "agentic search" as anything where the model potentially has a collection of search tools and can decide how best to use them for a given query, so the search algorithm, the query, and the number of searches are all under its own control.
Exactly. Was the extra information pushed to the model as part of the query? It’s RAG. Did the model pull the extra information in via a tool call? Agentic search.
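The push/pull distinction in sketch form; search and llm here are stand-ins rather than real APIs, the shape of the loop is the point:

    # Classic RAG: retrieval happens before the model sees anything ("push").
    def classic_rag(question, search, llm):
        chunks = search(question, top_k=5)              # search is preordained
        prompt = "\n\n".join(chunks) + "\n\nQuestion: " + question
        return llm(prompt)                              # model just generates

    # Agentic search: the model decides whether and how to search ("pull").
    def agentic_search(question, search, llm):
        messages = [{"role": "user", "content": question}]
        while True:
            reply = llm(messages, tools=[search])       # model may ask for a tool
            if reply.get("tool_call") is None:
                return reply["content"]
            results = search(**reply["tool_call"]["arguments"])
            messages.append({"role": "tool", "content": results})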
rag is an acronym with a pinned meaning now, just like the word drone. "drone" didn't originally mean what it means today, but that's what it means now. no amount of complaining will fix it. :[
1. RAG: A simple model looks at the question, pulls up some associated data into the context and hopes that it helps.
2. Self-RAG: The model "intentionally"/agentically triggers a lookup for some topic. This can be via traditional RAG or just a string search, i.e. grep.
3. Full Context: Just jam everything in the context window. The model uses its attention mechanism to pick out the parts it needs. Best but most expensive of the three, especially with repeated queries.
Aider uses kind of a hybrid of 2 and 3: you specify files that go in the context, but Aider also uses Tree-Sitter to get a map of the entire codebase (function headers, class definitions, etc.) that is provided in full. On that basis, the model can then request additional files to be added to the context.
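A toy version of that repo-map idea, using Python's ast module rather than Tree-Sitter, so this is an illustration of the concept and not how Aider actually does it:

    # Build a lightweight "map" of a Python codebase: file paths plus top-level
    # function and class signatures, cheap enough to put in the context wholesale.
    import ast, pathlib

    def repo_map(root="."):
        lines = []
        for path in pathlib.Path(root).rglob("*.py"):
            try:
                tree = ast.parse(path.read_text())
            except (SyntaxError, UnicodeDecodeError):
                continue
            for node in tree.body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    args = ", ".join(a.arg for a in node.args.args)
                    lines.append(f"{path}: def {node.name}({args})")
                elif isinstance(node, ast.ClassDef):
                    lines.append(f"{path}: class {node.name}")
        return "\n".join(lines)

    # The model reads this map and can then ask for specific files to be added.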
I'm still not sure I get the difference between 1 and 2. What is "pulls up some associated data into the context" vs ""intentionally"/agentically triggers a lookup for some topic"?
1. Tends to use embeddings with a similarity search; sometimes called "retrieval". This is faster, but similarity search doesn't always work quite as well as you might want it to (see the sketch after this list).
2. Instead lets the agent decide what to bring into context by using tools on the codebase. Since the tools used are fast enough, this gives you effectively "verified answers" so long as the agent didn't screw up its inputs to the tool (which will happen, most likely).
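A bare-bones sketch of approach 1, with embed standing in for whatever embedding model you use:

    # Approach 1 in a nutshell: embed the chunks once, embed the query, return
    # the nearest chunks by cosine similarity and hope they contain the answer.
    import numpy as np

    def build_index(chunks, embed):
        vectors = np.array([embed(c) for c in chunks])
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        return chunks, vectors

    def retrieve(query, index, embed, top_k=5):
        chunks, vectors = index
        q = np.asarray(embed(query))
        q = q / np.linalg.norm(q)
        scores = vectors @ q                        # cosine similarity
        return [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]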
Does it make sense to use vector search for code? It seems more suited to vague text; in code, the relevant parts can usually be found by exact name match. (In most cases, anyway; the two methods aren't exclusive.)
Vector search for code can be quite interesting - I've used it for things like "find me code that downloads stuff" and it's worked well. I think text search is usually better for code though.
Interesting! So you basically got an LM to rephrase the search phrase/keys into the style of the target documents, then used that in the RAG pipeline? Did you do an initial search first to limit the documents?
IIUC they're doing some sort of "q/a" for each chunk from documents, where they ask an LLM to "play the user role and ask a question that would be answered by this chunk". They then embed those questions, and match live user queries with those questions first, then maybe re-rank on the document chunks retrieved.
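If that reading is right, the indexing side would look roughly like this (llm and embed are placeholders):

    # For each chunk, have an LLM write the question the chunk answers, embed
    # that question, and keep a pointer back to the chunk. Live queries are
    # then matched against the synthetic questions, not the chunks themselves.
    def index_chunks(chunks, llm, embed):
        index = []
        for chunk in chunks:
            question = llm(
                "Play the user role: ask one question that would be "
                f"answered by this passage:\n\n{chunk}")
            index.append({"question_vector": embed(question), "chunk": chunk})
        return index

    # Query time: embed the user query, find the nearest question_vector(s)
    # (same cosine search as in the earlier sketch), return the chunks, then
    # optionally re-rank on the chunk text itself.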