From a consumer perspective, this is a super interesting paper because it touches on one of the fundamental issues with most RAG systems beyond the toy case: you need to do different things depending on what the user is asking for. You also (usually) can't just ask, because most users don't know that LLMs are bad at math or that semantic search won't be sufficient to answer questions involving enumeration or totality. And while you can always add more steps to your RAG pipeline, some of those steps may be computationally expensive or not particularly relevant to the question at hand.
That being said, it is a bit frustrating that so much RAG research focuses on multi-hop approaches with LLMs. IME, multiple round trips to an LLM are essentially a non-starter for any serious consumer product because they're far too slow. Smaller models can struggle to follow instructions, so they often aren't an adequate replacement even for simpler tasks. Curious to hear if other folks working in this space have had any success thinking critically about these types of problems!
That depends on the model; you can run things in parallel and sometimes keep everything timely. You shouldn't wait until the last second to start running RAG: you can pre-emptively build context based on the current chat (like a human does), so you've already got things summarized and ready to fire off when the final prompt does come.
Think about how a human will draw out a conversation around answering a question, using delaying words and phrases to buy time while they haven't fully formulated the solution. LLMs can use the same tactic.
Sure. If you're running a customer service chatbot, you can ask customers what the problem is, then start running RAG async to populate a proper context for a smart LLM. Have the chatbot keep asking clarifying questions, which gives the background RAG process time to fetch data and run a quick summary. Then have the chatbot give some indication that it's thinking, run the full-context query on the smart LLM, generate a summary answer, feed it back to the chat LLM with something like "I may have found a solution to your problem", and switch to the response from the smart LLM.
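Roughly what I mean, as a minimal sketch - all of the function names here (retrieve_context, small_model_clarify, big_model_answer) are placeholders standing in for real model and retrieval calls, not any actual API:

```python
# Sketch of "buy time with a cheap chat model while RAG runs in the background".
import asyncio

async def retrieve_context(problem: str) -> str:
    """Hypothetical async RAG step: embed, search, summarize."""
    await asyncio.sleep(2.0)  # stand-in for vector search + summarization
    return f"summarized context for: {problem}"

async def small_model_clarify(problem: str) -> str:
    """Hypothetical cheap/fast model asking a clarifying question."""
    await asyncio.sleep(0.3)
    return "Could you tell me roughly when the issue started?"

async def big_model_answer(problem: str, context: str, details: str) -> str:
    """Hypothetical call to the smarter model once the context is ready."""
    await asyncio.sleep(1.0)
    return f"Answer grounded in: {context} / {details}"

async def handle_chat(problem: str) -> str:
    # Kick off retrieval immediately instead of waiting for the final prompt.
    retrieval = asyncio.create_task(retrieve_context(problem))

    # Meanwhile, the chat model keeps the user engaged and gathers details.
    question = await small_model_clarify(problem)
    details = "Order #1234, placed last Tuesday"  # in practice, the user's next turn

    # By now retrieval has (hopefully) finished; await it and answer.
    context = await retrieval
    return await big_model_answer(problem, context, details)

if __name__ == "__main__":
    print(asyncio.run(handle_chat("My order never arrived")))
```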
I see what you're saying, but you're assuming that consumer products are always chatbots (and that a small language model can buy time interacting with the user while possibly providing additional context). That being said, I would be interested to see such a system in practice - any examples you can point me to? My more general point was not chat-related; much of the research around RAG seems to use LLMs to parse or route the user's query, improve retrieval, etc. which doesn't often work in practice.
This is where the opportunity for creativity comes in. You could allow chat-based refinement of search queries, or provide popup refinement buttons that narrow the search space, and build the search results iteratively rather than following the old paradigm of "search" -> "results".
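A toy sketch of that iterative flow - the session and filter names are purely illustrative, not any real library:

```python
# Each refinement (typed or from a popup button) narrows a filter set and the
# candidate pool shrinks in place, instead of one-shot "search -> results".
from dataclasses import dataclass, field

@dataclass
class SearchSession:
    query: str
    filters: dict = field(default_factory=dict)
    candidates: list = field(default_factory=list)

    def refine(self, **new_filters):
        """Apply a refinement (e.g. from a popup button) and narrow results."""
        self.filters.update(new_filters)
        self.candidates = [
            doc for doc in self.candidates
            if all(doc.get(k) == v for k, v in self.filters.items())
        ]
        return self.candidates

# Usage: seed with a broad initial retrieval, then narrow interactively.
session = SearchSession(
    query="reset password",
    candidates=[
        {"id": 1, "product": "mobile", "topic": "auth"},
        {"id": 2, "product": "web", "topic": "auth"},
        {"id": 3, "product": "web", "topic": "billing"},
    ],
)
session.refine(product="web")   # user taps the "web app" button
session.refine(topic="auth")    # user taps the "login issues" button
print(session.candidates)       # -> [{'id': 2, 'product': 'web', 'topic': 'auth'}]
```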
I'm still not sold on recall at such large context window sizes. It's easy for an LLM to find a needle in a haystack, but most RAG use cases are more like finding a needle in a stack of needles, and the benchmarks don't really reflect that. There are also the speed and cost implications of dumping millions of tokens into a prompt - it's prohibitively slow and expensive right now.
It's still much cheaper to run RAG in production (at least if you are using closed models). I'd love to use the entire context of GPT-4, but if I do that in production it'll cost much more than some RAG-based implementation.
yes - private data, real-time data, curated data, citations with no hallucinations, RAG on tabular data, RAG on video, RAG on hierarchical mixed data, RAG over a graph
This seems similar to building a RAG router (1) to perform dynamic retrieval/querying over data.
After getting hundreds of questions on my Interactive Resume AI chatbot (2), I've found the user queries can be categorized as: greetings, professional skills questions, professional experience questions, personal/hobby questions, and common interview questions.
I am currently working on building a RAG router to help improve the quality of Q&A responses. I currently use gpt-3.5-turbo without any special RAG techniques, and the quality is lacking when performing Q&A over my resume and Q&A CSV file. GPT-4 works well but is too expensive.
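A rough sketch of what I mean by a router - the categories come from above, but the route targets and the placeholder classify() call are assumptions, not a working implementation:

```python
# Classify the incoming question into one of a few intents, then route each
# intent to a different retrieval strategy (resume chunks, the Q&A CSV, or a
# canned reply) before handing context to the answering model.
ROUTES = {
    "greeting":                lambda q: "canned greeting, no retrieval needed",
    "professional_skills":     lambda q: f"vector search over resume skills for: {q}",
    "professional_experience": lambda q: f"vector search over resume experience for: {q}",
    "personal_hobby":          lambda q: f"lookup in the personal/hobby Q&A CSV for: {q}",
    "common_interview":        lambda q: f"lookup in the interview Q&A CSV for: {q}",
}

CLASSIFIER_PROMPT = (
    "Classify the user question into exactly one of: "
    + ", ".join(ROUTES) + ".\nQuestion: {question}\nCategory:"
)

def classify(question: str) -> str:
    """Placeholder for a cheap LLM call (or a small fine-tuned classifier)
    that returns one of the category names above."""
    # e.g. send CLASSIFIER_PROMPT.format(question=question) to gpt-3.5-turbo
    # and validate the reply against ROUTES; hard-coded here for illustration.
    return "professional_skills"

def answer(question: str) -> str:
    category = classify(question)
    context = ROUTES.get(category, ROUTES["greeting"])(question)
    # The retrieved context would then be fed to the answering model.
    return context

print(answer("What experience do you have with Python?"))
```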
Teaching LLMs how to search is probably going to be key to making them hallucinate far less. Most RAG approaches currently use simple vector searches to pull out information. ChatGPT is actually able to run Bing searches, and presumably Gemini uses Google's search. It's fairly clunky and unsophisticated currently.
These searches are still relatively dumb. With LLMs not being half bad at remembering a lot of things, programming simple solutions to problems, etc., a next step could be to have them come up with a query plan for retrieving the information they need to answer a question - something more sophisticated than calculating a vector for the input, fetching n results, adding those to the context, and calling it a day.
Our ability to Google solutions to problems is inferior to that of an LLM that can generate far more sophisticated, comprehensive, and exhaustive queries against a wide range of databases and sources and then filter through the massive amount of information that comes back. We could do it manually, but it would take ages. We don't actually need LLMs to know everything there is to know. We just need them to be able to know where to look and to evaluate what they find in context. Sticking to what they find rather than what they know means their answers are only as good as their ability to extract, filter, and rank information that is factual and reputable. That means hallucination becomes less of a problem, because everything can be traced back to what they found. We can train them to ask better questions rather than hallucinate better answers.
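A rough sketch of the query-plan idea - the plan format, source names, and helper functions are invented for illustration, not any existing system:

```python
# Instead of one embedding lookup, ask the model to emit a small structured
# plan (sub-queries against named sources), execute each step, and keep
# provenance so every claim can be traced back to what was actually found.
import json

def plan_queries(question: str) -> list[dict]:
    """Placeholder for an LLM call that returns a JSON query plan."""
    return json.loads("""[
        {"source": "internal_docs", "query": "refund policy for digital goods"},
        {"source": "web_search",    "query": "EU consumer law digital refunds"}
    ]""")

def execute(step: dict) -> list[dict]:
    """Placeholder retriever: returns passages tagged with their origin."""
    return [{"text": f"result for {step['query']}", "source": step["source"]}]

def answer_with_citations(question: str) -> dict:
    evidence = []
    for step in plan_queries(question):
        evidence.extend(execute(step))
    # The answering model is then constrained to this evidence, and each
    # sentence can cite the 'source' field rather than relying on recall.
    return {"question": question, "evidence": evidence}

print(answer_with_citations("Can I get a refund on an e-book?"))
```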
Having done a lot of traditional search-related work over the past 20 years, I got really excited about RAG when I first read about it because I realized two things: most people don't actually know a lot, but they can learn how to find out (e.g. by Googling), and learning how to find stuff isn't actually that hard.
Most people that use Google don't have a clue how it works. LLMs are actually well equipped to come up with solid plans for finding stuff. They can program, they know about different sources of information and how to access them. They can actually pick apart documentation written for humans and use that to write programs, etc. In other words, giving LLMs better search, which is something I know a bit about, is going to enable them to give better, more balanced answers. We've seen nothing yet.
What I like about this is that it doesn't require a lot of mystical stuff from people who arguably barely understand the emergent properties of LLMs even today. It just requires more systems thinking. Smaller LLMs trained to search rather than to know might be better than a bloated know-it-all blob of neurons with the collective knowledge of the world compressed into it. The combination might be really good, of course: it would be able to hallucinate theories and then conduct the research needed to validate them.
One big problem is that we've built search for humans - more specifically, to advertise to them.
AI doesn't need a human search engine; it needs a "fact database" that can pull short factoids with a truth value, which could be a distribution based on human input. For example, you might have the factoid "Donald Trump incited insurrection on January 6th" with a score of 0.8 (out of 1) and a variance of 0.3, since people tend to either absolutely believe it or disbelieve it, with more people on the believing side.
Beyond that AI needs a "logical tools" database with short examples of their use that it can pull from for any given problem.
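A toy sketch of what such a factoid store might look like - the schema, thresholds, and example claims are all made up:

```python
# Short factoids stored with a belief score and a variance reflecting how
# contested they are, rather than full documents written for human readers.
from dataclasses import dataclass

@dataclass
class Factoid:
    claim: str
    score: float     # 0.0 = broadly disbelieved, 1.0 = broadly believed
    variance: float  # high variance = polarized / contested claim

FACTS = [
    Factoid("Water boils at 100 C at sea level", score=0.99, variance=0.01),
    Factoid("Coffee is healthier than tea",      score=0.55, variance=0.30),
]

def lookup(query: str, min_score: float = 0.8, max_variance: float = 0.1):
    """Return only factoids that are both well-supported and uncontested,
    so a downstream model can treat them as ground truth."""
    return [
        f for f in FACTS
        if query.lower() in f.claim.lower()
        and f.score >= min_score and f.variance <= max_variance
    ]

print(lookup("water boils"))
```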
Given that the GitHub account itself is valid, and that it has some other repositories related to ML, I suspect the link will be working "soon". It's likely a private repo while the paper goes through whatever steps the authors need before they can fully publish. I've seen this a lot with pre-print papers in this space, where the paper goes out before the code or other resources are published.
I do find myself reading papers often for my work, and I share the ones I find interesting or feel might have an impact on the future of my chosen domain. This is not an advertisement; I don't know the authors or anyone related to the paper.
My father was a PhD psychologist and family therapist. He was on the witness stand during a custody case explaining a theory of personality when the cross-examining lawyer said scornfully "I'll bet you got that out of some book." To which my dad replied: "Why yes, in fact. In my profession, in order to learn things, we often read books."