How much of that is just the flood of traditional engineers into the space and the fact that collecting data and then fine-tuning models is orders of magnitude more complex than just throwing in RAG? I suspect a huge amount of RAG's popularity is just that any engineer can do a version of it + ChatGPT API calls in a day.
As for LoRA - in the context of my comment, that's just splitting hairs IMO. It falls in the category of fine-tuning for me, although I understand why you might disagree. But it's not like the article mentions LoRA either, nor am I aware of people doing LoRA without GPUs, which the article is against (no GPUs before PMF).
I disagree. No amount of fine-tuning will ever give the LLM the relevant context with which to answer my question. Maybe if your corpus is static, like a Wikipedia snapshot that will never change, you can fine-tune on it. But if your data and docs keep changing, how is fine-tuning going to be better than RAG?
Luckily it's not one or the other. You can fine-tune and use RAG.
Sometimes RAG is enough. Sometimes fine-tuning on top of RAG is better. It depends on the use case. I can't think of any examples where you would want to fine-tune and not use RAG as well.
Sometimes you fine-tune a small model so it performs close to a larger variant on that specific narrow task, and you improve inference performance by serving the smaller model.
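For the LoRA case mentioned above, a minimal sketch of what that looks like with Hugging Face's peft library (the base model and hyperparameters here are placeholder choices, not recommendations):

```python
# Sketch: attach LoRA adapters to a small causal LM so only a tiny
# fraction of weights are trained. Model name and config values are
# illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

config = LoraConfig(
    r=8,                                   # low-rank dimension: small, cheap adapters
    lora_alpha=16,                         # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],   # attach to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of base weights
```

The point being that the trainable-parameter count is tiny, which is why people file LoRA under "fine-tuning but cheaper" rather than a separate category.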
Continuous retraining and deployment maybe? But I'm actually not anti-RAG (although I think it is overrated because the retrieval problem is still handled extremely naively), I just think that fine-tuning should also be in your toolkit.
Why is the retrieval part overrated? There isn't even a single way to retrieve. It could be a simple keyword search, a vector search, a combo of the two, or just retrieving a single doc and stuffing it in the context.
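For what it's worth, the "combo" option is often just a weighted merge of the two scores. A toy sketch, where `keyword_score` and `vector_score` are stand-ins for something like BM25 and embedding cosine similarity (not real implementations):

```python
# Toy hybrid retrieval: blend a keyword signal and a vector signal
# with a weighted sum, then take the top k documents.
from typing import Callable

def hybrid_retrieve(
    query: str,
    docs: list[str],
    keyword_score: Callable[[str, str], float],  # assumed: e.g. BM25
    vector_score: Callable[[str, str], float],   # assumed: e.g. embedding cosine sim
    alpha: float = 0.5,                          # weight between the two signals
    k: int = 5,
) -> list[str]:
    scored = [
        (alpha * keyword_score(query, d) + (1 - alpha) * vector_score(query, d), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]
```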
People will disagree, but my problem with retrieval is that every technique that is popular uses one-hop thinking - you retrieve information that is directly related to the prompt using old-school techniques (even though the embeddings are new, text similarity is old). LLMs are most powerful, IMO, at horizontal thinking. Building a prompt using one-hop narrow AI techniques and then feeding it into a powerful generally capable model is like building a drone but only letting it fly over streets that already exist - not worthless, but only using a fraction of the technology's power.
A concrete example is something like a tool for querying an internal company wiki and the query "tell me about the Backend Team's approach to sprint planning". Normal retrieval approaches will pull information directly related to that query. But what if there is no information about the backend team's practices? As a human, you would do multi-hop/horizontal information extraction - you would retrieve information about who makes up the backend team, then retrieve information about those people and their backgrounds/practices. You might have a hypothesis that people carry over their practices from previous experiences, so you look at their previous teams and those teams' practices. Then you would have the context necessary to give a good answer. I don't know of many people implementing RAG like that. And what I described is 100% possible for AI to do today.
Techniques that would get around this like iterative retrieval or retrieval-as-a-tool don't seem popular.
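Roughly what I mean by retrieval-as-a-tool, as a sketch - `llm` and `search` are hypothetical callables, and the "DONE" sentinel and hop limit are arbitrary choices:

```python
# Sketch of iterative (multi-hop) retrieval: the model decides what to
# look up next until it judges the context sufficient, then answers.
def multi_hop_answer(question: str, llm, search, max_hops: int = 4) -> str:
    context: list[str] = []
    for _ in range(max_hops):
        next_query = llm(
            f"Question: {question}\n"
            f"Context so far: {context}\n"
            "If you can answer, reply DONE. Otherwise, reply with the next "
            "search query that would fill the biggest gap in the context."
        )
        if next_query.strip() == "DONE":
            break
        context.extend(search(next_query))  # one hop: follow-up retrieval
    return llm(f"Question: {question}\nContext: {context}\nAnswer:")
```

In the wiki example above, hop one retrieves the team roster, hop two retrieves the members' previous teams, and so on - each query generated from what the previous hops turned up rather than from the original prompt alone.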
People can't do that because of cost. If every single query involved retrieving everything even remotely related to the query and passing it to OpenAI, it would get expensive very fast.
It's not a technical issue, it's a practicality issue IMO.
That's very true. Although it is feasible if you are well-resourced and make the investment to own the toolchain end-to-end. Serving costs can be quite low (relatively speaking) if you control everything. And you have to pick the correct problem where the cost is worthwhile.