
Results from the Enterprise RAG Challenge, comparing 110 experiments from 43 teams on building Retrieval-Augmented Generation (RAG) systems. The challenge required AI solutions to automatically answer 100 complex queries across 100 large annual reports (one was over 1000 pages), including cross-document reasoning.

Teams detailed their architectures, methods, and lessons learned - expand rows in the table for insights into each approach.

Feedback welcome!

PS: There are even a few fully-local solutions on the leaderboard. Hopefully in the third round we'll have even more.


Interesting read, thank you!

Do you use any special tools to manage all these separate databases, track performance and debug problems?


The folks over at StarbaseDB (https://starbasedb.com/) are working on building tools for sharded SQLite.

From the companies I've talked to, most developers using this architecture are building quick scripts to do this in-house. Both Turso and Durable Objects SQLite already see a surprising amount of usage that people don't talk about much publicly yet, so I suspect some of this tooling will start to be published in the next year.


This is an example of my prompt to o1 a few days ago. The first request produced refactoring suggestions. The second request (yes, I like the results, implement them) produced multiple files that just worked.

—- Take a look at this code from my multi-mode (a la vim or old terminal apps) block-based content editor.

I want to build on the keyboard interface and introduce a simple way to have simple commands with a small popup. E.g. after pressing "A" in "view" mode, show the user a popup that expects H, T, I, or V.

Or, after pressing "P" in view mode - show a small popup with a text input waiting for the permission role for the page.

Don't implement the changes, just think through how to extend existing code to make logic like that simple.

Remember, I like simple code, I don't like spaghetti code and many small classes/files.


I would recommend giving o1-preview a try for coding tasks like this one.

It is one level above Claude 3.5 Sonnet, which currently is the most popular tool among my peers.


I’m consulting multiple teams on shipping LLM-driven business automation. So far I have seen only one case where fine-tuning a model really paid off (and didn’t just blow up the RLHF calibration and cause wild hallucinations).

I would suggest avoiding training and looking into RAG systems, prompt engineering and the OpenAI API for a start.

You can do a small PoC quickly using something like LangChain or LlamaIndex. Their pipelines can ingest unstructured data in all file formats, which is good for getting a quick feel.
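For orientation, a PoC of that kind can be as small as the sketch below (LlamaIndex-style; the exact import paths depend on the library version, and the folder and question are placeholders):

    # Minimal PoC: ingest a folder of mixed-format files, ask one question.
    # Assumes a recent llama-index release and an OpenAI API key in the environment.
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("./company_docs").load_data()  # PDF, docx, md, ...
    index = VectorStoreIndex.from_documents(documents)               # default in-memory index

    query_engine = index.as_query_engine()
    print(query_engine.query("What is the refund policy for enterprise customers?"))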

Afterwards, if you encounter hallucinations in your tasks - throw the vector DB and embeddings into the trashcan (they are pulling junk information into the context and causing hallucinations). Replace embeddings with a RAG based on full text search and query expansion tuned to the nuances of your business.
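A rough sketch of what query expansion in front of FTS can look like; expand_query is a hypothetical helper that asks the LLM for keyword variants, and search() stands for whatever FTS backend you run:

    # Sketch: LLM-driven query expansion feeding a full-text search backend.
    # The model name, the domain wording and search() are placeholders.
    from openai import OpenAI

    client = OpenAI()

    def expand_query(user_question: str) -> list[str]:
        """Ask the LLM for a few short keyword queries tuned to the domain."""
        prompt = (
            "Rewrite the user question as 3-5 short keyword queries for a "
            "full-text search over our internal documents. One query per line.\n\n"
            f"Question: {user_question}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

    def retrieve(user_question: str, search) -> list[str]:
        """Run every expanded query against FTS and merge the hits."""
        hits = []
        for q in expand_query(user_question):
            hits.extend(search(q))          # search() = your Elastic/Postgres/SQLite FTS call
        return list(dict.fromkeys(hits))    # de-duplicate, keep order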

If there are any specific types of questions or requests that you need special handling for - add a lightweight router (request classifier) that will direct each user request to a dedicated prompt with dedicated data.
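One lightweight way to build such a router is a single classification call; the category names and model below are purely illustrative:

    # Sketch: request classifier ("agent router"). One cheap LLM call picks a category,
    # then the request goes to a dedicated prompt with dedicated data for that category.
    from openai import OpenAI

    client = OpenAI()

    CATEGORIES = ["invoice_questions", "contract_lookup", "general_faq"]  # illustrative

    def route(user_request: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Classify the request into exactly one category from this list: "
                    f"{', '.join(CATEGORIES)}. Reply with the category name only.\n\n"
                    f"Request: {user_request}"
                ),
            }],
        )
        label = resp.choices[0].message.content.strip()
        return label if label in CATEGORIES else "general_faq"  # safe fallback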

By that time you will probably have dropped most of the RAG stack, replacing it with a couple of prompt templates, a file-based knowledge base in Markdown and CSV, and a few helpers to pull relevant information into the context.

That’s how most working LLM-driven workflows end up (in my bubble). Maybe just with PostgreSQL and ES instead of a file-based knowledge base. But that’s an implementation detail.

Update: if you really want to try fine-tuning your own LLM - this article links to a Google Colab notebook for the latest Llama 3.1 8B: https://unsloth.ai/blog/llama3-1

It will not learn new things from your data, though. Might just pick up the style.


>throw out vector DB and embeddings into the trashcan (they are pulling junk information into the context and causing hallucinations)

Not sure why this would be true. In my experience, semantic search based on a vector index/embeddings pulls in more relevant information than a full-text keyword search. Maybe there is too broad a set of materials in your vector db, or the chunking strategy isn't good?


It might depend on the case.

My problem with similarity search is that it is unpredictable. It can sometimes miss really obvious matches or pull in completely irrelevant snippets. When that happens, it causes downstream hallucinations that are hard to fix.

My customers don’t tolerate hallucinations.

Query expansion with FTS search works more predictably for me. Especially if we factor in search scope reduction driven by the request classifier (“agent router”).


For sure it will depend on the use case: if you have fairly structured data or clear domain-specific terminology to rely on, there's probably no reason to use semantic search.

>Query expansion with FTS search works more predictably for me. Especially if we factor in search scope reduction driven by the request classifier (“agent router”)

You might be able to quantify this and gain some insight into why query expansion/FTS is working better by comparing the precision/recall with a vector db using some set of benchmark docs and queries.
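Something as simple as precision/recall at k over a hand-labelled set of queries already tells you a lot; a minimal sketch, with both retrievers passed in as plain functions:

    # Sketch: compare two retrievers (e.g. FTS+expansion vs. vector search) on labelled queries.
    def precision_recall_at_k(retrieve, benchmark, k=5):
        """benchmark: list of (query, set_of_relevant_fragment_ids)."""
        precisions, recalls = [], []
        for query, relevant in benchmark:
            retrieved = retrieve(query)[:k]
            hits = sum(1 for frag_id in retrieved if frag_id in relevant)
            precisions.append(hits / k)
            recalls.append(hits / len(relevant) if relevant else 0.0)
        n = len(benchmark)
        return sum(precisions) / n, sum(recalls) / n

    # p_fts, r_fts = precision_recall_at_k(fts_retriever, benchmark)
    # p_vec, r_vec = precision_recall_at_k(vector_retriever, benchmark)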


> For sure it will depend on the use case: if you have fairly structured data or clear domain-specific terminology to rely on

Indeed. This works only in a subset of business domains for me: search and assistants over an enterprise knowledge base (e.g. ~40k documents with 20GB of text) in logistics, supply chain, legal, fintech and medtech.

> You might be able to quantify this and gain some insight into why query expansion/FTS is working better by comparing the precision/recall with a vector db using some set of benchmark docs and queries.

Embeddings tend to miss a lot of nuances, plus they are just unpredictable when searching over large sets of text (e.g. 40k documents split into fragments), frequently pulling irrelevant texts ahead of the relevant ones. Context contamination leads to hallucinations in our cases.

However, with LLM-driven query expansion and FTS search I get controllable retrieval quality in business tasks. Plus, if an edge case shows up, it is fairly easy to explain and adjust the query expansion logic to cover the specific nuance.

This is the setup I'm happy with.


Don't the RAG approach and LangChain both require you to send the context data (i.e. your book data) in the prompt API call? How would you fit 20,000 books in that call?


It is impossible to fit all that information into the call.

The whole point of RAG is that we (somehow) retrieve only the relevant information and put it into the context to generate the answer.
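In a bare-bones form (search() stands for whatever retrieval you use; only the top few fragments ever reach the model):

    # Sketch: retrieve-then-generate. The corpus never goes into the call,
    # only a handful of retrieved fragments.
    from openai import OpenAI

    client = OpenAI()

    def answer(question: str, search, top_k: int = 5) -> str:
        fragments = search(question)[:top_k]     # retrieval step: a few relevant chunks
        context = "\n\n".join(fragments)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Answer using only this context:\n\n{context}\n\nQuestion: {question}",
            }],
        )
        return resp.choices[0].message.content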


Plus one for this approach; I’m trying to say broadly the same thing with my comment.

What are you using for full text search RAG in production?


It really depends on the setup that the dev/ops folks at the customer are most comfortable with. Both Elastic and PostgreSQL can be fine.

Personally, for small cases (e.g. under 50k documents and 20GB of text) I like to use SQLite FTS, linking text fragments to additional metadata (native or extracted). This way I can really narrow the search scope down to a few case-related documents in each conversation path.
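For illustration, a minimal version of that setup with SQLite FTS5; the schema and the doc_type filter are made up for the example, with metadata columns stored unindexed next to the text:

    # Sketch: SQLite FTS5 table with metadata for narrowing the search scope.
    # doc_type holds native or extracted metadata, e.g. 'contract' or 'invoice'.
    import sqlite3

    conn = sqlite3.connect("kb.db")
    conn.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS fragments USING fts5(
            body,
            doc_id UNINDEXED,
            doc_type UNINDEXED
        )
    """)
    # conn.execute("INSERT INTO fragments (body, doc_id, doc_type) VALUES (?, ?, ?)",
    #              (fragment_text, "contract_0042", "contract"))

    def search(query, doc_type=None, limit=10):
        """Full-text search, optionally narrowed to one document type."""
        sql, params = "SELECT body FROM fragments WHERE fragments MATCH ?", [query]
        if doc_type:
            sql += " AND doc_type = ?"
            params.append(doc_type)
        sql += " ORDER BY rank LIMIT ?"
        params.append(limit)
        return [row[0] for row in conn.execute(sql, params)]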

But ultimately the flavor of DB and FTS is just an implementation detail. Most of them will do just fine.

Edit: fixed grammar.


I’ve been building LLM-driven systems for customers for quite some time. We got tired of hallucinations from vector-based and hybrid RAGs last year, eventually arriving at an approach similar to yours.

It is even called Knowledge Mapping [1]. It works really well, and customers can understand it.

Probably the only difference from your approach is that we use distinct architectural patterns to map domain knowledge into bits of knowledge that LLMs will use to reason (Router, Knowledge Base, Search Scope, Workflows, Assistant, etc.).

My contacts are in the profile, if you want to bounce ideas!

[1] English article: https://www.trustbit.tech/en/wie-wir-mit-knowledge-maps-bess...


Most of the blame that LLMs get is because they are used wrong.

An LLM is a good information transformation engine. But if I supply it with messy or wrong information in the context, I’ll end up with hallucinations.

This is why most AI systems tend to produce junk. The LLM part is good, but the underlying system just feeds it noisy and meaningless data.


It all depends on the benchmark and the use case.

If we are talking about the ability of models to follow instructions and carry out concrete tasks (as in products or inside RAG systems), then Gemini Pro 1.5 is currently in eighth place in our benchmark.

Academic benchmarks, HF leaderboards or the LMSYS Chatbot Arena will have different numbers.


> It all depends on the benchmark and the use case.

That's why I have my own set of simple benchmarks that I'm not going to publish. Everybody can easily prepare such a set - in my case it is various programming tasks that should generate deterministic output. It is not easy to automatically assess the quality of code, but at the very least you can filter the results by invalid outputs. With a reasonably high number of tasks and enough complexity, this can be a fair estimator - provided that it's never published publicly.
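Concretely, the harness for this can be tiny. A sketch, where call_model() is a hypothetical wrapper around whatever API is being tested and is assumed to return plain code:

    # Sketch: private coding benchmark. Each task asks for a program whose output
    # can be checked exactly; crashes, timeouts and wrong output all count as failures.
    import subprocess, tempfile

    TASKS = [
        # (prompt, expected stdout) - the real set stays private
        ("Write a Python script that prints the sum of the integers from 1 to 100.", "5050"),
    ]

    def run_python(source: str) -> str:
        """Run generated code in a subprocess (use a proper sandbox in practice)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        return result.stdout.strip()

    def score(call_model) -> float:
        """call_model(prompt) -> code string; a placeholder for your own API wrapper."""
        passed = 0
        for prompt, expected in TASKS:
            try:
                if run_python(call_model(prompt)) == expected:
                    passed += 1
            except Exception:
                pass
        return passed / len(TASKS)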


Yep, exactly!

My approach is similar - closed source benchmarks with prompts and tests from real LLM-driven products (mostly around boring business automation and enterprise workflows).

Although it would be neat to upgrade the setup to work on synthetic data. That would at least make the benchmarks themselves publicly shareable (not just the results).


There are too many variables at play, unfortunately.

One can run local LLMs even on a Raspberry Pi, although it will be horribly slow.


Maybe it wouldn’t be an algorithm; maybe it would be a reporting site where you can review your experience, if there’s no way to calculate it.


The LocalLLaMA subreddit usually has some interesting benchmarks and reports.

Here is one example, testing performance of different GPUs and Macs with various flavours of Llama:

https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...


Yes, this can work. I’ve done that in a few cases.

In fact, if you split data preprocessing into small enough steps, they can also be run on weaker LLMs. It would take a lot more time, but it is doable.

