If we are talking about the ability of models to follow instructions and carry out concrete tasks (as in products or inside RAG systems), then Gemini Pro 1.5 is currently in eighth place in our benchmark.
Academic benchmarks, HF Leaderboards or LMSYS Chat arena will have different numbers.
> It all depends on the benchmark and the use case.
That's why I have my own set of simple benchmarks that I'm not going to publish. Anybody can easily prepare such a set - in my case it's a collection of programming tasks that should produce deterministic output. It's hard to automatically assess the quality of code, but at the very least you can filter out results with invalid outputs. With a reasonably large number of tasks of varying complexity, this can be a fair estimator - provided it's never published publicly.
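For illustration, here's a minimal sketch of that idea (the task format, the `ask_model` stub, and the validity checks are all made up for this comment, not my actual harness): run each prompt, throw away structurally invalid outputs, and compare the rest against a known deterministic answer.

```python
# Minimal private-benchmark sketch (hypothetical names, not a real tool):
# each task is a prompt whose correct answer is one deterministic string,
# and the score is the fraction of tasks that pass after filtering outputs
# that aren't even well-formed.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str                        # instruction given to the model
    expected: str                      # the one deterministic correct output
    is_valid: Callable[[str], bool]    # cheap structural check (e.g. "looks like JSON")

def score(tasks: list[Task], ask_model: Callable[[str], str]) -> float:
    """Return the pass rate: outputs that are valid AND exactly match the expected answer."""
    passed = 0
    for task in tasks:
        answer = ask_model(task.prompt).strip()
        if not task.is_valid(answer):  # filter out structurally invalid results
            continue
        if answer == task.expected:    # deterministic comparison, no grading model needed
            passed += 1
    return passed / len(tasks)

if __name__ == "__main__":
    tasks = [
        Task(
            prompt="Return only the JSON array of the first three squares.",
            expected="[1, 4, 9]",
            is_valid=lambda s: s.startswith("[") and s.endswith("]"),
        ),
    ]
    stub_model = lambda prompt: "[1, 4, 9]"   # stand-in for a real API call
    print(f"pass rate: {score(tasks, stub_model):.2f}")
```

Exact string matching is crude, but for tasks designed to have a single deterministic answer it sidesteps the need for a grading model entirely.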
My approach is similar - closed source benchmarks with prompts and tests from real LLM-driven products (mostly around boring business automation and enterprise workflows).
Although it would be neat to upgrade the setup to work on synthetic data. That would at least make the benchmarks themselves shareable publicly (not just the results).
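Something like the sketch below is what I have in mind by "synthetic" (just an assumption, not an existing setup): publish the task template and the generator, while the concrete prompts and expected answers are regenerated per run, so nothing that a model could have memorized ever needs to be shared.

```python
# Rough sketch of a synthetic task generator (hypothetical, for illustration only):
# the template is public, the concrete (prompt, expected) pairs are derived
# from a seed and can be regenerated fresh for every evaluation run.

import random

def make_task(rng: random.Random) -> tuple[str, str]:
    """Produce one (prompt, expected_output) pair from a parameterized template."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    prompt = f"Return only the product of {a} and {b} as a plain integer."
    return prompt, str(a * b)

if __name__ == "__main__":
    rng = random.Random(42)                      # fixed seed -> reproducible task set
    tasks = [make_task(rng) for _ in range(5)]
    for prompt, expected in tasks:
        print(prompt, "->", expected)
```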