If we are talking about the ability of models to follow instructions and carry out concrete tasks (as in products or inside RAG systems), then Gemini Pro 1.5 is currently in eighth place in our benchmark.
Academic benchmarks, HF Leaderboards or LMSYS Chat arena will have different numbers.
> It all depends on the benchmark and the use case.
That's why I have my own set of simple benchmarks that I'm not going to publish. Anybody can easily prepare such a set - in my case it's a collection of programming tasks that should produce deterministic output. It's hard to automatically assess the quality of code, but at the very least you can filter out results with invalid outputs. With a reasonably large number of tasks of varying complexity, this can be a fair estimator - provided it's never published publicly.
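For illustration, here's a minimal sketch of that idea (the task format, the `ask_model` stub, and the validity checks are all made up for this comment, not my actual harness): run each prompt, throw away structurally invalid outputs, and compare the rest against a known deterministic answer.

```python
# Minimal private-benchmark sketch (hypothetical names, not a real tool):
# each task is a prompt whose correct answer is one deterministic string,
# and the score is the fraction of tasks that pass after filtering outputs
# that aren't even well-formed.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str                        # instruction given to the model
    expected: str                      # the one deterministic correct output
    is_valid: Callable[[str], bool]    # cheap structural check (e.g. "looks like JSON")

def score(tasks: list[Task], ask_model: Callable[[str], str]) -> float:
    """Return the pass rate: outputs that are valid AND exactly match the expected answer."""
    passed = 0
    for task in tasks:
        answer = ask_model(task.prompt).strip()
        if not task.is_valid(answer):  # filter out structurally invalid results
            continue
        if answer == task.expected:    # deterministic comparison, no grading model needed
            passed += 1
    return passed / len(tasks)

if __name__ == "__main__":
    tasks = [
        Task(
            prompt="Return only the JSON array of the first three squares.",
            expected="[1, 4, 9]",
            is_valid=lambda s: s.startswith("[") and s.endswith("]"),
        ),
    ]
    stub_model = lambda prompt: "[1, 4, 9]"   # stand-in for a real API call
    print(f"pass rate: {score(tasks, stub_model):.2f}")
```

Exact string matching is crude, but for tasks designed to have a single deterministic answer it sidesteps the need for a grading model entirely.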
My approach is similar - closed source benchmarks with prompts and tests from real LLM-driven products (mostly around boring business automation and enterprise workflows).
Although it would be neat to upgrade the setup to work on synthetic data. That would at least make the benchmarks themselves shareable publicly (not just the results).
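Something like the sketch below is what I have in mind by "synthetic" (just an assumption, not an existing setup): publish the task template and the generator, while the concrete prompts and expected answers are regenerated per run, so nothing that a model could have memorized ever needs to be shared.

```python
# Rough sketch of a synthetic task generator (hypothetical, for illustration only):
# the template is public, the concrete (prompt, expected) pairs are derived
# from a seed and can be regenerated fresh for every evaluation run.

import random

def make_task(rng: random.Random) -> tuple[str, str]:
    """Produce one (prompt, expected_output) pair from a parameterized template."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    prompt = f"Return only the product of {a} and {b} as a plain integer."
    return prompt, str(a * b)

if __name__ == "__main__":
    rng = random.Random(42)                      # fixed seed -> reproducible task set
    tasks = [make_task(rng) for _ in range(5)]
    for prompt, expected in tasks:
        print(prompt, "->", expected)
```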