I think this misses the mark. We know LLMs can learn facts; there are plenty of other benchmarks full of them, and I don't expect that saturating this benchmark will mean we have AGI.
The missing capabilities of LLMs lean more toward long-running tasks, consistency, and working around a lot of tokenization and attention weirdness.
I started a company that makes evals though, so I may be biased.