I think this misses the mark. We know LLMs can learn facts; there are plenty of other benchmarks full of them, and I don't expect that saturating this benchmark will mean we have AGI.
The missing capabilities of LLMs lean more toward long-running tasks, consistency, and working around a lot of tokenization and attention weirdness.
I started a company that makes evals though, so I may be biased.