Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored around 5%, despite labs claiming silver/gold-level performance at the IMO. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and then train the model on those.
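To make the ARC-AGI point concrete, here is a rough sketch of what "generating masses of synthetic examples" could look like. It is not any lab's actual pipeline: the rule set (flips, transpose, recolouring), the grid sizes, and every function name here are my own assumptions for illustration. The only thing borrowed from ARC is the public task layout of `train`/`test` lists of `input`/`output` grids with colours 0-9.

```python
import random

# Hypothetical rule library: each rule maps a grid (list of rows of ints 0-9)
# to a new grid. Real ARC rules are far more varied; these are stand-ins.
def flip_horizontal(grid):
    return [list(reversed(row)) for row in grid]

def flip_vertical(grid):
    return [list(row) for row in reversed(grid)]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def make_recolor(rng):
    # Build a random permutation of the 10 colours and return a rule applying it.
    colours = list(range(10))
    shuffled = colours[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(colours, shuffled))

    def recolor(grid):
        return [[mapping[c] for c in row] for row in grid]

    return recolor

def random_grid(rng, max_side=6):
    # Random grid with sides between 2 and max_side (an assumed size range).
    h = rng.randint(2, max_side)
    w = rng.randint(2, max_side)
    return [[rng.randint(0, 9) for _ in range(w)] for _ in range(h)]

def make_task(rng, n_train=3):
    # Pick one or two rules and compose them into this task's hidden rule.
    rules = [flip_horizontal, flip_vertical, transpose, make_recolor(rng)]
    chosen = rng.sample(rules, k=rng.randint(1, 2))

    def apply_rules(grid):
        for rule in chosen:
            grid = rule(grid)
        return grid

    grids = [random_grid(rng) for _ in range(n_train + 1)]
    pairs = [{"input": g, "output": apply_rules(g)} for g in grids]
    return {"train": pairs[:n_train], "test": pairs[n_train:]}

def make_dataset(n_tasks, seed=0):
    # Seeded so the same call always yields the same synthetic tasks.
    rng = random.Random(seed)
    return [make_task(rng) for _ in range(n_tasks)]

if __name__ == "__main__":
    tasks = make_dataset(n_tasks=1000)
    print(len(tasks), "synthetic tasks generated")
```

The point is how cheap this is: a few dozen lines give you an unlimited supply of tasks that share the benchmark's format and flavour, so a high score after training on them says little about general reasoning.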