
Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claims of silver/gold-level performance on the IMO. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and then train the model on them.

