That was a pretty deeply flawed paper; one of the largest drops it recorded turned out to be caused by simple parsing errors in their test harness:

https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...

Overall, evals and pinning against dated checkpoints are how you avoid those worries. But in general, if you solve a problem robustly, it's rare for changes in the LLM to suddenly break what you're doing, and investing in handling a wide range of inputs gracefully also pays off when the underlying model changes.
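
A rough sketch of what that looks like in practice, assuming the official OpenAI Python client; the golden cases, pinned model name, and pass threshold here are hypothetical placeholders:

    # Tiny regression eval run against a pinned model checkpoint.
    from openai import OpenAI

    client = OpenAI()

    # Pin a dated checkpoint instead of a floating alias like "gpt-4",
    # so an upstream model swap can't silently change behavior.
    PINNED_MODEL = "gpt-4-0613"

    # Golden cases: (prompt, predicate over the response text).
    EVAL_CASES = [
        ("Is 97 a prime number? Answer yes or no.",
         lambda out: "yes" in out.lower()),
        ("Reply with exactly the word PONG.",
         lambda out: out.strip() == "PONG"),
    ]

    def pass_rate() -> float:
        passed = 0
        for prompt, check in EVAL_CASES:
            resp = client.chat.completions.create(
                model=PINNED_MODEL,
                temperature=0,  # keep runs comparable
                messages=[{"role": "user", "content": prompt}],
            )
            passed += check(resp.choices[0].message.content or "")
        return passed / len(EVAL_CASES)

    if __name__ == "__main__":
        rate = pass_rate()
        print(f"pass rate: {rate:.0%}")
        # Re-run this suite before adopting a new checkpoint; only
        # move the pin forward if the score holds.
        assert rate >= 0.9, "regression vs. pinned baseline"

Running the same suite against a candidate checkpoint before updating the pin turns "did the model change under us?" into a concrete, measurable diff.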


