That was a pretty deeply flawed paper; one of the largest drops it recorded turned out to be caused by simple parsing errors in their test harness:

https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...

Overall, evals and pinning against dated checkpoints are how you avoid those worries. But in general, if you solve a problem robustly, it's rare for changes in the LLM to suddenly break what you're doing, and investing in handling a wide range of inputs gracefully also pays off when the underlying model changes.
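
A rough sketch of what that looks like in practice, assuming the official OpenAI Python client; the golden cases, pinned model name, and pass threshold here are hypothetical placeholders:

    # Tiny regression eval run against a pinned model checkpoint.
    from openai import OpenAI

    client = OpenAI()

    # Pin a dated checkpoint instead of a floating alias like "gpt-4",
    # so an upstream model swap can't silently change behavior.
    PINNED_MODEL = "gpt-4-0613"

    # Golden cases: (prompt, predicate over the response text).
    EVAL_CASES = [
        ("Is 97 a prime number? Answer yes or no.",
         lambda out: "yes" in out.lower()),
        ("Reply with exactly the word PONG.",
         lambda out: out.strip() == "PONG"),
    ]

    def pass_rate() -> float:
        passed = 0
        for prompt, check in EVAL_CASES:
            resp = client.chat.completions.create(
                model=PINNED_MODEL,
                temperature=0,  # keep runs comparable
                messages=[{"role": "user", "content": prompt}],
            )
            passed += check(resp.choices[0].message.content or "")
        return passed / len(EVAL_CASES)

    if __name__ == "__main__":
        rate = pass_rate()
        print(f"pass rate: {rate:.0%}")
        # Re-run this suite before adopting a new checkpoint; only
        # move the pin forward if the score holds.
        assert rate >= 0.9, "regression vs. pinned baseline"

Running the same suite against a candidate checkpoint before updating the pin turns "did the model change under us?" into a concrete, measurable diff.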


