
The ‘with evidence’ part is key, as simonw said. One anecdote from evals at Cleric: it’s rare to see a new model do better on our evals than the current one. The reality is that you’ll have optimized your prompts etc. for the current model.

Instead, if a new model does only marginally worse, despite running on prompts tuned for the old one, that’s a strong signal that the new model is indeed better for our use case.
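
A rough sketch of that heuristic in Python, purely for illustration. The function names, the boolean pass/fail results, and the 0.05 margin are all assumptions made up for this example, not how Cleric's evals actually work:

```python
from statistics import mean


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    return mean(1.0 if passed else 0.0 for passed in results)


def worth_retuning(
    current_results: list[bool],
    new_results: list[bool],
    margin: float = 0.05,  # hypothetical tolerance; choose one that fits your evals
) -> bool:
    """Flag a new model as promising if it is at most `margin` worse than the
    current model on evals whose prompts were tuned for the current model."""
    return pass_rate(new_results) >= pass_rate(current_results) - margin
```

A model that passes this check would still need its own round of prompt tuning before the two scores are comparable.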


