> This is a reminder that benchmarks are meaningless – you should always curate ...

Archer6621 · 2025-11-19T07:25:17 1763537117

I wonder whether it could be related to some kind of over-fitting, i.e. a prompting style that tends to work better with the older models, but performs worse with the newer ones.

adastra22 · 2025-11-18T16:57:27 1763485047

I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.

Iulioh · 2025-11-18T16:22:28 1763482948

A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models...

GPT4/3o might be the best we will ever have