> The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws.
Just based on the comparisons linked in the article, it's not "co-state-of-the-art"; it's the clear leader. You might argue those numbers are wrong or unrepresentative, but you can't accept them and then claim it isn't outperforming existing models.
The leader, perhaps, but not by a large margin, and only on this sample of benchmarks. "Co-state-of-the-art" is the term the article itself uses, and I'm taking that at face value.
The deltas between the other models are mostly not significant either; they're all roughly equally good. There's no categorical difference between GPT-4 and Claude 3.5.