I find it somewhat interesting that there's a common perception that GPT-4 at release was genuinely smart, but that it was gradually nerfed for speed with turbo, which is better tuned but doesn't exhibit the same intelligence as the original.
There were times when I felt that too, but nowadays I predominantly use turbo. That's probably because turbo is faster and cheaper, but on lmsys turbo is rated about 100 Elo higher than the original, so by and large people simply find turbo to be... better?
Nevertheless, I do wonder whether, not just in benchmarks but in how people actually use LLMs, intelligence is somewhat underutilised, or possibly offset by other qualities.
Given how incremental the improvement on MMLU is between GPT-4 and its turbo variant, I would weight "vibes" more heavily than that metric. OpenAI isn't exactly a very honest or transparent company, and the metric is imperfect. As a longtime user of ChatGPT, I observed it got markedly worse at coding after the turbo release, specifically in its refusal to complete code as specified.
Have you tried Claude 3 Opus? I've been using that predominantly since release and find its "smarts" on par with or better than my experience with GPT-4 (pre-turbo).
I did. It definitely exudes more all-around personality. Unfortunately, in my private test suite (mostly coding), it did somewhat worse than turbo or phind 70b.
Since price factors into my calculus, I can't say this for sure, but it seems being slightly smarter isn't much of an edge, because it's still dumb by human standards. For most non-coding use (like summarisation), the extra smarts don't make much difference; I find that cheaper options like mistral-large do just as well as Opus.
In the last month I have used Command R+ more and more. Finally had an excuse to write some function calling stuff. I have also been highly impressed by Gemini Pro 1.5 finding technical answers in a dense 650-page PDF manual. And I have enjoyed chatting with the WizardLM2 fine-tune for the past few days.
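For reference, here's a minimal sketch of what that kind of tool/function-calling setup looks like with the Cohere Python SDK (at least the version I was using); the tool name and parameters below are placeholders I'm making up for illustration:

```python
import cohere

co = cohere.Client(api_key="YOUR_API_KEY")

# Placeholder tool definition -- the name and parameters are invented for illustration.
tools = [
    {
        "name": "lookup_order_status",
        "description": "Look up the shipping status of an order by its ID.",
        "parameter_definitions": {
            "order_id": {
                "description": "The order ID to look up.",
                "type": "str",
                "required": True,
            }
        },
    }
]

response = co.chat(
    model="command-r-plus",
    message="Where is order 12345?",
    tools=tools,
)

# If the model decides a tool is needed, it returns structured tool calls
# rather than just plain text.
for call in response.tool_calls or []:
    print(call.name, call.parameters)
```

You then run the tool yourself and feed the outputs back in a follow-up chat call so the model can write the final answer.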
Somehow I haven't quite found a consistent use case for Opus.
I think it might just be subjective feelings (GPT-4-turbo seeming dumber) - the joy is always strongest when you first taste it, and it decays as you get used to it and the bar keeps rising.