> OpenAI's flagship models are not even correct 50% of the time[1]
You're reading the link wrong. They specifically picked questions that one or more models failed at, so the score isn't representative of how often the models are wrong in general.
From the paper:
> At least one of the four completions must be incorrect for the trainer to continue with that question; otherwise, the trainer was instructed to create a new question.