> OpenAI's flagship models are not even correct 50% of the time[1]

You're reading the link wrong. They specifically picked questions that at least one model completion got wrong, so the score isn't representative of how often the models are wrong in general.

From the paper:

> At least one of the four completions must be incorrect for the trainer to continue with that question; otherwise, the trainer was instructed to create a new question.
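To see why that filter drags the score down, here's a rough sketch. It assumes (my simplification, not the paper's) that every completion is independently correct with the same probability `p_correct`; the function name and the numbers are made up for illustration.

```python
def accuracy_after_filter(p_correct: float, n_completions: int = 4) -> float:
    """Accuracy of a single completion on the questions that survive the filter.

    Hypothetical model: each of n_completions is independently correct with
    probability p_correct, and a question is kept only if at least one
    completion is incorrect. Requires p_correct < 1.
    """
    p_all_correct = p_correct ** n_completions
    p_kept = 1.0 - p_all_correct  # chance the question passes the filter
    # P(this completion is correct AND the question is kept) = p - p^n
    return (p_correct - p_all_correct) / p_kept


# Illustrative unfiltered accuracies, not measured values.
for p in (0.5, 0.7, 0.9):
    print(f"unfiltered: {p:.2f}  after filter: {accuracy_after_filter(p):.2f}")
```

Even in this toy setup the measured accuracy on the kept questions is always below the unfiltered one. Real questions vary in difficulty, which would likely widen the gap, since the filter preferentially keeps the hard ones.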