Hacker News

These results are extremely impressive and encouraging, but also remember:

> Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors).

That's a quote from this announcement. As these models become more capable, it will be increasingly important to understand when and how they fail. Right now we have very little insight into that; failures feel more or less random. That won't fly once these models are asked to do actually important things. And as their output improves, we'll undoubtedly be tempted to use them for exactly those things.

