LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.