This isn't demonstrated yet, I would say. A good analogy is how people have used neural networks to generate Doom levels, but when they do, the levels don't have offscreen coherence or object permanence. There's no internal engine behind the scenes maintaining an actual Doom level; there's just a mechanism that generates things which look like the outputs of that engine. In the same way, an LLM might well be an empty shell that's good at generating outputs based on the similar-looking outputs it was trained on, rather than something that can do the actual work of thinking a problem through. I know that's close to the "stochastic parrot" argument, but I don't think what you're describing demonstrates anything more than that.
It can be trivially demonstrated: pose a unique problem that doesn't exist in the training data, where the correct answer has a low probability of being arrived at without reasoning, and check whether the model gets it right.
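For what it's worth, a minimal sketch of that probe might look like the following. The puzzle generator and the ground-truth check are self-contained; query_model is a hypothetical placeholder for whatever LLM client you'd actually call. Because the numbers are freshly randomized, the exact instance is vanishingly unlikely to appear verbatim in any training corpus, and a blind guess is right only about one time in five.

```python
import random

def make_problem(seed: int):
    """Build a small scheduling puzzle with one verifiable numeric answer."""
    rng = random.Random(seed)
    names = rng.sample(["Avi", "Bea", "Cyn", "Del", "Eko", "Fay", "Gus", "Hal"], 5)
    starts = sorted(rng.sample(range(1, 60), 5))
    durations = [rng.randint(2, 9) for _ in range(5)]
    last_start = starts[-1]
    # Ground truth: how many earlier meetings are still running at the last start time?
    truth = sum(1 for s, d in zip(starts[:-1], durations[:-1]) if s + d > last_start)
    prompt = (
        "Meetings (start minute, length in minutes): "
        + "; ".join(f"{n} at {s} for {d}" for n, s, d in zip(names, starts, durations))
        + f". How many other meetings are still in progress when {names[-1]}'s meeting starts? "
        "Answer with a single integer."
    )
    return prompt, truth

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call -- swap in your client of choice."""
    raise NotImplementedError

def run_probe(trials: int = 20) -> float:
    """Fraction of freshly generated puzzles the model answers correctly."""
    correct = 0
    for seed in range(trials):
        prompt, truth = make_problem(seed)
        if query_model(prompt).strip() == str(truth):
            correct += 1
    return correct / trials  # blind guessing lands around 1 in 5 here (answers range 0-4)
```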