Not only that, but LLMs "think" in a latent representation that is many layers deep. Sure, the first and last layers make it look like the model is doing token wrangling, but what happens in the middle layers is largely a mystery. The first layer deals directly with tokens because that is the data we observe (a "shadow" of the world), and the last layer also deals with tokens because we want to understand what the network is "thinking"; it is a human-specific lossy decoder (we can and do remove that translator and plug the latent representations into other networks to train them in tandem). There is no reason to believe that the other layers are "thinking in language".
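A minimal sketch of what I mean, using the Hugging Face transformers library and the public gpt2 checkpoint (names and the example sentence are just illustrative). Only the two ends of the model touch tokens; the intermediate hidden states are continuous vectors with no direct token meaning:

    # Assumes: pip install torch transformers
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The cat sat on the", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # out.hidden_states is a tuple: the embedding output plus one tensor per
    # layer. Each entry is shaped (batch, seq_len, hidden_size) -- continuous
    # vectors, not token ids.
    middle = out.hidden_states[len(out.hidden_states) // 2]
    print(middle.shape)   # e.g. torch.Size([1, 5, 768]) for gpt2

    # Only at the very end are those vectors projected back onto the
    # vocabulary -- the "lossy decoder" step that turns latents into tokens.
    logits = out.logits   # shape (batch, seq_len, vocab_size)
    print(tokenizer.decode(logits[0, -1].argmax()))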

