Have you considered that numeric characters are so predictable that they can be sorted without any understanding of their numerical value?
I mean that maybe a network trained by gradient descent is a passable sorting algorithm once its weights have learned to describe ordering. Sorting things well may simply be a speciality of transformers, which wouldn’t tell us that much about whether they are mentalists or not.
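
To make the point about numeric characters concrete: for single digits, ordering by character code already matches numerical order, so a sorter never has to know what the digits mean. A rough Python sketch (the sample lists are made up for illustration, and this says nothing about what a transformer does internally):

```python
# Sort digit characters purely by character code, never converting to a number.
digits = list("3815207946")              # arbitrary sample, chosen for illustration
by_code_point = sorted(digits)           # compares characters, not values
by_value = sorted(digits, key=int)       # compares numerical values

print(by_code_point == by_value)         # True: the two orderings agree for single digits

# The shortcut stops working once the tokens are multi-digit strings.
numbers = ["10", "9", "2"]
print(sorted(numbers))                   # ['10', '2', '9']
print(sorted(numbers, key=int))          # ['2', '9', '10']
```

So a surface-level ordering rule is enough for single digits, while multi-digit numbers are where some notion of value starts to matter.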