It might seem like you could sort with just pairwise correlations, but on closer analysis, you cannot. Generating the next correct token requires correctly weighing the entire context window.
I mean that needing to scan the full context of tokens before the nth is inherent to the problem of sorting. Transformers do scan that input, which is good; it's not surprising that they're up to the task. But pairwise numeral correlations will not do the job.
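To make the point concrete, here is a minimal sketch of what "the next correct token" means when sorting. The helper name `next_sorted_token` and the use of small integers as tokens are assumptions for illustration, not a description of what a transformer computes internally.

```python
from collections import Counter

def next_sorted_token(unsorted, emitted):
    """Return the next token of the sorted output.

    `unsorted` is the full input list; `emitted` is the sorted prefix
    produced so far. The answer is the minimum of everything not yet
    emitted, so every input position can affect it.
    """
    # Multiset difference: what is still waiting to be output.
    remaining = Counter(unsorted) - Counter(emitted)
    return min(remaining.elements())

# Changing a single input element changes the next token,
# wherever that element sits in the context.
print(next_sorted_token([5, 3, 8, 1], [1]))  # 3
print(next_sorted_token([5, 2, 8, 1], [1]))  # 2
```

No single pair of numerals fixes the result here: the minimum is only known once every remaining element has been looked at.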
As for avoiding certain cases, that could be done to some extent. But remember that the untrained transformer has no preconception of numbers or ordering (it doesn't use the hardware ALU or an integer data type), so there has to be enough data in the training set to learn 0<1<2<3<4<5<6, etc.
> there has to be enough data in the training set to learn 0<1<2<3<4<5<6
This is the kind of thing I’d want it to generalize.
If I avoid having 2 and 6 in the same unsorted list in the training set, will lists containing both of those numbers be sorted correctly in the test set, and at the same rate as other lists?
My intuition is that, yes, it would. But it’d be nice to see and would be a clear demonstration of the ability to generalize at all.
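A rough sketch of how that split could be built, assuming digit tokens 0 to 9 and fixed-length lists. The function names, list length, and `control_frac` value are all hypothetical choices for illustration; the model and training loop are left out.

```python
import random

HELD_OUT_PAIR = (2, 6)  # the pair that never co-occurs in training

def contains_pair(xs, pair=HELD_OUT_PAIR):
    """True if both numbers of the pair appear in the list."""
    return pair[0] in xs and pair[1] in xs

def make_split(n_lists=2000, length=5, vocab_size=10, control_frac=0.15, seed=0):
    """Build (train, held_out, control) sets of (unsorted, sorted) examples.

    held_out: every list containing both 2 and 6.
    control:  a random sample of the other lists, also kept out of training,
              so the two test accuracies can be compared.
    """
    rng = random.Random(seed)
    seen = set()  # skip duplicates so no test list also appears in train
    train, held_out, control = [], [], []
    for _ in range(n_lists):
        xs = [rng.randrange(vocab_size) for _ in range(length)]
        if tuple(xs) in seen:
            continue
        seen.add(tuple(xs))
        example = (xs, sorted(xs))
        if contains_pair(xs):
            held_out.append(example)
        elif rng.random() < control_frac:
            control.append(example)
        else:
            train.append(example)
    return train, held_out, control

train, held_out, control = make_split()
print(len(train), len(held_out), len(control))
```

If accuracy on `held_out` matches accuracy on `control`, the model has ordered 2 and 6 relative to each other without ever seeing them in the same list, presumably via what it learned about each against the other numbers.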