I'd be interested in hearing people's takes on the simplest mathematical reason that transformers are better than/different from fully connected layers. My take is:
Q = W_Q X
K = W_K X
A = Q^T K = (X^T W_Q^T) (W_K X) = X^T (W_Q^T W_K) X
Where A is the matrix of pre-softmax, unmasked attention weights. Therefore, transformers effectively give you a learned autocorrelation across the column vectors (tokens) of the input matrix X: every entry of A scores a pair of tokens under the single fixed bilinear form W_Q^T W_K. Unlike a fully connected layer, whose mixing weights are fixed after training, the token-mixing matrix here depends on the input itself, since A is quadratic in X. Of course, this doesn't really say why autocorrelation would be so much better than anything else.
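
For concreteness, here's a minimal NumPy sketch (the dimensions and random weights are made up purely for illustration) checking that the pre-softmax logits Q^T K really do reduce to the bilinear form X^T (W_Q^T W_K) X:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_head, n_tokens = 8, 4, 5   # illustrative sizes only

    X = rng.standard_normal((d_model, n_tokens))    # columns are token embeddings
    W_Q = rng.standard_normal((d_head, d_model))
    W_K = rng.standard_normal((d_head, d_model))

    Q = W_Q @ X    # (d_head, n_tokens)
    K = W_K @ X    # (d_head, n_tokens)

    A_qk = Q.T @ K                        # pre-softmax attention logits
    A_bilinear = X.T @ W_Q.T @ W_K @ X    # same thing, written as X^T (W_Q^T W_K) X

    assert np.allclose(A_qk, A_bilinear)
    print(A_qk.shape)    # (n_tokens, n_tokens): one score per token pair

The point the identity makes is that A[i, j] compares token i against token j through one shared d_model x d_model matrix, so the same pairwise scoring rule is reused across every pair of positions.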
It’s a perception problem, as are most things on the edge of mathematics and computing. Displays are built to be visible to human eyes, and data is structured to be perceivable to our minds… often we never see the “math” a program does to produce the GUI or output we interact with.
Sounds interesting, but I'm really asking a technical question here rather than a philosophical one. Your comment seems a bit more high-level than what I'm going for.