The attention mechanism, the core of LLMs, is universal enough to be brought back into standard vision models. Which is kind of ironic, since vision was long dominated by convolutions, and the transformer has been dubbed "convolution for text".
The real reason it transfers is that attention doesn't degrade with input length in text, or with spatial distance in vision: every token, or every image patch, can attend directly to every other one. It's just a universal new building block, and the global receptive field it gives a single layer lets shallower networks perform more like their deeper counterparts.
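To make the "doesn't degrade with distance" point concrete, here is a minimal sketch of scaled dot-product self-attention applied to flattened image patches. All names, shapes, and the patch setup are illustrative assumptions, not taken from any particular model:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (num_tokens, dim) -- text tokens or flattened image patches alike.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token attends to every other token in one step: attention
    # scores don't decay with sequence position or spatial neighbourhood.
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Hypothetical setup: a 32x32 RGB image split into 64 patches of 4x4x3
# pixels, each flattened into a 48-dimensional vector.
dim = 48
patches = torch.randn(64, dim)  # stand-in for embedded patches
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))

out = self_attention(patches, w_q, w_k, w_v)
print(out.shape)  # torch.Size([64, 48])
```

Note that the function body never mentions images or text: the same operation covers both, which is exactly what lets the block move between domains.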