Aren't transformers Markov chains, mathematically? You couldn't practically implement them as naive Markov chains, since the state space is far too big, but the math is the same, isn't it?
Yes, in the sense that they can be formalised as such. Given a context of n tokens t1..tn, a transformer computes, for every token t', the probability p(t') of it being the next token. Each of these probabilities is an arc from the state t1..tn to the state t1..tn,t', weighted by p(t').
And since the context window is bounded, the chain is guaranteed to be finite-state. Every state with an unsaturated window is transient, because each step lengthens the context and the state can never be revisited; once the window is saturated it slides, so only the saturated states can recur.
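The construction above can be sketched concretely. This is a toy illustration, not a real transformer: the vocabulary, window size, and the uniform `next_token_probs` stub are all placeholders standing in for an actual model, chosen small enough that the full state space can be enumerated.

```python
from itertools import product

# Toy vocabulary and context window; a real transformer has a vocabulary
# of ~1e5 tokens and a much longer window, so the state space |V|^n is
# astronomically large -- which is why the chain is only a formalisation.
VOCAB = ("a", "b")
WINDOW = 2  # context window size n

def next_token_probs(state):
    """Stand-in for the transformer: maps a context (tuple of tokens) to
    a distribution over the next token. Uniform here for illustration."""
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def step(state, token):
    """One Markov transition: append the token, and once the window is
    saturated, drop the oldest token (the window slides)."""
    return (state + (token,))[-WINDOW:]

# The finite state space: all contexts of length 0..WINDOW.
states = [s for k in range(WINDOW + 1) for s in product(VOCAB, repeat=k)]
print(len(states))  # 1 + 2 + 4 = 7 states

# Unsaturated states are transient: each transition strictly lengthens
# the context, so such a state can never be re-entered.
assert all(len(step(s, "a")) == len(s) + 1 for s in states if len(s) < WINDOW)
# Saturated states map to saturated states; they form the recurrent part.
assert all(len(step(s, "a")) == WINDOW for s in states if len(s) == WINDOW)
```

Swapping `next_token_probs` for a real model's softmax output would give the actual arc weights; nothing else in the chain's structure changes.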