Yes, in the sense that they can be formalised as such. Given n tokens t1..tn, a ...

Yes, in the sense that they can be formalised as such. Given n tokens t1..tn, a transformer computes the probability p(t') that every token t' has of being the next. Each of these probabilities are arcs from a state t1..tn to another state t1..tn,t' weighted by p(t').

And since the context window is bounded, this is guaranteed to be finite-state, and all states are transient except those where the window is saturated.