What is the conceptual difference between a "past state" and a "present state"? In a typical conversation with an LLM, you have past messages that remain in the KV cache when generating the next token (so as not to recalculate everything). Yet here, everything that came before is reinterpreted as part of the present state. In other words, it seems as if you can take any past state and fold it into one large concatenated present state - and, by that logic, call basically anything a Markov chain? I can't wrap my head around it. Can you give an example of a non-Markov system?
The difference is arbitrary but fixed. For any given Markov model, there are inputs that can fully fit in the present state and inputs that can't. But for any given input, there are models that can encode it entirely in the present state.
Markov models are useful for understanding the limitations of any specific next-token predictor with a fixed context window. But not for the limitations of such predictors in general, as there is always a Markov model large enough to handle any specific input.
> In other words, it seems as if you can take any past state and fold it into one large concatenated present state
That's what n-grams are: even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3- to 7-grams.
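For concreteness, here's a minimal word-level 3-gram sketch in Python (the corpus is made up): the "present state" is the last two words, not just the most recent one.

    import random
    from collections import defaultdict

    # Build a 3-gram model: state = the last two words, value = possible next words.
    corpus = "the cat sat on the mat and the cat ran".split()
    transitions = defaultdict(list)
    for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
        transitions[(a, b)].append(c)

    state = ("the", "cat")
    output = list(state)
    for _ in range(6):
        choices = transitions.get(state)
        if not choices:
            break  # no continuation seen for this state
        nxt = random.choice(choices)
        output.append(nxt)
        state = (state[1], nxt)  # slide the two-word window forward

    print(" ".join(output))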
> Can you give an example of a non-Markov system?
Encoder-decoder LLMs violate the Markov Property and would not count as Markov Chains.
If you include the encoder outputs as part of the state, then encoder-decoder LLMs are Markovian as well, while in token space, decoder-only LLMs are not Markovian. Anything can be a Markov process depending on what state you include. Humans, or even the universe itself, are Markovian. I don't see what insight about LLMs you and other commenters are gesturing at.
A common distinction is that whatever you take to be your state space should be fixed-sized. So an LLM has a fixed context window, which might be extremely large, but with a long enough conversation, the oldest tokens will fall out of the context window, and your process will be non-Markovian. But with a large context size, most conversations will be small enough that they fit entirely within the context, and then the process is Markovian.
So, people can correctly call LLMs Markovian in practice, and also non-Markovian from a theoretical standpoint.
I think of it as conceptually a little bit like the distinction between a formal Turing machine, which requires an infinite tape, and a practical computer with a finite amount of memory. Your computer acts as a Turing machine for the real computations you use it for, but there exist some computations that would require more memory than you have. From a theoretical standpoint, your computer is merely a finite state automaton.
Sorry, I realized I didn't quite write what I meant to. I didn't intend to say that LLMs are non-Markovian from a theoretical standpoint. I meant to say that the language generation task is non-Markovian from a theoretical standpoint, because the next word can depend on arbitrarily distant history.
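A toy illustration of that last point (the window size and the tokens are made up): once the distinguishing tokens fall out of the context window, two histories that require different continuations become the exact same model input.

    # Toy illustration: with a window of n=4 tokens, the histories below
    # truncate to identical states, even though the correct continuations differ.
    n = 4
    history_a = ["alice", "is", "guilty", ".", "who", "is", "guilty", "?"]
    history_b = ["bob",   "is", "guilty", ".", "who", "is", "guilty", "?"]

    assert history_a[-n:] == history_b[-n:]  # identical truncated states
    print(history_a[-n:])  # ['who', 'is', 'guilty', '?'] -- the dropped prefix is gone
    # Any llm(history[-n:]) must answer both the same way: the truncated process is
    # Markovian over the window, but the language task itself depends on the dropped prefix.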
> In other words, it seems as if you can take any past state and fold it into one large concatenated present state - and, by that logic, call basically anything a Markov chain?
Essentially yes. Given a complex enough state, more or less any process is Markovian. Some do require infinite state, which would maybe be stretching the definition a bit. In practice, a Markov formulation may not be a very good analysis perspective if the required state would be very large or complex.
Hmm... For example, say you have a machine that must decide whether a monetary transaction is allowable or not. The state at any given time could be the balances of all accounts, plus some metadata about past states (but not the past states themselves). Then if A has $5 in their account and tries to transfer $5 to B, the contents of the metadata would be the deciding factor of whether the transaction goes through or not. The machine still only operates on the present state, but the present state depends indirectly on past states.
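A minimal sketch of that machine in Python (the "flagged" metadata rule is invented purely for illustration): the decision uses only the present state, but that state includes a summary of past activity rather than the past states themselves.

    # Present state: current balances plus metadata summarizing past activity.
    state = {
        "balances": {"A": 5, "B": 0},
        "metadata": {"A": {"flagged": True}},  # e.g. set by some earlier transaction
    }

    def allow_transfer(state, src, dst, amount):
        if state["balances"].get(src, 0) < amount:
            return False  # insufficient present balance
        if state["metadata"].get(src, {}).get("flagged"):
            return False  # past states matter, but only via this metadata summary
        return True

    print(allow_transfer(state, "A", "B", 5))  # False: A has $5, but is flagged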
But a conversation with an LLM doesn't only depend on the last message; it depends on all previous messages. Sure, the number of messages is limited by the context window, so you can indeed say it's a high-order Markov chain (it can only see N states back), but theoretically, what would an actual non-Markovian LLM look like? Something with an infinite context? That's not physically possible... At this point it sounds like "it's a Markov chain" is not a very useful statement.
You appear to be confusing the UI over the LLM with the actual working of the LLM. The LLM itself never allows for more than one message to exist: you put in one message and you get one word (well, one token) back.
I have an idea of how they work. At the end of the day, it's all just numbers. You say you put in "one message", but in reality it's N input numbers corresponding to the tokens (where N can vary from 1 to any size). Basically N variables. An LLM is one large function which takes N values as input and produces a new output value. So you can view an LLM as y = f(a, b, c, ...).
What I don't understand is how the line is drawn between "past" and "present". Why are variables a,b,c viewed as "present" and not "past"? How is past defined? Why can't you view token generation as f(past1, past2, past3, present)?
As I said, it does look "Markovian" if the number of "pasts" is fixed, but again, what exactly is not Markovian then? Infinite inputs? Everything in the real world is finite.
Let's keep it simple and say that the procedure of an LLM is roughly (a Python sketch; llm() and the prompt tokens are assumed given):

    from collections import deque

    state = deque(prompt_tokens, maxlen=n)  # only the n most recent tokens survive
    while True:
        new_token = llm(state)   # each call sees only the current state
        state.append(new_token)  # with maxlen set, the oldest token is dropped automatically
Any given time that llm() is called, it is blind to previous invocations. The state may contain evidence of them, or it may not. It's a free-form data structure that is sometimes filled with user input and sometimes contains previous llm() outputs, but llm() doesn't care.
So any pure function is Markovian then? They too only depend on their input and are blind to previous invocations. I’m just trying to understand the practical purpose of calling something a Markov chain beyond small sizes, because if we allow the state to be of any size, I’m not sure what the usefulness is: anything can be called a Markov chain if you concatenate all input data of any size into one blob and interpret it as "one state."
I guess it would be more accurate to say that at its core a Markovian process is based around a pure function. Usually there's some kind of feedback loop around the function that allows the process to evolve in some way and continue producing output.
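For instance, a minimal sketch (the update rule is arbitrary): step() is a pure function of its argument alone, and the surrounding feedback loop is what turns it into an evolving process.

    # step() is pure: its output depends only on its input, never on prior calls.
    def step(state: int) -> int:
        return (state * 1103515245 + 12345) % 2**31  # arbitrary deterministic update

    state = 42
    for _ in range(5):
        state = step(state)  # the only "memory" is what the loop feeds back
        print(state)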
>I’m pretty sure anything can be modeled using pure functions.
We're not talking about theoretical models of black boxes, though, we're talking about known mechanisms. A recursive function doesn't become iterative just because it could be thought of as iterative (because eventually the CPU does loop over some sequence of instructions). We still say it's recursive.
>I’m not sure what the usefulness is
I suppose part of the usefulness is to demystify LLMs. If you can understand what Markov chains do, the thought "oh, an LLM is just a much more complex Markov chain" can help in reasoning about them.
If you include the entire past history as part of the state, then any stochastic process becomes a Markov process.
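A toy example of that trick (the update rule is made up): a process whose next value depends on the last two values is not Markov over single values, but becomes Markov once you fold one extra step of history into the state.

    import random

    # The next value depends on the last TWO values, so over single values
    # this process is not Markov -- but over pairs (prev_prev, prev) it is.
    def next_value(prev: int, prev_prev: int) -> int:
        return prev + prev_prev + random.choice([0, 1])

    x_prev_prev, x_prev = 0, 1
    for _ in range(5):
        x = next_value(x_prev, x_prev_prev)
        x_prev_prev, x_prev = x_prev, x  # the pair is the augmented Markov state
        print(x)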
Any computer program you could run (without network stuff or other inputs, for a time; assume your computer has a true RNG module) is a Markov process. Just say that your state space is the space of possible configurations of the computer's memory.
Saying that the LLM is “just” a Markov process seems to be doing something a bit weird with that “just”.
>Any computer program you could run (without network stuff or other inputs, for a time; assume your computer has a true RNG module) is a Markov process. Just say that your state space is the space of possible configurations of the computer's memory.
No. The entire computer is a Markov process in that situation. An individual process may still remember its past states and therefore not be Markovian. See the example I gave about allowing or blocking transactions.
By “memory” I was including whatever is on disk, and the values of all the CPU registers, and all the CPU flags, etc.
So, yeah, the state of the computer (regarded as an abstract machine, not viewed at the level of the physical hardware)
Oh, you mean because I said a particular program, and other things on the computer could interfere with the program. Ok, point. I was imagining a computer that just had a single program running on it (with no separate OS) I guess? But yeah, good point, I certainly didn’t make my idea precise enough (in my own mind or in my words).
Yes, I understood what you meant and my answer assumes that the contents of all memory -- volatile and persistent -- constitute the state of the machine.
It doesn't matter whether the program is the only thing the computer is doing or whether there are other things running on the same machine. The program considered as a machine unto itself is a conceptually different thing from the underlying hardware. The Markov property is a way to reason about how a process interacts with its state. "I could design a Markov process that's semantically equivalent to a non-Markov process just by including more state" is a pointless statement. The whole point of the distinction is to reason about the thing you're studying.
>what would an actual non-Markovian LLM look like? Something with an infinite context? That's not physically possible... At this point it sounds like "it's a Markov chain" is not a very useful statement.
Uh... You're moving the goalposts. I don't know the exact definition of "LLM", but let's suppose that it and the definition of "Markov chain" make us conclude that LLMs are necessarily Markov chains. "It's a Markov chain" is still a useful statement, because there are processes that are neither LLMs nor Markov chains.
The point being made is equivalent to pointing out that, by the strict mathematical definition, there are no physically realizable, distinguishably non-Markovian processes, because computers are finite objects. Yes, you can talk about Markov chains as covering everything up to and including actual human brains, but the kind of chains people actually call Markovian in practice is much more specific than that: typically systems with structurally simple, indexical nodes, sometimes extending to more complex ideas like RNNs, where the states are at least of finite bandwidth. An LLM not only has continuous-valued nodes; it has nodes whose description length grows proportionally to, and theoretically unboundedly with, the size of its history.