1. Go out every morning to work in your field.
2. When the Sun rises, make a note on the same fixed piece of wood, e.g., a fence.
3. Observe the leftmost and rightmost positions; these are your solstices.
4. You can now use your fence to identify and predict solstices.
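Purely as an illustration of why the two extreme marks land on the solstices (a sketch of mine, not from the thread): sunrise azimuth over a year, ignoring refraction and horizon height, with a made-up latitude.

```python
import math

# Sketch only: sunrise azimuth over a year at a hypothetical latitude,
# ignoring refraction and horizon height. The extreme azimuths fall on the
# solstices, which is exactly what the fence marks record.

LATITUDE_DEG = 48.0  # hypothetical observer latitude (must be below ~66.5)

def solar_declination_deg(day_of_year: int) -> float:
    # Simple approximation of the Sun's declination in degrees.
    return -23.44 * math.cos(2 * math.pi * (day_of_year + 10) / 365.0)

def sunrise_azimuth_deg(day_of_year: int, lat_deg: float = LATITUDE_DEG) -> float:
    # At altitude 0, cos(azimuth from north) = sin(declination) / cos(latitude).
    dec = math.radians(solar_declination_deg(day_of_year))
    lat = math.radians(lat_deg)
    return math.degrees(math.acos(math.sin(dec) / math.cos(lat)))

azimuths = {day: sunrise_azimuth_deg(day) for day in range(1, 366)}
print("northernmost sunrise (June solstice): day", min(azimuths, key=azimuths.get))
print("southernmost sunrise (December solstice): day", max(azimuths, key=azimuths.get))
```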
I have read that the auspicious date of December 25th may have been intended to be the solstice, but that the degree of error in "making a note on a fence" is why we have the 25th.
The AI hate is unreasonably strong right now. People are acting like adding one feature they don't like or need to a browser is a borderline critical offense because it is an AI feature. I find it shocking how quickly the public in the US/EU developed this sort of hate towards one of the most interesting technologies of the last few decades.
Let's say you went to a library to find a book for a thesis, but the librarian instead insists on spinning tales and wasting your time. It's fun in a comedy show, but not so fun when you want to get something done. LLM technology is nice, but not everyone wants a hallucination machine, especially on their own computer. It would be another matter if Mozilla, Google, or Microsoft were offering free laptops.
It is interesting, but that's not the feature that people hate. They hate the monitoring, the power consumption, the inaccuracy, and the social and intellectual stupefaction.
I use LLMs quite frequently, but there are some places I do not want them. "Use AI to chat with your PDF!" The only thing I'd want to have it remotely touch in my browser is translations.
Not sure if serious... but just in case, very simply put...
DC pulls water out of local water supply.
DC uses evaporative cooling (not all use closed systems, and even those that do see some loss over time)
Water lost to cooling is now in the atmosphere.
If the DC (and other local users) withdraw water faster than local conditions allow it to be replenished, you end up without any local water.
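A toy balance, just to make that last point concrete; every number here is invented for illustration and describes nothing real.

```python
# Toy balance for the point above: if total withdrawal outpaces recharge, the
# local supply eventually drains. Every number is made up for illustration.

local_supply_ML = 10_000.0       # hypothetical local reservoir/aquifer (megalitres)
recharge_ML_per_day = 5.0        # hypothetical natural replenishment
dc_withdrawal_ML_per_day = 8.0   # hypothetical evaporative-cooling draw
other_users_ML_per_day = 2.0     # hypothetical municipal/agricultural draw

days = 0
while local_supply_ML > 0:
    net = recharge_ML_per_day - (dc_withdrawal_ML_per_day + other_users_ML_per_day)
    local_supply_ML += net
    days += 1

print(f"supply exhausted after roughly {days} days")  # ~2000 days with these numbers
```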
of course not, but as far as i understand there are a few factors that are relevant for local water supplies:
- evaporation from cooling. the water will come down as rain again, but not necessarily in the same region
- when disposing of the water into the sewers, the water might get "lost" into the oceans, where it's not available as drinking water
- when disposing of water used for cooling into the rivers it was taken from, there might be environmental issues with water temperature. i know this is an issue with rivers in europe, where industry is allowed to measure and report its own adherence to the laws on maximum allowed water temperatures and, to no one's surprise, the rivers are too warm.
so water is not destroyed, but it can be made unusable or unavailable for the locally intended purpose.
They don't have any more juice left to squeeze on that front. The lack of new ideas in LMs is pretty palpable by now. There is a bunch of companies with billions invested in them that are all just looking at each other, trying to figure out what to do.
Both Anthropic and Google have clear directions. Anthropic is trying to corner the software developer market and succeeding; Google is doing deep integration with their existing products. There's also Deepseek, who seem hell-bent on making the cheapest SotA models and supplying the models people can use for research on applications. Even Grok is fairly mission-focused with its X integration.
I don't really buy this post. LLMs are still pretty weak at long contexts and asking them to find some patterns in data usually leads to very superficial results.
No one said you cannot run LLMs with the same task more than once. For my local tooling, I usually use the process of "Do X with previously accumulated results, add new results if they come up, otherwise reply with just Y" and then put that into a loop until the LLM signals it's done. Software-wise, you could make it continue beyond that too, for extra assurance.
In general for chat platforms you're right, though: uploading/copy-pasting long documents and asking the LLM to find not one but multiple needles in a haystack tends to give you really poor results. You need a workflow/process for getting accuracy on those sorts of tasks.
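A minimal sketch of the accumulate-in-a-loop process described above, assuming a placeholder call_llm() function and a DONE_MARKER convention of my own; it is not any particular vendor's API.

```python
# Sketch of the "loop until the model says it's done" process described above.
# call_llm() is a placeholder for whatever client you use (a local model, an
# API, ...); DONE_MARKER is my own convention, not a standard.

DONE_MARKER = "NO_NEW_RESULTS"

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def accumulate(task: str, max_rounds: int = 10) -> list[str]:
    results: list[str] = []
    for _ in range(max_rounds):  # hard cap so a chatty model can't loop forever
        accumulated = "\n".join(results) or "(none)"
        prompt = (
            f"{task}\n\n"
            f"Previously accumulated results:\n{accumulated}\n\n"
            f"Add any new results not listed above, one per line. "
            f"If there is nothing new, reply with exactly {DONE_MARKER}."
        )
        reply = call_llm(prompt).strip()
        if reply == DONE_MARKER:
            break
        results.extend(line for line in reply.splitlines() if line.strip())
    return results
```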
Then you have non-superficial results based even on long contexts. Wasn't it clear I was responding directly to parent's concerns? Could I have made it clearer?
What I was getting at is that running LLMs in a loop doesn't really address those concerns at all. You simply get more information, it doesn't do anything for the correctness of that information.
Any patterns it identifies could be imaginary, and you don't have any indication of confidence that all relevant trends have been identified. The most valuable patterns are likely going to be the most hidden and easily missed - those are the ones people are paid to find, not just the low hanging fruit that can be identified with basic statistics.
In the same way that asking an LLM to review some code might produce a hundred superficial comments and miss the things that actually matter.
As expected the LLM does understand what you're trying to say. A rational actor would use this opportunity to update their priors but something tells me you're not a rational actor. Pity.
This was on ChatGPT; feel free to try it yourself, since you probably think it's fake.
Prompt:
Do you understand what user 2 is trying to say in below context?
User1: No one said you cannot run LLMs with the same task more than once
User2: no one said you can't turn on the radio and start listening to static
User2: Ask the LLM if it understands what I'm trying to say
Response:
Yes. User 2 is making an analogy, not a literal rebuttal.
What they’re trying to say is roughly this:
> Just because something is allowed or possible doesn’t mean it’s meaningful or useful.
More specifically:
User1’s claim:
“You can run an LLM on the same task more than once.”
→ This frames repeatability as neutral or acceptable.
User2’s analogy:
“You can turn on the radio and listen to static.”
→ Yes, it’s allowed — but it’s noise, not signal.
So User 2 is implying that:
Re-running an LLM on the same task (especially without controlling randomness, evaluation criteria, or epistemic grounding) may just produce stochastic variation rather than new information.
Treating multiple runs as meaningful evidence can be like mistaking static for content.
In epistemic terms, they’re criticizing procedural permissibility being confused with epistemic value. The analogy suggests:
In general, I don't find it convincing to use patent-based metrics as the only argument. The patent writing culture might have changed in various ways that are not related to the quality of the ideas. This undermines the entire argument...
Indeed, the researchers undermined their own argument, even:
> ... it is notable that contrary to their main results, Fort et al. find that the stock market value of the average patent has actually fallen over time.
Their methodologies are very indirect and yield contradictory results.
Trying to decide if a patent is important by looking at the evolution of word use doesn't sound robust, nor does looking at the stock market. When Google invented the transformer algorithm, I don't think there was a sudden jump in their stock price. There are lots of papers and people can't evaluate their value immediately like that. Stock prices move in response to earnings, not patents or papers. I don't remember ever hearing about a sudden stock price jump because a patent was filed.
There's lots of other questionable stuff in this argument. How are they defining researcher, for one? For US tax purposes it's common to define all software development as R&D. If they're using similar data then the huge growth of the software industry would make it appear like research productivity has fallen.
I would say it was discovered, not invented. People were messing around with some algorithms, intrigued by their results. Eventually researchers discovered that using a certain training algorithm with certain data can lead to really wonderful outputs. But this is pure empirical discovery.
No AI researcher from 2010 would have predicted that the transformer architecture (if we could send them the description back in time), SGD, and Web crawling could lead to very coherent and useful LMs.
Yup. LLMs are one big statistical model, where no sub-part knows the whole. If it's really similar to a brain, I guess we might say we discovered it. But if it isn't, we invented it. The fact that it is so useful doesn't have to mean that "it arrived".
Not sure why that's contorting; a Markov model is anything where you know the probability of going from state A to state B. The state can be anything. When it's text generation, the state is the previous text and a transition appends one extra character, which is true for both LLMs and old-school n-gram Markov models.
A GPT model would be modelled as an n-gram Markov model where n is the size of the context window. This is slightly useful for getting some crude bounds on the behaviour of GPT models in general, but is not a very efficient way to store a GPT model.
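To make the "state = last k tokens" framing concrete, here is a toy word-level order-k Markov text generator (my own illustration, not anyone's production code); the point in the thread is that an LLM produces the same kind of next-token distribution, just computed by a learned function instead of a lookup table.

```python
import random
from collections import defaultdict

# Toy order-k, word-level Markov text model: the "state" is the last k words,
# and transition probabilities are just observed continuation frequencies.

def build_table(words: list[str], k: int) -> dict[tuple, list[str]]:
    table: dict[tuple, list[str]] = defaultdict(list)
    for i in range(len(words) - k):
        state = tuple(words[i:i + k])
        table[state].append(words[i + k])   # duplicates encode the frequencies
    return table

def generate(table: dict[tuple, list[str]], seed: tuple, length: int) -> list[str]:
    out = list(seed)
    for _ in range(length):
        state = tuple(out[-len(seed):])      # current state: the last k words
        choices = table.get(state)
        if not choices:                      # unseen state: the chain has nowhere to go
            break
        out.append(random.choice(choices))
    return out

corpus = "the cat sat on the mat and the cat sat on the hat".split()
table = build_table(corpus, k=2)
print(" ".join(generate(table, seed=("the", "cat"), length=10)))
```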
I'm not saying it's an n-gram Markov model or that you should store them as a lookup table. Markov models are just a mathematical concept that don't say anything about storage, just that the state change probabilities are a pure function of the current state.
Yes, technically you can frame an LLM as a Markov chain by defining the "state" as the entire sequence of previous tokens. But this is a vacuous observation under that definition: literally any deterministic or stochastic process becomes a Markov chain if you make the state space flexible enough. A chess game is a "Markov chain" if the state includes the full board position and move history. The weather is a "Markov chain" if the state includes all relevant atmospheric variables.
The problem is that this definition strips away what makes Markov models useful and interesting as a modeling framework. A "Markov text model" is a low-order Markov model (e.g., n-grams) with a fixed, tractable state and transitions based only on the last k tokens. LLMs aren't that: they condition on un-fixed long-range context (up to the window). For Markov chains, k is non-negotiable. It's a constant, not a variable. Once you make it a variable, nearly any process can be described as Markovian, and the word is useless.
Sure many things can be modelled as Markov chains, which is why they're useful. But it's a mathematical model so there's no bound on how big the state is allowed to be. The only requirement is that all you need is the current state to determine the probabilities of the next state, which is exactly how LLMs work. They don't remember anything beyond the last thing they generated. They just have big context windows.
The whole point of the "Markov property" is that the next state depends only on the current state, not on the history.
And in classes, the very first trick you learn to skirt around history is to add Boolean variables to your "memory state". Your system now models "did it rain on each of the previous N days?" The issue, obviously, is that this is exponential if you're not careful. Maybe you can get clever by just making your state a "sliding window history"; then it's linear in the number of days you remember. Maybe mix both. Maybe add even more information. Tradeoffs, tradeoffs.
I don't think LLMs embody the Markov property at all, even if you can make everything eventually follow the Markov property by just "considering every single possible state", of which there are (size of token set)^(length) at minimum because of the KV cache.
The KV cache doesn't affect it because it's just an optimization. LLMs are stateless and don't take any other input than a fixed block of text. They don't have memory, which is the requirement for a Markov chain.
Have you ever actually worked with a basic markov problem?
The Markov property states that the transition probabilities to your next state depend entirely on the current state.
These states inhabit a state space. The way you encode "memory" if you need it, e.g. say you need to remember whether it rained each of the last 3 days, is by expanding said state space. In that case, you'd go from 1 state variable to 3, i.e. 2^3 states if you need the precise binary information for each day. Being "clever", maybe you assume only the # of days it rained in the past 3 days matters, and you can get a 'linear' amount of memory.
Sure, an LLM is a "Markov chain" of state space size (# tokens)^(context length), at minimum. That's not a helpful abstraction and defeats the original purpose of the Markov observation. The entire point of the Markov observation is that you can represent a seemingly huge predictive model with just a couple of variables in a discrete state space, and ideally you're the clever programmer/researcher and can significantly collapse said space by being, well, clever.
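A minimal sketch of the state-space expansion described above, with invented transition probabilities: the "state" is the tuple of the last 3 days' rain outcomes (2^3 = 8 states), and sliding the window forward is the transition.

```python
import random
from itertools import product

# Sketch of the state-space expansion described above: to make "rain given the
# last 3 days" Markov, the state becomes the tuple of the last 3 days' outcomes
# (2**3 = 8 states). The probabilities below are invented for illustration.

STATES = list(product((False, True), repeat=3))  # (3 days ago, 2 days ago, yesterday)

def p_rain_today(state: tuple[bool, bool, bool]) -> float:
    # Hypothetical rule: more recent rainy days -> higher chance of rain today.
    return 0.1 + 0.25 * sum(state)

def step(state: tuple[bool, bool, bool]) -> tuple[bool, bool, bool]:
    rained = random.random() < p_rain_today(state)
    return (state[1], state[2], rained)          # slide the 3-day window forward

state = (False, False, False)
for _ in range(10):
    state = step(state)
print("state after 10 days:", state)
```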
>Sure many things can be modelled as Markov chains
Again, no they can't, unless you break the definition. K is not a variable. It's as simple as that. The state cannot be flexible.
1. The markov text model uses k tokens, not k tokens sometimes, n tokens other times and whatever you want it to be the rest of the time.
2. A Markov model is explicitly described as 'assuming that future states depend only on the current state, not on the events that occurred before it'. Defining your 'state' such that every event imaginable can be captured inside it is a 'clever' workaround, but it is ultimately describing something that is decidedly not a Markov model.
It's not n tokens sometimes and k tokens some other times. LLMs have fixed context windows; you just sometimes have less text, so the window isn't full. They're pure functions from a fixed-size block of text to a probability distribution over the next character, same as the classic lookup-table n-gram Markov chain model.
1. A context limit is not a Markov order.
An n-gram model’s defining constraint is: there exists a small constant k such that the next-token distribution depends only on the last k tokens, full stop. You can't use a k-trained markov model on anything but k tokens, and each token has the same relationship with each other regardless. An LLM’s defining behavior is the opposite: within its window it can condition on any earlier token, and which tokens matter can change drastically with the prompt (attention is content-dependent). “Window size = 8k/128k” is not “order k” in the Markov sense; it’s just a hard truncation boundary.
2. “Fixed-size block” is a padding detail, not a modeling assumption.
Yes, implementations batch/pad to a maximum length. But the model is fundamentally conditioned on a variable-length prefix (up to the cap), and it treats position 37 differently from position 3,700 because the computation explicitly uses positional information. That means the conditional distribution is not a simple stationary “transition table” the way the n-gram picture suggests.
3. “Same as a lookup table” is exactly the part that breaks.
A classic n-gram Markov model is literally a table (or smoothed table) from discrete contexts to next-token probabilities. A transformer is a learned function that computes a representation of the entire prefix and uses that to produce a distribution. Two contexts that were never seen verbatim in training can still yield sensible outputs because the model generalizes via shared parameters; that is categorically unlike n-gram lookup behavior.
I don't know how many times I have to spell this out for you. Calling LLMs markov chains is less than useless. They don't resemble them in any way unless you understand neither.
I think you're confusing Markov chains and "Markov chain text generators". A Markov chain is a mathematical structure where the probabilities of going to the next state only depend on the current state and not the previous path taken. That's it. It doesn't say anything about whether the probabilities are computed by a transformer or stored in a lookup table, it just exists. How the probabilities are determined in a program doesn't matter mathematically.
Just a heads-up: this is not the first time somebody has to explain Markov chains to famouswaffles on HN, and I'm pretty sure it won't be the last. Engaging further might not be worth it.
I did not even remember you and had to dig to find out what you were on about. Just a heads up, if you've had a previous argument and you want to bring that up later then just speak plainly. Why act like "somebody" is anyone but you?
My response to both of you is the same.
LLMs do depend on previous events, but you say they don't because you've redefined state to include previous events. It's a circular argument. In a Markov chain, state is well defined, not something you can insert any property you want to or redefine as you wish.
It's not my fault neither of you understand what the Markov property is.
By that definition n-gram Markov chain text generators also include previous state because you always put the last n grams. :) It's exactly the same situation as LLMs, just with higher, but still fixed n.
We've been through this. The context of an LLM is not fixed. Context windows ≠ n-gram orders.
They don't because n-gram orders are too small and rigid to include the history in the general case.
I think srean's comment up the thread is spot on. This current situation where the state can be anything you want it to be just does not make a productive conversation.
'A Markov chain is a mathematical structure where the probabilities of going to the next state only depend on the current state and not the previous path taken.'
My point, which seems so hard to grasp for whatever reason, is that in a Markov chain, the state is a well-defined thing. It's not a variable you can assign any property to.
LLMs do depend on the previous path taken. That's the entire reason they're so useful! And the only reason you say they don't is because you've redefined 'state' to include that previous path! It's nonsense. Can you not see the circular argument?
The state is required to be a fixed, well-defined element of a structured state space. Redefining the state as an arbitrarily large, continuously valued encoding of the entire history is a redefinition that trivializes the Markov property, which a Markov chain should satisfy. Under your definition, any sequential system can be called Markov, which means the term no longer distinguishes anything.
> An n-gram model’s defining constraint is: there exists a small constant k such that the next-token distribution depends only on the last k tokens, full stop.
I don't necessarily agree with GP, but I also don't think that a markov chain and markov generator definitions include the word "small".
That constant can be as large as you need it to be.
QM and GR can be written as matrix algebra, atoms and electrons are QM, chemistry is atoms and electrons, biology is chemistry, brains are biology.
An LLM could be implemented with a Markov chain, but the naïve matrix is ((vocab size)^(context length))^2, which is far too big to fit in this universe.
Like, by the Bekenstein bound, writing out the transition matrix for an LLM with just 4k context (and a 50k vocabulary) at just one bit of resolution, the first row alone (out of a bit more than 10^18795 rows) ends up needing a black hole >10^9800 times larger than the observable universe.
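Rough arithmetic behind those row counts, for anyone who wants to check (the black-hole comparison itself is the commenter's and is not reproduced here):

```python
import math

# Rough arithmetic behind the comment above: an explicit transition matrix for a
# 50k-vocabulary, 4k-context model. Only the row count and raw bit count are
# computed; the Bekenstein-bound comparison is left to the commenter.

vocab = 50_000
context = 4_000

log10_states = context * math.log10(vocab)   # number of rows, as a power of 10
log10_matrix_bits = 2 * log10_states         # (#states)^2 entries at 1 bit each

print(f"rows ~ 10^{log10_states:.0f}")       # ~10^18796
print(f"bits ~ 10^{log10_matrix_bits:.0f}")  # ~10^37592
```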
Yes, sure enough, but brains are not ideas, and there is no empirical or theoretical model for ideas in terms of brain states. The idea of unified science all stemming from a single ultimate cause is beautiful, but it is not how science works in practice, nor is it supported by scientific theories today. Case in point: QM models do not explain the behavior of larger things, and there is no model which gives a method to transform from quantum to massive states.
The case for brain states and ideas is similar to QM and massive objects. While certain metaphysical presuppositions might hold that everything must be physical and describable by models for physical things, science, which should eschew metaphysical assumptions, has not shown that to be the case.
Markov models with more than 3 words as "context window" produce very unoriginal text in my experience (corpus used had almost 200k sentences, almost 3 million words), matching the OP's experience. These are by no means large corpuses, but I know it isn't going away with a larger corpus.[1] The Markov chain will wander into "valleys" of reproducing paragraphs of its corpus one for one because it will stumble upon 4-word sequences that it has only seen once. This is because 4 words form a token, not a context window. Markov chains don't have what LLMs have.
If you use syllable-level tokens in a Markov model, the model can't form real words much beyond the second syllable, and you have no way of making it make more sense other than increasing the token size, which exponentially decreases originality. This is the simplest way I can explain it, though I had to address why scaling doesn't work.
[1] There are roughly 400000^4 possible 4-word sequences in English (barring grammar), meaning only a corpus with 8 times that number of words and with no repetition could offer two ways to chain each possible 4-word sequence.
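A short sketch of one way to measure the "valley" effect described above: the fraction of k-word contexts with exactly one observed continuation, since those force the chain to reproduce the corpus verbatim. The corpus file name is hypothetical.

```python
from collections import defaultdict

# Fraction of k-word contexts that have only one observed continuation; those
# contexts force verbatim reproduction of the corpus. "corpus.txt" is a
# hypothetical file, stand in your own tokenized corpus.

def unique_continuation_fraction(words: list[str], k: int) -> float:
    continuations: dict[tuple, set] = defaultdict(set)
    for i in range(len(words) - k):
        continuations[tuple(words[i:i + k])].add(words[i + k])
    single = sum(1 for nexts in continuations.values() if len(nexts) == 1)
    return single / max(1, len(continuations))

words = open("corpus.txt", encoding="utf-8").read().split()
for k in (2, 3, 4):
    print(k, unique_continuation_fraction(words, k))
```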
What do you mean? The states are fully observable (current array of tokens), and using an LLM we calculate the probabilities of moving between them. What is not MC about this?
I suggest getting familiar with or brushing up on the differences between a Markov Chain and a Markov Model. The former is a substantial restriction of the latter. The classic by Kemeny and Snell is a good readable reference.
MCs have a constant and finite context length: their state is the most recent k-tuple of emitted symbols, and the transition probabilities are invariant (to time and to the tokens emitted).
LLMs definitely also have finite context length. And if we consider padding, it is also constant. The k is huge compared to most Markov chains used historically, but it doesn't make it less finite.
What do you mean? I can only input k tokens into my LLM to calculate the probs. That is the definition of my state. In the exact way that N-gram LMs use N tokens, but instead of using ML models, they calculate the probabilities based on observed frequencies. There is no unbounded context anywhere.
You can certainly feed k-grams one at a time, estimate the probability distribution over the next token, use that to simulate a Markov chain, and reinitialize the LLM (drop context) at each step. In this process the LLM is just a lookup table used to simulate your MC.
But an LLM on its own doesn't drop context to generate; its transition probabilities change depending on the tokens.
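A sketch of that procedure, assuming a placeholder next_token_distribution() function (not a real library call): truncating to the last k tokens and re-initializing each step turns the LLM into a lookup for a genuine order-k Markov chain.

```python
import random

# Sketch of the procedure described above: re-initialize (drop context) every
# step and feed the LLM only the last k tokens, so it acts as a lookup table
# for a genuine order-k Markov chain. next_token_distribution() is a placeholder
# for whatever inference API you use; it is not a real library call.

def next_token_distribution(tokens: list[str]) -> dict[str, float]:
    raise NotImplementedError("call your LLM here with ONLY these k tokens")

def simulate_order_k_chain(seed: list[str], k: int, steps: int) -> list[str]:
    out = list(seed)
    for _ in range(steps):
        state = out[-k:]                        # the chain's state: last k tokens only
        dist = next_token_distribution(state)   # fresh call, no accumulated context
        tokens, probs = zip(*dist.items())
        out.append(random.choices(tokens, weights=probs, k=1)[0])
    return out
```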