Lisp macros allow a general solution to this that doesn't just handle chained collection operators but allows you to decide the order in which you write any chain of calls.
For example, we can write:
(foo (bar (baz x))) as
(-> x baz bar foo)
If there are additional arguments, we can accommodate those too:
(sin (* x pi)) as
(-> x (* pi) sin)
Here the expression so far gets inserted as the first argument to each form. If you want it inserted as the last argument, you can use ->> instead:
(filter positive? (map sin x)) as
(->> x (map sin) (filter positive?))
You can also get full control of where to place the previous expression using as->.
Full details at https://clojure.org/guides/threading_macros
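For instance (a small made-up illustration), as-> binds the threaded value to a name you choose, so you can drop it into any argument position:

    (as-> 10 $
      (/ 100 $)  ; (/ 100 10) => 10
      (- $ 3)    ; (- 10 3)   => 7
      (* 2 $))   ; (* 2 7)    => 14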
I find the threading operators in Clojure bring much joy and increase readability. Interestingly, they make me consider function argument order much more carefully, because I want to increase the opportunities to use them.
Yeah, I found this when I was playing around with Hy a while back. I wanted a generic `->` style operator, and it wasn't too much trouble to write a macro to introduce one.
That's sort of an argument for the existence of macros as a whole: you can't really do this as neatly in something like Python (although I've tried). I can see the downside of working in a codebase with hundreds of these kinds of custom language features, though.
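For reference, here's a minimal sketch of such a macro in Clojure (my-> is a hypothetical name; clojure.core's real -> works essentially like this):

    (defmacro my-> [x & forms]
      (reduce (fn [acc form]
                (if (seq? form)
                  ;; insert the accumulated expression as the first argument
                  (cons (first form) (cons acc (rest form)))
                  ;; bare symbol: just wrap it as a call
                  (list form acc)))
              x
              forms))

    (macroexpand '(my-> x (* pi) sin)) ; => (sin (* x pi))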
Yes, threading macros are so much nicer than method chaining, because they allow general function reuse rather than limiting you to the methods that happen to be defined on your initial data object.
Adam Smith would be considered a right-wing economist today, similar to Milton Friedman, perhaps with more support for certain government functions such as education. However, even on that issue, given the wildly different level of education available today, it's not easy to predict what he would think.
Often quotations from him are taken wildly out of context to support a 'progressive' viewpoint. An example (from David Friedman's blog):
"People of the same trade seldom meet together, even for merriment and diversion, but the conversation ends in a conspiracy against the public, or in some contrivance to raise prices.”
But, if we keep reading:
It is impossible indeed to prevent such meetings, by any law which either could be executed, or would be consistent with liberty and justice. But though the law cannot hinder people of the same trade from sometimes assembling together, it ought to do nothing to facilitate such assemblies, much less to render them necessary.
Adam Smith was a proponent of free trade, free markets, and flat taxation (proportional to income). Not very socialist.
Similar argument to the article, but in much more depth.
In a nutshell, the UK has made it legally difficult to construct new housing and many forms of infrastructure: electricity plants, roads/tunnels, railways, hospitals, etc.
As a result, these things take up a larger fraction of ordinary people's budgets and also limit mobility, and hence productivity, resulting in poverty or effective poverty even for someone whose income would make them globally rich.
The number of correct digits in the answer grows exponentially with the iteration count: "quadratic convergence" means we roughly square the error each time, so after n iterations the initial error has been raised to the power 2^n.
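A quick way to see it (a toy sketch; Newton's method for sqrt(2) standing in for any quadratically convergent iteration):

    ;; Each Newton step roughly squares the error, so the number of
    ;; correct digits approximately doubles per iteration.
    (loop [x 1.0, i 0]
      (when (< i 6)
        (println i x (Math/abs (- x (Math/sqrt 2.0))))
        (recur (/ (+ x (/ 2.0 x)) 2.0) (inc i))))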
Educational spending per pupil has gone up in real terms, by about 2.5x over 40 years. That figure is already inflation-adjusted, so no further comparison to the inflation rate is needed. I'd say a 2.5x increase is pretty dramatic!
I'm not "seeing" a neo-segregationist about school funding policy lol nice try though.
But anyway hasn't the GDP and accumulated wealth of the nation increased much more over that length of time? How much of that new wealth is it appropriate to spend on education?
It might be appropriate to spend a lot on education, if there were a benefit! But as you can see in the first chart, it's not at all obvious that all this increased spending has had any measurable positive effect.
I also find there are benefits to spending on (my own) education. But if we can't show a benefit for a specific kind of public spending, is it justified to take others' money and spend it on "education" in a way that has no measurable impact? I don't think so.
The average of its errors is 0, i.e. it is not biased. Of course the average absolute error, which in everyday English one could very reasonably call the "average error", is much greater than 0.
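A tiny worked example (numbers made up): errors of +2 and -2 average to (2 + -2)/2 = 0, so the estimator is unbiased, yet the average absolute error is (|+2| + |-2|)/2 = 2.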
That may technically be correct, but it is incorrect in the real world. I submit that error is error in the real world. Mathematics can go jump off a cliff unless it wants to be helpful. :)
LLMs are nothing like the Markov chains I've come across or worked with. I've never used a Markov chain trained with anywhere near the number of states you'd need.
This feels like saying "it's just if statements". Sure, but that's also a terrible mental model for how to use them.
Any system with a memoryless transition rule (the Markov property) and full observability of the state is a Markov chain. In language models, the "traditional" Markov chain model is the very simple small-context table lookup, but in many other fields they can be a lot more complicated, e.g. Markov chain Monte Carlo and many statistical-mechanics applications.
It's a mathematically well-understood and very powerful abstraction, which is lost if they're thought of only as the simple toy language model.
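Spelled out, the Markov property just says the distribution of the next state depends only on the current one:

    P(X_{n+1} = x | X_n, X_{n-1}, ..., X_0) = P(X_{n+1} = x | X_n)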
> The term is being tortured beyond endurance by being forced to describe what LLMs do.
No it isn't; this is exactly the kind of process that Markov chains describe: a stateless probability function that you chain over and over to produce a sequence of results. That is exactly what an LLM does; it predicts the next token based on the previous tokens, repeatedly, to produce text.
There is no bending of any concepts here; anyone who has studied a higher-level probability course will instantly recognize LLMs as a Markov chain. It is common for Markov chains to have an infinite number of states, like a real-valued position you move from at random; you can model particles in a gas like that.
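To make that "next token from the previous tokens" loop concrete, here is a toy order-2 Markov text generator in Clojure (a sketch with made-up names, not anyone's production code):

    ;; Map each pair of words to the words observed after that pair.
    (defn build-chain [words]
      (reduce (fn [m [a b c]] (update m [a b] (fnil conj []) c))
              {}
              (partition 3 1 words)))

    ;; Repeatedly sample the next word given only the previous two --
    ;; the Markov step, structurally the same as LLM next-token sampling.
    (defn generate [chain start n]
      (loop [[_ b :as state] start, out (vec start), i 0]
        (if-let [nexts (and (< i n) (chain state))]
          (let [w (rand-nth nexts)]
            (recur [b w] (conj out w) (inc i)))
          out)))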
> Calling an LLM a "Markov model" reveals an agenda rather than an insight.
What agenda? Knowing that LLMs are Markov chains helps me understand them and makes me better at using them: every word is just calculated probabilistically from the previous N words, and the fastest way to describe that is to say it is a Markov chain. It also tells you why it's hard for them to coordinate their early words with things that will happen later, all that from just knowing it is a Markov chain.
I feel the knee jerk rejection of this statement is more of an agenda, why do you feel it is so wrong to say it is a Markov chain? Are you sure there is no agenda behind those feelings?
I don't think "Markov chain" is a useful description despite agreeing it's an accurate one, simply because I have a preference for applied maths over pure maths and a full transition table version would be obscenely large.
In the same vein, it's technically accurate to say that a perceptron network with one hidden layer can approximate any function: technically yes, but not in any practical sense.
But if you're working in the pure-maths domain, knowing that it can in principle be one means you can apply any tools that only work on Markov chains. Given what else I've seen mathematicians do, I won't be surprised if pure-maths researchers can get useful results even though the transition matrix is literally too large to fit into our universe without collapsing it into a black hole; it would hardly be the first time I've seen mathematicians talk about quantities that large.
The theory of Markov chains (or Markov processes in general) has nothing to say about what happens inside the function that maps observations to predictions of the next state, as long as the function is stateless/memoryless.
It is a different level of analysis from neural-network architecture, and it covers a very wide range of model types. The Markov chain structure of a model allows certain kinds of understanding of the behavior of any model fulfilling its assumptions: e.g. the finite range of dependencies, the existence of a stationary distribution, and, crucially for transformer-type architectures, the fact that they can be trained in a purely parallel manner.
The problem I have is that anything deterministic can be described as Markovian.
If an algorithm employing 8 billion parameters addressed by a context that is itself cross-linked with a key-value store qualifies as "memoryless" enough to be a Markov model, does the term have any meaning at all? You could argue that it's memoryless because the embeddings aren't modified on the fly based on the user's input, I suppose.
My point is that describing a toy Markov text generator with the same terms that you apply to GPT3.5 is, even if technically accurate, no more meaningful than using the term "Turing machine" to describe a C-64 and a Cray.
While this is absolutely correct when the RNN has real-valued internal state, in concrete implementations the internal state of an RNN has some finite number of bits, and one can draw a Markov graph whose 2^n nodes represent all possible states of n bits in some part of the system.
Likewise, the context window of an LLM can be considered a state with some number of bits. It's just more explicit in an LLM's architecture than in an RNN's runtime that the number of bits is finite.
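For scale (rough numbers, assuming a 50,000-token vocabulary and a 2,048-token context window): the context can take 50000^2048 ≈ 10^9600 distinct values, which is how many rows an explicit transition table for the equivalent Markov chain would need.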
> My point is that describing a toy Markov text generator with the same terms that you apply to GPT3.5 is, even if technically accurate, no more meaningful than using the term "Turing machine" to describe a C-64 and a Cray.
It’s not a particularly useful mental model, but it does drive home a critical point some people have trouble with, namely that LLMs don’t learn when being used.
But the context length is around the amount of text a human goes through in a day, so it can feel like learning; they can also use tools such as databases to make a direct record of "important" things, and fine-tuning on past sessions is already possible.
I'm going to ignore "proper" continuous learning until that actually gets into the big-name models, because although there's plenty of "first 90%" tech demos, the Tesla FSD is also at that point and yet the cars today still come with steering wheels. Meanwhile, here's the research SOTA: https://github.com/Wang-ML-Lab/llm-continual-learning-survey
Ironically, continuously updating a Markov chain is easy, given how small most of them are in practice. But a Markov chain large enough to act like an LLM would collapse into a black hole 7x10^32 times larger than the universe*, so that's kinda hard to update.
> But a Markov chain large enough to act like an LLM would collapse into a black hole
You mean a full-state-table Markov chain? Markov chains don't need to have all states in a table; the transition rule just needs to be a function. A table is one implementation of a function, but anything will do.
Although my understanding is that under quantum mechanics it's necessary to set the probability of every transition between states to a number larger than zero, I also understand that for practical purposes the probability of the overwhelming majority of transitions is close enough to zero as to require fancy notation (at the very least, nested exponentiation) to express quite how unlikely they are.
While it is obviously trivially (in the mathematical sense of "trivially") possible to make a Markov chain that produces the same f(input context window) => {output distribution}, that doesn't make it useful to describe LLMs as a kind of Markov chain.
After all, in the same sense I can also make a Markov chain that reproduces the behaviour of any human, or even any Turing machine with finite tape; it's just that a computer with 1 GB of memory (all kinds, including non-volatile) has 2^(8*(10^9)) states.
In my college classes, we'd sometimes get problems of the form "construct a Markov chain to model this process". The naive, obvious solution wouldn't make the probabilities work out because we'd "forget" crucial information, so we'd bolt extra bits onto the state (e.g. "flag1 flag2 flag3 ...") which made the number of nodes exponentially larger.
But the infinite-state side of Markov chains is really helpful sometimes too: e.g. the position of your random walk wouldn't have to be encoded in your "state" (which could be infinite); it'd just be returned as a value (if you were trying to compute the distance from the origin or something).
Are you guys just arguing over "at what point does this shoving a billion bits into my state make it not really a 'markov' chain like you'd traditionally think about it"? Basically the formal definition vs what we intuitively think about as the 'point' of markov processes and why the markov observation is important?
> Are you guys just arguing over "at what point does this shoving a billion bits into my state make it not really a 'markov' chain like you'd traditionally think about it"? Basically the formal definition vs what we intuitively think about as the 'point' of markov processes and why the markov observation is important?
That's an accurate description of my perception of the conversation, I can't speak for the others of course.
I've seen this nonsensical equivalence several times now so I'm going to make an example out of you to prove a point. Define the state space for the human Markov chain.
> Define the state space for the human Markov chain.
At most the Bekenstein bound for the mass of a human brain.
Is that big? Yes, too huge to bother calculating. That's why I'm using it as an example to say "it's a Markov chain" is not helpful.
But it is finite, and that means it's mathematically equivalent.
And that's before the fact that transition matrices are (in general, if not in every particular case) square, covering transitions from any one state to any other.
For fun, here's the Bekenstein bound for the equivalent transition matrix for the much smaller GPT-3 with 2048 context length, measured in (amongst other things) multiples of the size of the universe: http://www.wolframalpha.com/input/?i=sqrt%2812%2A%28%2850000...
(Assuming I didn't fluff the algebra).
Edit: I fluffed the algebra, see if you can spot the typo. It's much, much bigger.
What are the states? I didn't say give me a count of the states. I said give me the algorithm for determining whether any given state of the universe corresponds to whatever human you are reducing to a Markov chain. I don't care if it's practical or not, because it seems like you're convinced this algorithm exists even though you have never gone through the trouble of actually formalizing it. So I'm doing you a favor: you will either learn about your ignorance, or you will make your argument algorithmically formal and actually prove that people are equivalent to Markov chains.
Of all the things I've heard people being dismissive of, you're the first I've heard suggest it might possibly be more than trivial to create a map from physical objects to some enumeration.
It's not like simulated physics is somehow a novel field I've had to invent in my head; it's a field that has been performing such mappings, even for quantum systems (on smaller scales than a brain, but you yourself say "I don't care if it's practical or not"), for decades already.
In extremis (the Bekenstein bound), you've got a surface which, at a resolution of (IIRC) 2π bits per square Planck area, contains all the information about the interior. There are a lot of ways to map the surface of a sphere onto Planck areas; pick any you want. That ordering itself defines a state down to the quantum level.
There are on the order of 10^17 of those Planck areas if a human brain were compressed into a black hole, so that's the maximum information content of any possible arrangement of matter and energy with the mass of a human brain.
The only thing I'm struggling with right now is why you're so confident this is difficult that you thought to make an example out of me.
> Do you even know the definition of a Markov chain?
Dude, I've written several Markov chains, starting aged something like six while still learning to read. (Family had a Commodore 64, one of the books it came with had one as a coding example).
Markov chains are stochastic processes where the transitions between possible nodes have probabilities depending only on the current node, i.e. history is irrelevant.
So, right back at'cha if you can't see how what I wrote is a proof of equivalence: you've got a way to map reality (of a human brain) onto a number between 0 and ~2^(10^17)*.
Or perhaps, despite your previous question being answered, your actual issue is that you were unaware that the laws of quantum mechanics give you a stochastic process where the transitions between possible states have probabilities depending only on the current state?
* exponent times whatever the constant of proportionality is
Great, we're making progress. Give me the dimension of the Hilbert space, the wave function, and the unitary operator that corresponds to your daily activities in terms of a Markov chain. Take all the time you want to figure it out.
It's just a formal game, but you still haven't managed to write down the state space or specify the transition probabilities for a single atom, let alone a person.
> It's just a formal game, but you still haven't managed to write down the state space or specify the transition probabilities for a single atom, let alone a person.
False.
I've given you the state space: it's the one spanned by k*10^17 bits, i.e. 2^(k*10^17) states. That's your state space. I don't know why you're having a hard time with this concept; it doesn't seem particularly challenging to me.
I've also said, in what I thought was quite clear language, that writing down the transition probabilities for even a system significantly simpler than a human brain would require so much information that, by the Bekenstein bound, it would take a universe substantially larger than this one to avoid becoming a black hole.
If you're not going to accept that the very specific things you asked for — things which you seem to accept exist — are already sufficient to be proof that quantum mechanics and by extension all of reality including humans are in the set of things defined by Markov chains unless I literally destroy the universe by writing the function down in that fashion, then we're done.
Or are you shifting the goal posts when you wrote this?:
> and the unitary operator that corresponds to your daily activities in terms of a Markov chain
Given that what actually fits into a comment box is either a super-high-level representation, with transitions at a macro level like "sleep" -> "get up" -> "go to work" etc., or a super-low-level representation like iħ ∂/∂t |ψ> = Ĥ|ψ>, I don't see what answer could possibly satisfy you.
That's only true for Markov chains with an (effectively) infinite state space. If the state space is finite, such as with an LLM with a finite set of tokens and a finite context length, it's trivial to tell the difference between the behavior of a specific Markov chain and a more sophisticated model. Once you have a specific Markov chain, the state space is fixed, and you can just ask something that requires a larger state space.
The same basic idea can be found everywhere in mathematics and CS. Once you have chosen the value for your parameter, I can choose the value for my parameter to guarantee the desired outcome.
No, the LLM just returns a distribution over next states; you can use a genuinely random source to make the transition if you want, and that has nothing to do with the LLM itself. The pseudorandom sampling function is just an optimization, since you need special hardware to get truly random numbers.
It's because they're speaking colloquially instead of pedantically. I.e. "an LLM is just a bag of bytes" is pedantically true, but misses enough that I'd expect to get corrections :)
The abstract mentions "stationary distributions", so they try to derive something new from that borderline obvious equivalence. The Google search engine originally used the stationary distribution of a Markov chain (PageRank) to rank the importance of search results.
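For intuition, a toy sketch of that idea (a made-up two-state chain; power iteration is the same trick PageRank builds on):

    ;; Row-stochastic transition matrix: each row sums to 1.
    (def P [[0.9 0.1]
            [0.5 0.5]])

    ;; One step of pi' = pi P.
    (defn step [pi]
      (mapv (fn [j] (reduce + (map * pi (map #(nth % j) P))))
            (range (count P))))

    ;; From any starting distribution this converges to the stationary
    ;; distribution, here [5/6 1/6].
    (nth (iterate step [1.0 0.0]) 100)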
I'm not sure how this distinguishes the process from a table of all possible input context windows (and RNG state, if you want temperature) with their desired outputs.
This has been argued as a theoretical(but larger than the universe) model in philosophy for long enough that I wouldn't be surprised if it predates the notion of creating a machine intelligence.
Without any mechanism for generalisation, it can be argued that it is simply a record of an intelligence, with the actual mind being whatever created the table.
I guess you can think of Markov chains as a form of that table shrunk down to a hash table keyed on the input, storing only a set of the most probable outputs for any collisions.
With the right semantic hashing you could consider it analogous to an LLM, but I feel there's a lot resting on the term 'semantic hashing', because that's where the magic is in the LLM.