Maybe it's clear to others, but it's certainly not clear to me how exactly transformers - or rather transformer-based LLMs - are operating.
I understand how transformers work, but my mental model is that the transformer is the processor and the LLM is an application that runs on it. After all, transformers can be trained to do lots of things, and what a transformer learns when trained with a "predict next word" LLM objective is going to differ from what it learns (and hence how it operates) in a different setting.
There have been various LLM interpretation papers analyzing aspects of them, such as the discovery of pairs of attention heads in consecutive layers acting as "search and copy" "induction heads" (a toy sketch of that behaviour follows below), and analyses of the linear layers as key-value stores. That perhaps suggests another weak abstraction: the linear layers store knowledge and perhaps the reasoning "program", with the attention layers being the mechanism that is programmed to do the data tagging/shuffling?
No doubt there's a lot more to be discovered about how these LLMs are operating - perhaps a wider variety of primitives built out of attention heads beyond just induction heads? It seems a bit early to be building a high-level model of the primitives these LLMs have learnt, and I'm not sure whether attempting a crude transformer-level model really works given that the residual context is additive - it's not just tokens being moved around.
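To make the "search and copy" idea concrete, here is a minimal sketch - a plain-Python toy of my own, not a real model or anyone's published code - of the behaviour an induction head is thought to implement: find where the current token occurred earlier in the context and copy whatever followed it.

    # Toy sketch (assumed, illustrative) of induction-head behaviour:
    # "find the previous occurrence of the current token, copy its successor".
    def induction_predict(tokens):
        current = tokens[-1]
        # search earlier positions, most recent first
        for i in range(len(tokens) - 2, -1, -1):
            if tokens[i] == current and i + 1 < len(tokens):
                return tokens[i + 1]   # copy the token that followed last time
        return None                    # no earlier match: this primitive can't help

    print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> "cat"

The real mechanism is two attention heads in consecutive layers doing this with query/key matching rather than string equality, but the input/output behaviour is roughly as above.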
Transformers - and all other ML models - are ways to represent computer programs. You can think of them as a programming language designed to be easy to optimize rather than easy for humans to understand (a toy sketch of that framing is below).
It's not clear to anybody exactly what kind of program structure LLMs have internally. Figuring that out is a major goal for the field of mechanistic interpretability.
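Here is a minimal sketch of that "programming by optimization" framing - a toy I'm assuming for illustration, not anything from the article: instead of writing the rule y = 2x + 1 by hand, gradient descent finds parameters that implement it. A transformer is the same idea with vastly more parameters and structure.

    # Toy sketch: let gradient descent "write" the program y = 2x + 1.
    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.uniform(-1, 1, size=100)
    ys = 2 * xs + 1                              # the target "program"

    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(500):
        pred = w * xs + b
        grad_w = np.mean(2 * (pred - ys) * xs)   # d(MSE)/dw
        grad_b = np.mean(2 * (pred - ys))        # d(MSE)/db
        w, b = w - lr * grad_w, b - lr * grad_b

    print(round(w, 3), round(b, 3))              # ~2.0, ~1.0: the learned "program"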
I think your mental model could be making LLMs seem more confusing than they are. LLMs are stacks of transformer blocks, and generative LLMs typically add a sampling (decoding) step that turns the transformer's output distribution into tokens (a minimal sketch of that step follows below).
Maybe there are useful abstractions for analyzing them, but LLMs are just another deep learning model.
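For what it's worth, the sampling step is small. Here's a minimal sketch (illustrative only - the function name and the made-up logits are my own) of temperature sampling over a transformer's output logits.

    # Toy sketch of the decoding step on top of a transformer's output logits.
    import numpy as np

    def sample_next_token(logits, temperature=0.8, rng=np.random.default_rng()):
        z = np.asarray(logits, dtype=float) / temperature
        z -= z.max()                          # numerical stability
        probs = np.exp(z) / np.exp(z).sum()   # softmax over the vocabulary
        return rng.choice(len(probs), p=probs)

    vocab = ["the", "cat", "sat", "mat"]
    logits = [2.0, 0.5, 1.0, -1.0]            # pretend transformer output
    print(vocab[sample_next_token(logits)])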
The "attention" mechanism (a bit of a misnomer really) is what makes transformers more complex than many other neural nets - data isn't simply flowing through the model from layer to layer, but rather it is being copied and moved around by the attention heads. The "next word" it is generating doesn't even have to be a word it has ever seen before - it may be copying it from the prompt.
This has been built on extensively over the past two years. For instance: Tighter Bounds on the Expressivity of Transformer Encoders https://arxiv.org/abs/2301.10743. I find it interesting that transformers can be characterized in terms of first-order logic with counting quantifiers. Amazing what you can do even if you're not Turing complete!
Maybe, but a CPU is also Turing-complete, yet a (for example) sort program running on a CPU is just a sort program. The functionality of an LLM is defined by whatever it learnt during its (dataset-specific) training, even if that includes in-context and one-shot "learning".
You could train a Turing-complete transformer to do a different task than running an LLM, but once you've trained it to run/be an LLM, then that is what it is.
A CPU is a finite state machine, so adding an unbounded tape trivially makes it a theoretical TC machine.
The arbitrary-precision requirements on the activation functions and positional encodings are there to keep attention's dynamic reweighting values within the computable set.
Since even multi-layer neural networks produce their curves from the shifting, reflection, and summing of line segments, the results of those operations may not map to representable numbers - even given unbounded digits - when using typical activation functions.
Using an activation function that keeps results within aleph-nought - a countable infinity - is what allows it to be TC.
Probably Approximately Correct (PAC) learning is intentionally fuzzy.
The occasional gradient-loss problem with ReLU is possibly a lens to think about this through.
But the success of statistical learning over the past 30 years has been largely about having existential quantifiers with acceptable training loss - following the very useful idea from statistics that all models are wrong but some are useful.
Transformer models will most definitely be useful for some problems, but assuming that a TC result which holds only for a physically unrealizable configuration will carry over in practice will lead to wasted effort.
Simply acknowledging the potential dead ends of a technology helps not only with choosing the right path but also with recognizing early that you need to change course.
IMHO, the method of the paper in this post is far more useful as a lens for intuition.
No, they are actually very limited formally. For example you can't model a language of nested brackets to arbitrary depth (as you can with an RNN). That makes it all the more interesting that they are so successful.
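For contrast, here's a minimal sketch (a toy of my own, illustrative only) of why recurrence handles arbitrary nesting: a single running counter carried across the sequence is enough state, and that's exactly the kind of thing an RNN's hidden state can maintain over unbounded lengths.

    # Toy sketch: a running counter - RNN-style state - recognizes balanced brackets.
    def balanced(brackets):
        depth = 0                        # the "hidden state" carried across steps
        for ch in brackets:
            depth += 1 if ch == "(" else -1
            if depth < 0:                # closed more than we opened
                return False
        return depth == 0                # everything opened was closed

    print(balanced("((()())())"))        # True
    print(balanced("(()"))               # False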
Being technically (maybe) Turing complete doesn't mean we know how to program it usefully.
https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/

To be completely fair, the Transformer architecture does not map neatly into being analysed like automata and categorised in the Chomsky Hierarchy. "Neural Networks and the Chomsky Hierarchy" trains different architectures on formal languages curated from different levels of the Chomsky hierarchy.
There is also a more interactive version if you want to challenge yourself: a Python notebook of interactive puzzles for building an adder with transformers.
Is it possible to describe important science (or business) concepts without repurposing overtly religious terms like "angels", "bible", "satan", etc.?
This is cool, but I think a more fundamental primitive is the probability distribution over next tokens and how that changes with each layer's computation.
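A minimal sketch of that per-layer view (toy numbers throughout - the random "layer updates" and the unembedding matrix are stand-ins I'm assuming, not a real model): read the residual stream out through the unembedding after each layer and watch the next-token distribution shift, in the spirit of the "logit lens".

    # Toy sketch: project the residual stream to a next-token distribution per layer.
    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    rng = np.random.default_rng(0)
    d_model, vocab_size = 16, 5
    W_unembed = rng.standard_normal((d_model, vocab_size))
    residual = rng.standard_normal(d_model)          # stream after the embedding

    for layer in range(4):
        residual = residual + 0.5 * rng.standard_normal(d_model)  # stand-in layer update
        probs = softmax(residual @ W_unembed)
        print(f"after layer {layer}: {np.round(probs, 2)}")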