Maybe it's clear to others, but it's certainly not clear to me how exactly transformers - or rather transformer-based LLMs - are operating.
I understand how transformers work, but my mental model is that the transformer is the processor and the LLM is an application that runs on it. After all, transformers can be trained to do lots of things, and what a transformer learns when trained with a "predict next word" LLM objective is going to differ from what it learns (and hence how it operates) in a different setting.
There have been various LLM interpretation papers analyzing aspects of them, such as the discovery of pairs of attention heads in consecutive layers acting as "search and copy" "induction heads" (a toy sketch of that behaviour follows below), and analyses of the linear layers as key-value stores. That perhaps suggests another weak abstraction: the linear layers store knowledge and perhaps the reasoning "program", with the attention layers being the mechanism that is programmed to do the data tagging/shuffling?
No doubt there's a lot more to be discovered about how these LLMs are operating - perhaps a wider variety of primitives built out of attention heads beyond just induction heads? It seems a bit early to be building a high-level model of the primitives these LLMs have learnt, and I'm not sure whether attempting a crude transformer-level model really works given that the residual context is additive - it's not just tokens being moved around.
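To make the "search and copy" idea concrete, here is a minimal sketch - a plain-Python toy of my own, not a real model or anyone's published code - of the behaviour an induction head is thought to implement: find where the current token occurred earlier in the context and copy whatever followed it.

    # Toy sketch (assumed, illustrative) of induction-head behaviour:
    # "find the previous occurrence of the current token, copy its successor".
    def induction_predict(tokens):
        current = tokens[-1]
        # search earlier positions, most recent first
        for i in range(len(tokens) - 2, -1, -1):
            if tokens[i] == current and i + 1 < len(tokens):
                return tokens[i + 1]   # copy the token that followed last time
        return None                    # no earlier match: this primitive can't help

    print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> "cat"

The real mechanism is two attention heads in consecutive layers doing this with query/key matching rather than string equality, but the input/output behaviour is roughly as above.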
Transformers - and all other ML models - are ways to represent computer programs. You can think of them as a programming language designed to be easy to optimize rather than easy for humans to understand (a toy sketch of that framing is below).
It's not clear to anybody exactly what kind of program structure LLMs have internally. Figuring that out is a major goal for the field of mechanistic interpretability.
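Here is a minimal sketch of that "programming by optimization" framing - a toy I'm assuming for illustration, not anything from the article: instead of writing the rule y = 2x + 1 by hand, gradient descent finds parameters that implement it. A transformer is the same idea with vastly more parameters and structure.

    # Toy sketch: let gradient descent "write" the program y = 2x + 1.
    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.uniform(-1, 1, size=100)
    ys = 2 * xs + 1                              # the target "program"

    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(500):
        pred = w * xs + b
        grad_w = np.mean(2 * (pred - ys) * xs)   # d(MSE)/dw
        grad_b = np.mean(2 * (pred - ys))        # d(MSE)/db
        w, b = w - lr * grad_w, b - lr * grad_b

    print(round(w, 3), round(b, 3))              # ~2.0, ~1.0: the learned "program"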
I think your mental model could be making LLMs seem more confusing than they are. LLMs are stacks of transformer blocks, and generative LLMs typically add a sampling (decoding) step that turns the transformer's output distribution into tokens (a minimal sketch of that step follows below).
Maybe there are useful abstractions for analyzing them, but LLMs are just another deep learning model.
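For what it's worth, the sampling step is small. Here's a minimal sketch (illustrative only - the function name and the made-up logits are my own) of temperature sampling over a transformer's output logits.

    # Toy sketch of the decoding step on top of a transformer's output logits.
    import numpy as np

    def sample_next_token(logits, temperature=0.8, rng=np.random.default_rng()):
        z = np.asarray(logits, dtype=float) / temperature
        z -= z.max()                          # numerical stability
        probs = np.exp(z) / np.exp(z).sum()   # softmax over the vocabulary
        return rng.choice(len(probs), p=probs)

    vocab = ["the", "cat", "sat", "mat"]
    logits = [2.0, 0.5, 1.0, -1.0]            # pretend transformer output
    print(vocab[sample_next_token(logits)])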
The "attention" mechanism (a bit of a misnomer really) is what makes transformers more complex than many other neural nets - data isn't simply flowing through the model from layer to layer, but rather it is being copied and moved around by the attention heads. The "next word" it is generating doesn't even have to be a word it has ever seen before - it may be copying it from the prompt.
This has been built on extensively over the past two years. For instance: Tighter Bounds on the Expressivity of Transformer Encoders https://arxiv.org/abs/2301.10743. I find it interesting that transformers can be characterized in terms of first-order logic with counting quantifiers. Amazing what you can do even if you're not Turing complete!
Maybe, but a CPU is also Turing-complete, yet a (for example) sort program running on a CPU is just a sort program. The functionality of an LLM is defined by whatever it learnt during its (dataset-specific) training, even if that includes in-context and one-shot "learning".
You could train a Turing-complete transformer to do a different task than running an LLM, but once you've trained it to run/be an LLM, then that is what it is.
A CPU is a finite state machine, so adding an unbounded tape trivially makes it a theoretical TC machine.
The arbitrary-precision requirements on the activation functions and positional encodings are there to keep attention's dynamic reweighting values within the computable set.
Since even multi-layer neural networks produce their curves from the shifting, reflection, and summing of line segments, the results of those operations may not map to representable numbers - even given unbounded digits - when using typical activation functions.
Using an activation function that keeps results within aleph-nought - a countable infinity - is what allows it to be TC.
Probably Approximately Correct (PAC) learning is intentionally fuzzy.
The occasional gradient-loss problem with ReLU is possibly a lens to think about this through.
But the success of statistical learning over the past 30 years has been largely about having existential quantifiers with acceptable training loss - following the very useful idea from statistics that all models are wrong but some are useful.
Transformer models will most definitely be useful for some problems, but assuming that a TC result which holds only for a physically unrealizable configuration will carry over in practice will lead to wasted effort.
Simply acknowledging the potential dead ends of a technology helps not only with choosing the right path but also with recognizing early that you need to change course.
IMHO, the method of the paper in this post is far more useful as a lens for intuition.
No, they are actually very limited formally. For example you can't model a language of nested brackets to arbitrary depth (as you can with an RNN). That makes it all the more interesting that they are so successful.
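For contrast, here's a minimal sketch (a toy of my own, illustrative only) of why recurrence handles arbitrary nesting: a single running counter carried across the sequence is enough state, and that's exactly the kind of thing an RNN's hidden state can maintain over unbounded lengths.

    # Toy sketch: a running counter - RNN-style state - recognizes balanced brackets.
    def balanced(brackets):
        depth = 0                        # the "hidden state" carried across steps
        for ch in brackets:
            depth += 1 if ch == "(" else -1
            if depth < 0:                # closed more than we opened
                return False
        return depth == 0                # everything opened was closed

    print(balanced("((()())())"))        # True
    print(balanced("(()"))               # False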
Being technically (maybe) Turing complete doesn't mean we know how to program it usefully.
https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/

To be completely fair, the Transformer architecture does not map neatly into being analysed like automata and categorised in the Chomsky Hierarchy. "Neural Networks and the Chomsky Hierarchy" trains different architectures on formal languages curated from different levels of the Chomsky hierarchy.
There is also a more interactive version if you want to challenge yourself: a Python notebook of interactive puzzles for building an adder with transformers.
Is it possible to describe important science (or business) concepts without repurposing overtly religious terms like "angels", "bible", "satan", etc.?
This is cool, but I think a more fundamental primitive is the probability distribution over next tokens and how that changes with each layer's computation.
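A minimal sketch of that per-layer view (toy numbers throughout - the random "layer updates" and the unembedding matrix are stand-ins I'm assuming, not a real model): read the residual stream out through the unembedding after each layer and watch the next-token distribution shift, in the spirit of the "logit lens".

    # Toy sketch: project the residual stream to a next-token distribution per layer.
    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    rng = np.random.default_rng(0)
    d_model, vocab_size = 16, 5
    W_unembed = rng.standard_normal((d_model, vocab_size))
    residual = rng.standard_normal(d_model)          # stream after the embedding

    for layer in range(4):
        residual = residual + 0.5 * rng.standard_normal(d_model)  # stand-in layer update
        probs = softmax(residual @ W_unembed)
        print(f"after layer {layer}: {np.round(probs, 2)}")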