I think your mental model could be making LLMs seem more confusing than they are. An LLM is a stack of transformer layers, and a generative LLM pairs that stack with a sampling step that picks the next token from the model's output distribution.
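To make the sampling part concrete, here's a minimal sketch of what I mean (the tiny vocabulary, the logits, and the temperature value are all made up for illustration, not from any real model):

```python
import numpy as np

# Toy logits the transformer stack might produce for a tiny vocabulary
# (made-up numbers, just to show the sampling step).
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([2.0, 1.0, 0.5, -1.0])

temperature = 0.8  # <1 sharpens the distribution, >1 flattens it
probs = np.exp(logits / temperature)
probs /= probs.sum()  # softmax over the vocabulary

# Draw the next token at random according to those probabilities.
next_token = np.random.choice(vocab, p=probs)
print(next_token)
```

That whole "generation" step is just sampling from a probability distribution the network spits out each step; nothing mysterious is happening there.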
Maybe there are useful abstractions for analyzing them, but LLMs are just another deep learning model.
The "attention" mechanism (a bit of a misnomer really) is what makes transformers more complex than many other neural nets - data isn't simply flowing through the model from layer to layer, but rather it is being copied and moved around by the attention heads. The "next word" it is generating doesn't even have to be a word it has ever seen before - it may be copying it from the prompt.