Nice writeup, but regarding the title -- I find it fascinating how powerful attention really is. There were some tweaks developed, sure, but if I open the Llama 4 code on HuggingFace, it is more or less the same code that I saw there 5 years ago. Despite all the AI hype, we are still just exploiting tech developed in 2015-2020. And despite NeurIPS brandishing 25k papers this year, the innovation rate in deep learning seems to be stagnating.
> There were some tweaks developed, sure, but if I open the Llama 4 code on HuggingFace, it is more or less the same code that I saw there 5 years ago.
This is very much true. It's essentially the very same architecture, just tweaked slightly.
I can take the code I've written that implements the original GPT-2, tweak it very minimally (I don't know, maybe 30~40 lines of code changed?) and get Qwen3, which is a state-of-the-art model released ~3 weeks ago.
This is contrary to what you might conclude from looking at e.g. the HuggingFace code, where every new architecture gets a new multi-thousand-line file. That's just the result of an insane amount of copy-pasting and technical debt (although they've started to clean it up a little bit lately). I have my own custom implementation which can load weights for ~19 different architectures straight off HuggingFace in ~2k lines of code. They aren't really all that different.
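To give a rough sense of what those tweaks look like, here is a minimal NumPy sketch of three swaps that, as I understand it, separate a GPT-2 style block from a Llama/Qwen style one: LayerNorm to RMSNorm, the GELU MLP to a gated SiLU ("SwiGLU") MLP, and learned position embeddings to rotary embeddings (RoPE). This is not my actual implementation and not the HuggingFace code. The function names, shapes, and default values are assumptions picked for illustration, and it leaves out other differences such as grouped-query attention.

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-5):
    # GPT-2 style: center, scale by the standard deviation, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * weight + bias

def rms_norm(x, weight, eps=1e-6):
    # Llama/Qwen style: no mean subtraction and no bias term.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * weight

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    return x / (1.0 + np.exp(-x))

def gpt2_mlp(x, w_fc, b_fc, w_proj, b_proj):
    # Two linear layers with a GELU in between.
    return gelu(x @ w_fc + b_fc) @ w_proj + b_proj

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Gated variant: three bias-free projections, SiLU on the gate branch.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def rope(x, base=10000.0):
    # x has shape (seq_len, head_dim) with an even head_dim.
    # Uses the "rotate half" layout; `base` is an illustrative default.
    seq_len, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

In a GPT-2 block you would call `layer_norm` and `gpt2_mlp` and add a learned position embedding to the input; in the newer style you call `rms_norm` and `swiglu_mlp`, and apply `rope` to the query and key vectors of each attention head instead. The attention computation itself stays the same.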
The Llama models are substantially behind the state of the art, particularly when it comes to efficiency, so they’re probably not the best example of adoption of these sorts of techniques.