Nice writeup, but regarding title -- I find it fascinating how powerful attentio...

kouteiheika · 2025-05-23T22:39:39 1748039979

> There were some tweaks developed, sure, but if I open Llama 4 code on HuggingFace, it is more or less the same code that I've seen there 5 years ago.

This is very much true. It's essentially the very same architecture, just tweaked slightly.

I can take the code I've written that implements the original GPT-2, tweak it very minimally (I don't know, maybe 30~40 lines of code changed?) and get Qwen3, which is a state-of-the-art model released ~3 weeks ago.
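To illustrate how small one such tweak can be (this is not the commenter's code, just a sketch), one well-known difference between GPT-2 and Qwen3-style models is the normalization layer: GPT-2 uses LayerNorm, while Qwen3 uses RMSNorm. The pure-Python functions below are named for illustration only and operate on a single vector; the `eps` defaults are assumptions.

```python
import math


def layer_norm(x, weight, bias, eps=1e-5):
    """GPT-2-style LayerNorm on one vector: center, scale to unit variance, then apply weight and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm on one vector: no centering and no bias, just divide by the root-mean-square."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

The two functions differ by only a few lines, which is in the spirit of the "30~40 lines" estimate above, though the real set of changes also touches position encoding, attention, and the MLP.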

Don't be misled by e.g. the HuggingFace code, where every new architecture needs a new multi-thousand-line file: that's just a result of an insane amount of copy-pasting and technical debt (although they've started to clean it up a little bit lately). I have my own custom implementation which can load weights for ~19 different architectures straight off HuggingFace in ~2k lines of code. They aren't really all that different.
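One way a single implementation could cover many architectures is to describe each one as a handful of switches on a shared transformer block. The sketch below is hypothetical: the `ArchConfig` fields, the `ARCHS` table, and the `diff` helper are invented for illustration and are not the commenter's loader or any HuggingFace API. The flag values reflect my understanding of the published GPT-2 and Qwen3 designs and should be checked against the model configs before relying on them.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ArchConfig:
    """Hypothetical switches that distinguish one decoder-only transformer from another."""
    norm: str                       # "layernorm" or "rmsnorm"
    positions: str                  # "learned" absolute embeddings or "rope"
    mlp: str                        # "gelu" or "swiglu"
    grouped_query_attention: bool   # fewer key/value heads than query heads
    qk_norm: bool                   # normalize queries and keys before attention
    linear_bias: bool               # bias terms on the linear projections


# Illustrative table; the keys are just labels chosen for this sketch.
ARCHS = {
    "gpt2": ArchConfig("layernorm", "learned", "gelu", False, False, True),
    "qwen3": ArchConfig("rmsnorm", "rope", "swiglu", True, True, False),
}


def diff(a, b):
    """Return the switches on which two architectures disagree, as {field: (a_value, b_value)}."""
    ca, cb = ARCHS[a], ARCHS[b]
    return {
        f.name: (getattr(ca, f.name), getattr(cb, f.name))
        for f in fields(ArchConfig)
        if getattr(ca, f.name) != getattr(cb, f.name)
    }
```

Calling `diff("gpt2", "qwen3")` lists every switch in this toy table, which is the point: the differences are a short list of component swaps on the same overall structure, not a new design.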

kjkjadksj · 2025-05-23T21:42:15 1748036535

Too many horseriders, not enough horse breeders.

teleforce · 2025-06-02T22:28:50 1748903330

Nice analogy, most probably going to borrow it.

danpalmer · 2025-05-24T00:01:34 1748044894

The Llama models are substantially behind the state of the art, particularly when it comes to efficiency, so they’re probably not the best example of the adoption of these sorts of techniques.