I'm pretty sure anyone finetuning Llama on a regular basis now is using https://github.com/unslothai/unsloth, so comparisons should be against that. The open source version is ~2x faster than default implementations. NVIDIA only, although the kernels are in Triton so they might be portable.
I remember seeing them on HN when they first started! I never understood what price you pay: how did they get such a big speedup and lower memory usage?
Nice work in gemma-bugs. Compared to plenty of research work that is a km deep in real math, this tech note is just a few Python tweaks. But finding those and doing it? Apparently this is useful and they did it. Easy to read (almost child-like) writeup. Thx for pointing to this.