That’s pretty cool. :) One thing I don’t get is why do multiple operations when ...

compilade · 2025-04-17T14:34:57 1744900497

Because lookup tables are not necessarily faster than 8-bit SIMD operations, at least when implemented naïvely.

Lookup tables can be fast, but they are not simpler; see T-MAC: https://arxiv.org/abs/2407.00088 (Note that all of its comparisons with `llama.cpp` were made before I introduced the types from https://github.com/ggml-org/llama.cpp/pull/8151, where the 1.6-bit type uses the techniques described in the aforementioned blog post.)

I wanted to try without lookup tables to at least have a baseline, and also because the fixed point packing idea lent itself naturally to using multiplications by powers of 3 when unpacking.
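For readers who want to see the idea concretely, here is a minimal sketch of fixed-point packing of 5 ternary digits into one byte, with unpacking done by repeated multiplication by 3. It is an illustration only, not code taken from `llama.cpp`: the function names `pack_trits` and `unpack_trits` are hypothetical, and the round-up step when packing is an assumption chosen so that the round trip is exact.

```python
def pack_trits(trits):
    """Pack 5 ternary digits (each 0, 1 or 2) into a single byte.

    The digits are read as a base-3 number v in 0..242, then stored as
    the fixed-point fraction v/243 scaled to 8 bits, rounded up.
    (Hypothetical helper; the rounding choice is an assumption.)
    """
    assert len(trits) == 5 and all(t in (0, 1, 2) for t in trits)
    v = 0
    for t in trits:
        v = v * 3 + t  # most significant digit first
    return (v * 256 + 242) // 243  # ceil(v * 256 / 243), fits in 0..255


def unpack_trits(q):
    """Recover the 5 ternary digits from a byte made by pack_trits.

    Each multiplication by 3 shifts the next digit into the bits above
    the low byte; the low byte keeps the remaining fraction.
    """
    trits = []
    for _ in range(5):
        q *= 3
        trits.append(q >> 8)  # integer part is the next digit
        q &= 0xFF             # keep only the fractional part
    return trits
```

For example, `unpack_trits(pack_trits([2, 0, 1, 1, 0]))` gives back `[2, 0, 1, 1, 0]`. The loop uses only multiplications, shifts and masks, with no table lookup. A SIMD version could in principle apply the same steps to many bytes at once.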

taneq · 2025-04-18T10:34:24 1744972464

Thanks for taking the time to reply! I haven’t done any serious low-level optimisation on modern CPUs, so most of my intuitions are probably way out of date.