
That’s pretty cool. :) One thing I don’t get is why do multiple operations when a 243-entry lookup table would be simpler and hopefully faster?


Because lookup tables are not necessarily faster than 8-bit SIMD operations, at least when implemented naïvely.

Lookup tables can be fast, but they're not simpler; see T-MAC (https://arxiv.org/abs/2407.00088). (Note that all comparisons with `llama.cpp` there were made before I introduced the types from https://github.com/ggml-org/llama.cpp/pull/8151, where the 1.6-bit type uses the techniques described in the aforementioned blog post.)

I wanted to try without lookup tables to at least have a baseline, and also because the fixed-point packing idea lent itself naturally to unpacking via multiplications by powers of 3.
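Roughly, the idea is this (a scalar sketch, not the actual llama.cpp code, which does the same thing with SIMD over many bytes at once): 5 trits form a base-3 number in 0..242, which is rescaled into an 8-bit fixed-point fraction so that each multiplication by 3 pushes the next trit into the bits above the byte, where it can be masked out.

```python
from itertools import product

def pack5(trits):
    """Pack 5 trits (each 0, 1, or 2) into one byte.

    The base-3 value x (0..242) is rescaled by 256/243 with
    ceiling rounding, turning it into a fixed-point fraction
    0.d0d1d2d3d4 (base 3) in units of 1/256.
    """
    x = 0
    for t in trits:
        x = x * 3 + t            # base-3 integer, 0..242
    return (x * 256 + 242) // 243  # ceil(x * 256 / 243), fits in a byte

def unpack5(b):
    """Recover the 5 trits using only multiplications by 3.

    Multiplying the fixed-point fraction by 3 shifts the next
    base-3 digit into bits 8..9; masking with 0xFF drops it.
    """
    out = []
    for _ in range(5):
        b *= 3
        out.append(b >> 8)       # the trit that overflowed past the byte
        b &= 0xFF                # keep the fractional remainder
    return out

# Round-trips exactly for all 3^5 = 243 combinations:
assert all(unpack5(pack5(t)) == list(t) for t in product((0, 1, 2), repeat=5))
```

No division or modulo is needed on the unpacking side, which is the hot path; this is what makes the scheme friendly to 8-bit SIMD multiplies.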


Thanks for taking the time to reply! I haven't done any serious low-level optimisation on modern CPUs, so most of my intuitions are probably way out of date.



