I wanted to try it without lookup tables first, partly to have a baseline and partly because the fixed-point packing idea lends itself naturally to unpacking with multiplications by powers of 3.
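To make that concrete, here is a minimal sketch of table-free unpacking, assuming five ternary digits packed into one byte as an 8-bit fixed-point fraction. The digit count, the function names, and the round-up step when packing are assumptions for illustration, not details taken from the code under discussion.

```python
def pack_trits(trits):
    """Pack five ternary digits (each 0, 1 or 2) into one byte.

    Assumption: the byte holds value/243 as an 8-bit fixed-point
    fraction, rounded up so that unpacking never undershoots a digit.
    """
    assert len(trits) == 5 and all(0 <= t <= 2 for t in trits)
    value = 0
    for t in trits:  # most significant digit first
        value = value * 3 + t
    return (value * 256 + 242) // 243  # ceil(value * 256 / 243)


def unpack_trits(byte):
    """Recover all five digits by repeatedly multiplying by 3."""
    trits = []
    frac = byte
    for _ in range(5):
        frac *= 3
        trits.append(frac >> 8)  # the bits above the low byte are the digit
        frac &= 0xFF             # keep the fractional part for the next digit
    return trits


def unpack_trit_at(byte, i):
    """Recover only digit i (0 = most significant) using a power of 3."""
    frac = (byte * 3 ** i) & 0xFF  # skip the first i digits in one multiply
    return (frac * 3) >> 8
```

The second unpacking function is the form that would map onto independent multiplies by 1, 3, 9, 27 and 81, so each digit can be extracted without waiting on the previous one.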
Thanks for taking the time to reply! I haven’t done any serious low-level optimisation on modern CPUs, so most of my intuitions are probably way out of date.