Looks like SIMD implementations that use LUTs should favor small tables that fit in registers and use `vpermi2pd` for the lookups, rather than larger tables plus gather.
With 64-bit elements in 512-bit registers, you still get a LUT size of 16 (the shuffle indexes into two 8 x double vectors), which can be good enough for functions like log and exp.
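To make the 16-entry lookup concrete, here is a minimal scalar model in Python of how a two-source permute like `vpermi2pd` selects elements for 8 doubles per vector: the low 3 bits of each index pick the lane, and bit 3 picks which of the two table vectors to read from. The names `permutex2var_pd`, `TABLE`, and `lut16` are made up for illustration, and the table contents (`2**(j/16)`) are just an assumed example.

```python
def permutex2var_pd(a, idx, b):
    """Scalar model of a 512-bit two-source permute on 8 doubles per vector.

    For each index, bits 0..2 select the lane and bit 3 selects the source
    vector (0 -> a, 1 -> b). Higher index bits are ignored.
    """
    assert len(a) == len(b) == len(idx) == 8
    out = []
    for i in idx:
        lane = i & 0b111                 # low 3 bits: element within a vector
        src = b if (i >> 3) & 1 else a   # bit 3: which of the two vectors
        out.append(src[lane])
    return out


# Hypothetical 16-entry table, split across two 8-element "registers".
TABLE = [2.0 ** (j / 16) for j in range(16)]
LO, HI = TABLE[:8], TABLE[8:]


def lut16(indices):
    """Look up 8 table entries at once, one per lane."""
    return permutex2var_pd(LO, indices, HI)
```

The point of the model is that all 16 entries sit in two vector registers, so one lookup touches no memory at all.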
Loading data from random memory locations has become too expensive compared to computation. For log, exp, trigonometry, and similar, I think people rarely use any lookup tables. Instead, they use high-degree polynomial approximations, and for log/exp they exploit the IEEE binary floating-point representation.
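As a rough illustration of the table-free approach, here is a scalar Python sketch of log that pulls the exponent and mantissa straight out of the IEEE 754 bits and then evaluates a polynomial on the reduced argument. The function names are hypothetical, and the polynomial is a plain truncated atanh series rather than the minimax fit a real library would use. It only handles positive normal inputs.

```python
import math
import struct


def split_float(x):
    """Return (m, e) with x == m * 2**e and m in [1, 2), for positive normal x."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    e = ((bits >> 52) & 0x7FF) - 1023                 # unbiased exponent
    m_bits = (bits & ((1 << 52) - 1)) | (1023 << 52)  # mantissa with exponent 0
    m = struct.unpack("<d", struct.pack("<Q", m_bits))[0]
    return m, e


def log_poly(x):
    """log(x) = e*ln(2) + log(m), with log(m) from a truncated series."""
    m, e = split_float(x)
    if m > math.sqrt(2.0):       # recenter m around 1 so |s| stays small
        m *= 0.5
        e += 1
    s = (m - 1.0) / (m + 1.0)    # log(m) = 2*atanh(s)
    s2 = s * s
    # Horner evaluation of 1 + s2/3 + s2**2/5 + ... + s2**6/13
    p = 1.0 / 13.0
    for k in (11, 9, 7, 5, 3, 1):
        p = p * s2 + 1.0 / k
    return e * math.log(2.0) + 2.0 * s * p
```

Everything here is arithmetic plus integer bit operations, which is why this style vectorizes well without touching memory.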
LUTs at least do well in microbenchmarks, but I do worry that they may do comparatively much worse in real code.
That said, that's another advantage of small tables using `vpermi2pd`: the table stays in registers.
The Julia/base implementations of log and exp both use LUTs.
The SIMD AVX-512 implementation of exp used by LoopVectorization.jl will sometimes use the 16-element table.
I experimented with log, but had some difficulty getting accuracy and performance, so the version LoopVectorization.jl currently uses doesn't use a table.
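For a sense of how a 16-element table can be enough for exp, here is a minimal scalar sketch in Python. It is not the LoopVectorization.jl or Julia Base code, just one common way to structure the reduction, with made-up names: write x = (16k + j)·ln(2)/16 + r, look up 2^(j/16) in the table, and cover the small remainder r with a short polynomial (a plain Taylor series here, as an assumption).

```python
import math

LN2 = math.log(2.0)
# Hypothetical 16-entry table of 2**(j/16); in a SIMD version this would
# live in two 8 x double registers.
EXP_TABLE = [2.0 ** (j / 16) for j in range(16)]


def exp_lut16(x):
    """exp(x) = 2**k * 2**(j/16) * exp(r), with |r| <= ln(2)/32."""
    n = round(x * (16.0 / LN2))   # nearest multiple of ln(2)/16
    j = n & 15                    # table index (low 4 bits)
    k = n >> 4                    # power of two (floor division by 16)
    r = x - n * (LN2 / 16.0)      # small remainder
    # Horner evaluation of the degree-6 Taylor polynomial of exp(r)
    p = 1.0 / 720.0
    for c in (1.0 / 120.0, 1.0 / 24.0, 1.0 / 6.0, 0.5, 1.0, 1.0):
        p = p * r + c
    return math.ldexp(EXP_TABLE[j] * p, k)
```

Because the table shrinks the range of r by a factor of 16, the polynomial can be several degrees shorter than a table-free version would need for the same accuracy, which is the trade being made.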