
Looks like SIMD implementations that use LUTs should favor small tables that fit in registers and use `vpermi2pd` for the lookups, rather than larger tables plus gather.

With 64-bit elements, you still get a LUT size of 16 (the shuffle indexes into two 8 x double vectors), which can be good enough for functions like log and exp.
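To make the 16-entry case concrete, here is a minimal sketch in C. The function name `lut16_lookup8` and the array-based interface are made up for illustration. `_mm512_permutex2var_pd` is the intrinsic that compilers lower to `vpermi2pd` or `vpermt2pd`. When AVX-512F is not enabled at compile time, the sketch falls back to a scalar loop with the same semantics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

/* Hypothetical helper: fetch 8 entries at once from a 16-entry table of
   doubles. Each idx[i] must be in 0..15. */
static void lut16_lookup8(const double table[16], const int64_t idx[8],
                          double out[8])
{
    /* Debug-only range check; compiled out with -DNDEBUG. */
    for (size_t i = 0; i < 8; ++i)
        assert(idx[i] >= 0 && idx[i] < 16);

#ifdef __AVX512F__
    /* The whole table lives in two 512-bit registers. */
    __m512d lo = _mm512_loadu_pd(table);      /* entries 0..7  */
    __m512d hi = _mm512_loadu_pd(table + 8);  /* entries 8..15 */
    __m512i vi = _mm512_loadu_si512((const void *)idx);
    /* Per lane: bit 3 of the index selects lo or hi,
       bits 2:0 select the element within that register. */
    __m512d r = _mm512_permutex2var_pd(lo, vi, hi);
    _mm512_storeu_pd(out, r);
#else
    /* Scalar fallback with the same result. */
    for (size_t i = 0; i < 8; ++i)
        out[i] = table[idx[i] & 15];
#endif
}
```

In a real kernel the two table loads would be hoisted out of the hot loop so the table stays in registers, which is the whole point compared to a gather.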




Loading data from random memory locations has become too expensive compared to computation. For log, exp, trigonometry, and similar functions, I think people rarely use lookup tables anymore. Instead, they use high-degree polynomial approximations, and for log/exp they exploit the IEEE binary floating-point representation.
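A rough sketch of that approach for log, in C: pull the exponent straight out of the IEEE 754 bits, reduce the mantissa to a narrow range around 1, then evaluate a polynomial. The name `log_sketch` is made up, the coefficients are the plain atanh series rather than tuned minimax coefficients, and only positive normal finite inputs are handled, so treat it as an illustration and not as what any particular libm does.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: natural log for positive, normal, finite doubles. */
static double log_sketch(double x)
{
    /* Zero, negatives, subnormals, infinities and NaN are not handled. */
    assert(x >= DBL_MIN && x <= DBL_MAX);

    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* reinterpret the double as an integer */

    /* Unbiased exponent: x = m * 2^e with m in [1, 2). */
    int64_t e = (int64_t)((bits >> 52) & 0x7FF) - 1023;

    /* Keep the 52 mantissa bits and force the exponent field to 0. */
    bits = (bits & 0x000FFFFFFFFFFFFFULL) | 0x3FF0000000000000ULL;
    double m;
    memcpy(&m, &bits, sizeof m);

    /* Re-center m around 1 so the polynomial argument stays small. */
    if (m > 1.4142135623730951) {
        m *= 0.5;
        e += 1;
    }

    /* log(m) = 2*atanh(s) with s = (m-1)/(m+1), |s| < 0.172.
       Truncated series 2*s*(1 + z/3 + z^2/5 + ... + z^6/13), z = s*s,
       evaluated with Horner's scheme. */
    double s = (m - 1.0) / (m + 1.0);
    double z = s * s;
    double p = 1.0 / 13.0;
    p = p * z + 1.0 / 11.0;
    p = p * z + 1.0 / 9.0;
    p = p * z + 1.0 / 7.0;
    p = p * z + 1.0 / 5.0;
    p = p * z + 1.0 / 3.0;
    p = p * z + 1.0;

    /* log(x) = e*ln(2) + log(m) */
    return (double)e * 0.6931471805599453 + 2.0 * s * p;
}
```

There are no table loads anywhere, and apart from the one range-reduction branch (which becomes a blend in SIMD) it is straight-line arithmetic, so it vectorizes without gathers.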

Here's a log() function from the standard library in OpenBSD: https://github.com/openbsd/src/blob/master/lib/libm/src/e_lo...


LUTs at least do well in microbenchmarks, but I do worry that they may do comparatively much worse in real code, where the table has to compete with everything else for cache. That said, that's another advantage of small tables using `vpermi2pd`: the table stays in registers.

The Julia Base implementations of log and exp both use LUTs. The SIMD AVX512 implementation of exp used by LoopVectorization.jl will sometimes use the 16-element table. I experimented with a table for log, but had some difficulty getting both accuracy and performance, so the version LoopVectorization.jl currently uses doesn't use a table.


BTW, since you're apparently working on stuff like that, check out this repository:

https://github.com/Const-me/AvxMath/blob/master/AvxMath/AvxM...

The license is MIT, so it's copy-paste friendly. It doesn’t use AVX512 though, only AVX1 and optionally AVX2.



