Heh?
Surely a fast convert from 8-bit int to 16-bit FP, then rcp+mul/div, is a no-brainer?
Edit: make that fast convert, rcp, fma (FP16 constant 1.0) and xor (same constant).
Unfortunately none of the hardware used for testing supports FP16 arithmetic. Between Intel and AMD, the only platform that supports AVX512-FP16 is currently Sapphire Rapids.
I tried a similar approach with 32-bit FP before, and the problem is that the fast conversion is only fast in terms of latency. Throughput-wise, it takes two uops instead of one, so in the end a plain float<->int conversion wins.
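For anyone unfamiliar with the trick, here is a scalar sketch of one common "fast" int-to-float conversion for 32-bit FP. I'm assuming the magic-number variant here, which may not be exactly what was tried above; the function names are made up for illustration. The integer is ORed into the low mantissa bits of 2^23 and then 2^23 is subtracted. In a vector version that would typically be a bitwise OR plus a subtract, which fits the two-uop cost mentioned above, though the exact uop counts depend on the microarchitecture.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: magic-number conversion of an 8-bit unsigned
   integer to float. 0x4B000000 is the bit pattern of 2^23, where one
   ULP equals 1.0, so ORing x into the low mantissa bits yields the
   value 2^23 + x exactly. */
static float fast_u8_to_float(uint8_t x)
{
    uint32_t bits = UINT32_C(0x4B000000) | x;  /* encodes 2^23 + x */
    float f;
    memcpy(&f, &bits, sizeof f);               /* bit cast without aliasing problems */
    return f - 8388608.0f;                     /* subtract 2^23, leaving x */
}

/* Reference: the plain conversion, a single convert instruction
   on typical targets. */
static float plain_u8_to_float(uint8_t x)
{
    return (float)x;
}
```

Both functions return the same value for every 8-bit input; the difference is only in how many operations each one costs.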