I'd also love to see such comparisons. In SimSIMD, if AVX-512FP16 is available, ...

AlotOfReading · 2024-12-22T22:11:45 1734905505

Put in a couple hours to see how well it worked. The approximately correct form is slightly faster than just doing an FP division without proper rounding.

    uint8_t magic_divide(uint8_t a, uint8_t b) {
        float x = static_cast<float>(b);
        int i = 0x7eb504f3 - std::bit_cast<int>(x);
        float reciprocal = std::bit_cast<float>(i);
        return a*1.94285123f*reciprocal*(-x*reciprocal+1.43566f);
    }

No errors, ~20 cycles latency on skylake by uiCA. You can optimize it a bit further by approximating the newton raphson step though:

    uint8_t approx_magic_divide(uint8_t a, uint8_t b) {
        float x = static_cast<float>(b);
        int i = 0x7eb504f3 - std::bit_cast<int>(x);
        float reciprocal = std::bit_cast<float>(i);
        reciprocal = 1.385356f*reciprocal;
        return a*reciprocal;
    }

Slightly off when a is a multiple of b, but approximately correct. 15.5 cycles on skylake because of the dependency chain.

AlotOfReading · 2024-12-21T20:14:39 1734812079

Don't have time to figure out the actual numbers myself, but here's an example doing the same for various transcendental functions: https://github.com/J-Montgomery/hacky_float_math/blob/623ee9...

a*(1.xxx/b) = a*(-1*1.xxx*log2(b)), which means you should be able to do a*(fma(b, magic, constant)) with appropriate conversions on either side. Should work in 32 bits for u8s.