Hacker News new | past | comments | ask | show | jobs | submit login

I'd also love to see such comparisons. In SimSIMD, if AVX-512FP16 is available, I use FP16 for most I8/U8 operations, but can't speak about division.

Overall, using F64 directly for I32 division is a well known trick, and it should hold for smaller representations as well.




Put in a couple hours to see how well it worked. The approximately correct form is slightly faster than just doing an FP division without proper rounding.

    uint8_t magic_divide(uint8_t a, uint8_t b) {
        float x = static_cast<float>(b);
        int i = 0x7eb504f3 - std::bit_cast<int>(x);
        float reciprocal = std::bit_cast<float>(i);
        return a*1.94285123f*reciprocal*(-x*reciprocal+1.43566f);
    }
No errors, ~20 cycles latency on skylake by uiCA. You can optimize it a bit further by approximating the newton raphson step though:

    uint8_t approx_magic_divide(uint8_t a, uint8_t b) {
        float x = static_cast<float>(b);
        int i = 0x7eb504f3 - std::bit_cast<int>(x);
        float reciprocal = std::bit_cast<float>(i);
        reciprocal = 1.385356f*reciprocal;
        return a*reciprocal;
    }
Slightly off when a is a multiple of b, but approximately correct. 15.5 cycles on skylake because of the dependency chain.


Don't have time to figure out the actual numbers myself, but here's an example doing the same for various transcendental functions: https://github.com/J-Montgomery/hacky_float_math/blob/623ee9...

a*(1.xxx/b) = a*(-1*1.xxx*log2(b)), which means you should be able to do a*(fma(b, magic, constant)) with appropriate conversions on either side. Should work in 32 bits for u8s.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: