Put in a couple hours to see how well it worked. The approximately correct form is slightly faster than just doing an FP division without proper rounding.
uint8_t magic_divide(uint8_t a, uint8_t b) {
float x = static_cast<float>(b);
int i = 0x7eb504f3 - std::bit_cast<int>(x);
float reciprocal = std::bit_cast<float>(i);
return a*1.94285123f*reciprocal*(-x*reciprocal+1.43566f);
}
No errors, ~20 cycles latency on skylake by uiCA. You can optimize it a bit further by approximating the newton raphson step though:
uint8_t approx_magic_divide(uint8_t a, uint8_t b) {
float x = static_cast<float>(b);
int i = 0x7eb504f3 - std::bit_cast<int>(x);
float reciprocal = std::bit_cast<float>(i);
reciprocal = 1.385356f*reciprocal;
return a*reciprocal;
}
Slightly off when a is a multiple of b, but approximately correct. 15.5 cycles on skylake because of the dependency chain.
a*(1.xxx/b) = a*(-1*1.xxx*log2(b)), which means you should be able to do a*(fma(b, magic, constant)) with appropriate conversions on either side. Should work in 32 bits for u8s.
Overall, using F64 directly for I32 division is a well known trick, and it should hold for smaller representations as well.