How is FP fundamentally different from integers? I've done FPGA programming, and it just seems like the programmer has to decide where and when to do the shifting based on the expected range of the data. I'm not sure how this is "natively supported" in hardware.
If you have designed FPUs you should know that FP computation involves a lot more than just shifting (e.g. rounding, subnormals, and special-value handling). That’s why, for example, CPUs use different hardware blocks for INT vs FP computation.
But that’s not the point. The point is, this particular method to speed up matmul is not suitable for FP.
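To make the "more than just shifting" part concrete, here is a minimal Python sketch, assuming IEEE 754 binary32 as the format. The helper names (`f32_fields`, `classify`, `to_f32`) are made up for illustration; it only shows the cases an FP unit has to tell apart and the rounding it has to apply, not how any real FPU is built.

```python
import struct

def f32_fields(x):
    """Round x to binary32 and split it into (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def classify(x):
    """Name the encoding class of x as a binary32 value."""
    _, exp, frac = f32_fields(x)
    if exp == 0xFF:                      # all-ones exponent: special values
        return "inf" if frac == 0 else "nan"
    if exp == 0:                         # all-zeros exponent: no implicit leading 1
        return "zero" if frac == 0 else "subnormal"
    return "normal"

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

if __name__ == "__main__":
    for v in (1.0, 0.0, 1e-40, float("inf"), float("nan")):
        print(v, classify(v))
    # 2**25 + 1 needs 26 significant bits; binary32 keeps 24, so it gets rounded.
    print(to_f32(2.0**25 + 1))
```

A fixed-point datapath on an FPGA sees none of these cases: every bit pattern is an ordinary number and the programmer picks the shift ahead of time. An FP unit has to detect each class and round the result on every operation.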