BF16 is a pretty big unit in an ASIC. BF16 has an 8-bit exponent and a 7-bit mantissa (8 bits of significand once you count the implicit leading 1), so a multiplier needs roughly 9 * 5 gates to compute the result exponent, an 8-bit barrel shifter for normalization (8*8 + 8*ceil(log2(8)) gates), and an 8-bit significand multiplier (approximately 8 * 8 * 9 gates).
Total ≈ 709 gates. The reality is probably far more, because you're going to want rounding logic, carry-lookahead, and pipelining.
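Back-of-envelope version of that gate arithmetic (note BF16's mantissa is 7 bits, 8 with the hidden bit, so the datapath is a bit narrower than FP16's 10-bit mantissa; the per-bit constants here — 5 gates per adder bit, 9 per multiplier cell — are rough rules of thumb, not synthesis results):

```python
import math

def bf16_mul_gate_estimate(exp_bits=8, sig_bits=8):
    """Rough gate count for a BF16 multiply datapath.
    sig_bits includes the implicit leading 1 (BF16 stores a 7-bit mantissa).
    All constants are back-of-envelope, not synthesis numbers."""
    exp_adder = (exp_bits + 1) * 5                       # exponent add, ~5 gates/bit incl. carry
    shifter = sig_bits * sig_bits + sig_bits * math.ceil(math.log2(sig_bits))
    multiplier = sig_bits * sig_bits * 9                 # array multiplier, ~9 gates/cell
    return exp_adder + shifter + multiplier

print(bf16_mul_gate_estimate())  # ~709 gates for an 8-bit significand
```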
Whereas a 1-bit multiply and add into, say, a 16-bit accumulator uses... 16 gates' worth of full adders! (and probably half that in practice, since you can use scheduling tricks to skip past the zeros, at the expense of variable latency...)
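To make that concrete: with 1-bit weights the "multiply" collapses to keep-or-skip, so the only real hardware left is the accumulator. An illustrative sketch (Python standing in for what would really be HDL; the zero-skipping loop is the scheduling trick mentioned above):

```python
def onebit_dot(activations, weights, acc_bits=16):
    """Dot product with 1-bit weights: each multiply is keep-or-skip,
    so the only arithmetic is the acc_bits-wide accumulator add."""
    acc = 0
    mask = (1 << acc_bits) - 1       # model a 16-bit wrap-around accumulator
    for a, w in zip(activations, weights):
        if w:                        # skip zeros entirely: no work, variable latency
            acc = (acc + a) & mask
    return acc

print(onebit_dot([3, 5, 2, 7], [1, 0, 1, 1]))  # 3 + 2 + 7 = 12
```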
So when 1-bit math uses on the order of 1/50th to 1/100th of the silicon area of 16-bit math, and according to this paper gets the same results, the future is clearly silicon that can do 1-bit math.
I could imagine FPGA designs might be competitive.
And dedicated ASICs would almost certainly beat both by a decent margin.