Is that a limited use case because not many workloads map to INT4, or has this avenue simply not been explored because there weren't any INT4 processors?
My understanding is that during inference, precision is often not critical, and that some workloads even work with 1 bit?
NN quantization has been an area of active research in the last 3 years, but it's not trivial when going to 4 bits or below. Usually, to achieve good accuracy during inference, a model needs to be trained or finetuned at low precision. Simple post-training conversion usually won't work (it does not always work even at 8 bits). Models that are already efficient (e.g. MobileNet) are harder to quantize than fat, overparameterized models such as AlexNet or VGG. Increasing model size (number of neurons or filters) helps, but obviously it offsets the gains in efficiency to some degree. Recurrent architectures are also harder to quantize.
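To see why post-training conversion degrades at low bit widths, here's a minimal sketch (my own illustration, not from any of the papers discussed) of symmetric linear quantization applied to a random weight tensor. With a single scale factor, INT4 leaves only 15 representable levels, so the rounding error grows noticeably compared to INT8:

```python
import numpy as np

def quantize(w, bits):
    # Symmetric linear quantization: map floats onto integers in
    # [-(2**(bits-1) - 1), 2**(bits-1) - 1] using one scale per tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Map the integers back to floats; the gap to the original
    # weights is the quantization error.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)

for bits in (8, 4):
    q, s = quantize(w, bits)
    err = np.abs(dequantize(q, s) - w).mean()
    print(f"INT{bits}: mean abs reconstruction error = {err:.4f}")
```

Real schemes mitigate this with per-channel scales, clipping of outliers, and (as noted above) quantization-aware training, but the basic trade-off stays: fewer bits, coarser grid, larger error.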
See Table 6 in [1] to get an idea of the accuracy drop from quantization: 4 bits seems to cost about 1% accuracy, which is pretty good. However, as you can tell from the methods they used to get there, it's not easy.