
Sure, I guess I meant: if the task can be run with 4-bit math...


Google TPU does not have 4 bit ALUs.


I don’t think that negates my point? If you have tensor ops that can get away with 4-bit precision, this is a great chip for you.


Yes, on INT4 tasks this chip will be faster than TPU. That's a fairly rare use case.


Is that a limited use case because not many workloads map to INT4, or has this avenue simply not been explored because there weren't any INT4 processors?

My understanding is that during inference, precision is often not critical, and that some workloads even work with 1 bit?
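For intuition on the 1-bit case: the usual trick (as in the XNOR-Net line of work) is to keep only the sign of each weight plus a single per-tensor float scale. This is a minimal illustrative sketch, not code from any particular library:

```python
import numpy as np

def binarize(w):
    """1-bit weight quantization sketch: keep only the sign of each
    weight, plus one float scale alpha = mean(|w|) for the tensor."""
    alpha = float(np.mean(np.abs(w)))
    return np.sign(w), alpha

# Toy weight matrix standing in for a trained layer.
rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64)).astype(np.float32)

b, alpha = binarize(w)
w_hat = alpha * b  # the approximation actually used at inference time
```

The multiply-accumulate against `b` then reduces to sign flips and additions, which is where the hardware win comes from.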


NN quantization has been an area of active research over the last 3 years, but it's not trivial when going to 4 bits or below. To achieve good accuracy during inference, a model usually needs to be trained or fine-tuned at low precision; simple post-training conversion usually won't work (it doesn't always work even at 8 bits). Models that are already efficient (e.g. MobileNet) are harder to quantize than fat, overparameterized models such as AlexNet or VGG. Increasing model size (number of neurons or filters) helps, but obviously it offsets the efficiency gains to some degree. Recurrent architectures are harder to quantize.

See Table 6 in [1] to get an idea of the accuracy drop from quantization: it suggests 4 bits results in about a 1% degradation, which is pretty good. However, as you can tell from the methods they used to get there, it's not easy.

[1] https://arxiv.org/abs/1807.10029



