Is that a limited use case because not many workloads map to INT4, or has this avenue simply not been explored because there weren't any INT4 processors?
My understanding is that during inference, precision is often not critical, and that some workloads even work with 1 bit?
NN quantization has been an area of active research in the last 3 years, but it's not trivial when going to 4 bits or below. Usually, to achieve good accuracy during inference, a model needs to be trained or finetuned at low precision. Simple post-training conversion usually won't work (it does not always work even at 8 bits). Models that are already efficient (e.g. MobileNet) are harder to quantize than fat, overparameterized models such as AlexNet or VGG. Increasing model size (number of neurons or filters) helps, but obviously it offsets the gains in efficiency to some degree. Recurrent architectures are also harder to quantize.
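To see why post-training conversion degrades at low bit widths, here's a minimal sketch (my own illustration, not from any of the papers discussed) of symmetric linear quantization applied to a random weight tensor. With a single scale factor, INT4 leaves only 15 representable levels, so the rounding error grows noticeably compared to INT8:

```python
import numpy as np

def quantize(w, bits):
    # Symmetric linear quantization: map floats onto integers in
    # [-(2**(bits-1) - 1), 2**(bits-1) - 1] using one scale per tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Map the integers back to floats; the gap to the original
    # weights is the quantization error.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)

for bits in (8, 4):
    q, s = quantize(w, bits)
    err = np.abs(dequantize(q, s) - w).mean()
    print(f"INT{bits}: mean abs reconstruction error = {err:.4f}")
```

Real schemes mitigate this with per-channel scales, clipping of outliers, and (as noted above) quantization-aware training, but the basic trade-off stays: fewer bits, coarser grid, larger error.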
See Table 6 in [1] to get an idea of the accuracy drop from quantization: 4 bits seems to cost about 1% accuracy, which is pretty good. However, as you can tell from the methods they used to get there, it's not easy.