Intuitively I've always been a bit skeptical of quantization. Wouldn't there be some loss in precision? I could imagine the error function increasing when you use these kinds of techniques.
John Carmack pointed out (and I learned it here on HN) that what training really needs is the *sign* of each parameter's gradient. I.e., you can quantize gradients to -1, 0, and 1 and still have the neural network learn much of the dataset.
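For a toy least-squares problem, sign-only updates look roughly like this (a minimal NumPy sketch in the spirit of signSGD, not Carmack's actual code):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 8))
    true_w = rng.normal(size=8)
    y = X @ true_w

    w = np.zeros(8)
    lr = 0.01
    for step in range(2000):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # full-precision MSE gradient
        w -= lr * np.sign(grad)                # keep only the sign: -1, 0, +1
    print(np.abs(w - true_w).max())            # ends up within a few multiples of lr

The updates oscillate around the optimum instead of converging exactly, but the sign alone drives the weights most of the way there.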
Why isn't John Carmack working for OpenAI? Hell, why did he waste years at Meta to work on a VR headset and NOT AI? He even announced he wants to focus on AGI but he missed out on literally all the action.
That game engine was over 3 decades ago! John is one of the sharpest minds I've ever seen. If he's passionate about AGI, he surely has a much deeper understanding of what he's doing than the AI trendies on social media.
> It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made.
It does increase the “error” (meaning the model is less likely to predict the next word correctly when compared against a dataset), but the degradation is smaller than your intuition would lead you to believe.
Quantization does reduce the quality of the outputs. But the point is that you save enough memory that you can cram a larger model into the same hardware, and that more than compensates for the lost precision.
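Quick back-of-the-envelope memory math on why (the parameter counts here are just illustrative):

    def weight_bytes(n_params, bits_per_weight):
        return n_params * bits_per_weight / 8

    gib = 1024 ** 3
    print(weight_bytes(13e9, 16) / gib)  # 13B params at fp16:  ~24.2 GiB
    print(weight_bytes(13e9, 4) / gib)   # 13B params at 4-bit:  ~6.1 GiB
    print(weight_bytes(30e9, 4) / gib)   # 30B params at 4-bit: ~14.0 GiB

So the weights of a 4-bit 30B model take less memory than those of an fp16 13B model, which is the trade being made.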
Yes, each weight will not be able to "learn" as much if it has fewer bits of precision. But the idea is that you can use more weights, and the big question is whether these low-precision weights make the model more accurate as a whole.
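To make the per-weight precision loss concrete, here's a rough sketch of symmetric low-bit quantization (a generic scheme, not any particular library's implementation):

    import numpy as np

    def quantize(w, bits=4):
        qmax = 2 ** (bits - 1) - 1             # e.g. 7 for signed 4-bit
        scale = np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax)
        return q.astype(np.int8), scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(0).normal(size=1024).astype(np.float32)
    q, scale = quantize(w, bits=4)
    err = np.abs(dequantize(q, scale) - w)
    print(err.mean(), np.abs(w).mean())        # rounding error vs. typical weight magnitude

The rounding error per weight is on the order of half the quantization step, which is exactly the precision each individual weight gives up.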