In NN training, you often get better performance at lower cost by increasing the number of parameters than by increasing parameter precision. I'm not sure we actually know why, but that's usually the case.
E.g., a 7B model at FP32 will typically perform worse than a 14B model at BF16, all else being equal, even though both need the same amount of weight memory.
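To put rough numbers on that (a back-of-the-envelope sketch in Python, counting raw weight storage only and ignoring activations, KV cache, and optimizer state): at a fixed weight-memory budget, halving the precision lets you double the parameter count.

```python
# Illustrative weight-memory arithmetic for the 7B-vs-14B example above.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int4": 0.5}

def weight_memory_gb(n_params_billion: float, dtype: str) -> float:
    """Raw parameter storage in GB for a model of the given size and dtype."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

print(f"7B  @ FP32: {weight_memory_gb(7, 'fp32'):.0f} GB")   # 28 GB
print(f"14B @ BF16: {weight_memory_gb(14, 'bf16'):.0f} GB")  # 28 GB -> same budget, 2x the parameters
print(f"14B @ INT4: {weight_memory_gb(14, 'int4'):.0f} GB")  # 7 GB  -> fits a consumer GPU
```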
FP32 -> BF16 is really good, because modern GPUs are far faster at BF16 multiplications than at FP32 ones, not to mention the halved memory and memory-bandwidth requirements.

4-bit quants are much more of a mixed bag. You get the memory savings, which often means the difference between a cheap consumer GPU and a much more expensive datacenter GPU, or between a single GPU and a node with multiple GPUs and all the interconnect that entails. But your raw speed won't improve as dramatically, because the weights still have to be dequantized back to BF16 to do the actual computation. At BF16 you'd probably be memory-bandwidth-bound anyway, so moving 4-bit weights around does buy you some effective throughput, but the gain is nowhere near as dramatic as the memory savings.
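To make the dequantize-to-BF16 point concrete, here's a minimal, unfused sketch (PyTorch assumed; the symmetric scheme and block size of 64 are just illustrative choices, and real kernels such as the ones in bitsandbytes or llama.cpp fuse the dequantize into the matmul): the weights live in ~4 bits, but the multiply itself still happens in BF16, so the FLOPs don't get cheaper, you just move fewer bytes.

```python
import torch

BLOCK = 64  # per-block scaling; the exact block size here is an illustrative assumption

def quantize_int4(w: torch.Tensor):
    """Symmetric per-block 4-bit quantization of a weight tensor (codes kept in int8 for simplicity)."""
    w = w.reshape(-1, BLOCK)
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0                  # int4 range is [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)    # would be bit-packed in a real kernel
    return q, scale

def dequantize_to_bf16(q: torch.Tensor, scale: torch.Tensor, shape):
    """Expand the 4-bit codes back to BF16 so the GPU's BF16 units can do the matmul."""
    return (q.to(torch.float32) * scale).reshape(shape).to(torch.bfloat16)

w = torch.randn(4096, 4096)                       # a weight matrix
x = torch.randn(1, 4096, dtype=torch.bfloat16)    # an activation vector

q, scale = quantize_int4(w)                       # stored at ~4 bits per weight (plus scales)
w_deq = dequantize_to_bf16(q, scale, w.shape)     # expanded again right before use

y = x @ w_deq.T                                   # the actual FLOPs are still BF16 FLOPs
print(y.shape, w_deq.dtype)
```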