It's interesting that the standard "K" (number of elements with a shared scale) is 32. That seems to imply that the neural network will somehow learn to group weights at those 32-element boundaries.
Does anybody understand how that works? I mean, what is the mechanism that naturally causes the model to group weight scales into those K-element clusters?
There is no mechanism per se, it's more of a bit-budget vs. quality trade-off. If the block size were one, you could think of MX4 with an 8-bit exponent scale as a 12-bit number, an "MX12" with E10M1. Sharing the scale across a block introduces some error per element, and that error grows as the block gets larger. As the block size increases, the effective bits per element go down and the hardware implementation gets smaller and cheaper.
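To make the trade-off concrete, here is a minimal sketch (my own toy version, not the OCP MX spec's exact rounding rules; quantize_mx4 and FP4_VALUES are just names I picked) that quantizes a weight vector to FP4 (E2M1) elements with one shared power-of-two scale per block, then prints the effective bits per element and the reconstruction error as the block size grows:

```python
import numpy as np

# Positive magnitudes representable by an FP4 (E2M1) element.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mx4(x, block_size):
    """Quantize x to FP4 elements with one shared power-of-two scale per block.
    Simplified illustration; real MX rounding/scale rules differ in details."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        amax = np.max(np.abs(block))
        # Shared scale: a power of two chosen so the block's max magnitude
        # lands near FP4's top representable value (6.0).
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
        scaled = block / scale
        # Round each element's magnitude to the nearest FP4 value, keep the sign.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(scaled) * FP4_VALUES[idx] * scale
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
for k in (1, 8, 32, 128):
    rmse = np.sqrt(np.mean((w - quantize_mx4(w, k)) ** 2))
    bits = 4 + 8 / k  # 4 element bits plus the amortized 8-bit shared scale
    print(f"block={k:4d}  bits/elem={bits:6.2f}  rmse={rmse:.4f}")
```

With block size 1 each element effectively costs 4 + 8 = 12 bits, which is the "MX12" framing above; at the standard block size of 32 the shared scale amortizes to 8/32 = 0.25 extra bits, i.e. about 4.25 bits per element, and the per-element error goes up accordingly.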