I think GP's point is that zero-masking ops break false dependencies on the destination register, and that this becomes a more useful tool the more conditionally-executed-by-masking vectorized code you have in an algorithm body. It may also (caveat reader: I am now speculating) be a reason why AVX-512 came with so many damn registers: they're super useful for intermediate/partial results.
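To make the merge-vs-zero distinction concrete, here's a rough scalar model in C. It is only a sketch of the lane semantics, not real SIMD code: `vec512_model`, `add_merge` and `add_zero` are names made up for illustration, and they stand in for what `_mm512_mask_add_epi32` and `_mm512_maskz_add_epi32` do per lane. Note that the merge form takes the old destination as an input and the zero form doesn't, which is where the dependency question comes from.

```c
#include <stdint.h>

#define LANES 16 /* one 512-bit register viewed as 16 x 32-bit lanes */

/* Hypothetical stand-in for a zmm register; not a real intrinsic type. */
typedef struct { uint32_t lane[LANES]; } vec512_model;

/* Merge-masking: inactive lanes keep the destination's previous value,
 * so the old destination (dst_old) is a genuine input to the operation. */
static vec512_model add_merge(vec512_model dst_old, uint16_t k,
                              vec512_model a, vec512_model b)
{
    vec512_model r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = ((k >> i) & 1u) ? a.lane[i] + b.lane[i] : dst_old.lane[i];
    return r;
}

/* Zero-masking: inactive lanes are written as 0, so the result depends
 * only on k, a and b; the previous destination contents are never read. */
static vec512_model add_zero(uint16_t k, vec512_model a, vec512_model b)
{
    vec512_model r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = ((k >> i) & 1u) ? a.lane[i] + b.lane[i] : 0u;
    return r;
}
```

With mask `0x0003`, both forms compute lanes 0 and 1 identically; in the remaining lanes `add_merge` passes through whatever was in `dst_old`, while `add_zero` writes zeros.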
Unfortunately the SysV ABI interferes with compilers allocating upper SIMD registers, since every vector register is call-clobbered. This motivates bigger functions: almost all my intentionally vectorized/vectorizable code is declared inline, and very occasionally I've resorted, reluctantly, to inline asm. Whether the ABI design is actually a mistake, and then how/whether it might be remediated, remains a matter of opinion.
Digression:
The consequence of all this is there's often More Than One Way To Do It. No matter how much mechanical sympathy you might hope to innately possess, that still means punching lots of variants of your code into uiCA/IACA et al. to paint anything like a decent picture of the bottlenecks, as well as doing your damnedest to ensure that any benchmarking of loops/computation you care to perform during development actually corresponds to real execution. The holy grail, viz. writing C or another HLL that auto-vectorizes well on more than one compiler and more than one architecture (because you wanted to support NEON, too, right?), becomes a near-bottomless programmer time sink.
There are real benefits to be had, but given the additional time investment required to obtain them, it's little wonder that AVX-512 is shortchanged on intentional adoption, and that's even before Intel started disabling it on Alder Lake. In the long run, only greater strides in compiler auto-vectorization capabilities will fix this for everyday code.
If it's a false dependency, then it doesn't matter what's in the inactive lanes, and a simple unpredicated instruction will break the dependency just as well as a zero-masking one. Which is exactly what you said the compiler already did.
AVX-512 has 32 registers because Larrabee was built on a Pentium (P5) core, which was in-order: with no register renaming to lean on, it needed more architectural registers. In a real sense, the P5 core dictated much of AVX-512's design.
There isn't a useful way to define a general ABI with callee-saved vector registers without saying something like "only bits [127:0] are saved", because a callee built for a narrower vector width can't save upper bits it doesn't know exist.