Mask with zeroing would solve that. The EVEX prefix supports both merge and zero...

brigade · on June 20, 2023

All that solves is changing the merging instruction from a masked merge to a maskless OR.

colejohnson66 · on June 20, 2023

Zero merging breaks the dependency chain on the destination; all masked-out lanes are set to zero. What do you mean by a "maskless OR"?
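For anyone following along, the difference between the two masking modes can be sketched with a scalar model. This is only an illustration: the 8-lane `vec` type and the names `add_mask_merge` / `add_mask_zero` are made up here and are not real intrinsics. The thing to notice is that the merge variant takes the old destination as an input, while the zeroing variant does not.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* hypothetical lane count, chosen only for the example */

typedef struct { uint32_t lane[LANES]; } vec;

/* Returns 1 if lane i is enabled in mask k. */
static int lane_active(uint8_t k, int i) {
    assert(i >= 0 && i < LANES);
    return (k >> i) & 1;
}

/* Merge-masking model: inactive lanes keep the old destination value,
   so the result depends on dst as well as on a and b. */
static vec add_mask_merge(vec dst, uint8_t k, vec a, vec b) {
    for (int i = 0; i < LANES; i++)
        if (lane_active(k, i))
            dst.lane[i] = a.lane[i] + b.lane[i];
    return dst;
}

/* Zero-masking model: inactive lanes are set to 0, and the old
   destination is not an input at all. */
static vec add_mask_zero(uint8_t k, vec a, vec b) {
    vec r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = lane_active(k, i) ? a.lane[i] + b.lane[i] : 0;
    return r;
}
```

In this model the missing `dst` parameter on the zeroing variant is what stands in for "no dependency on the destination".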

brigade · on June 20, 2023

Zero masking doesn't merge. If you're discarding lanes with a separate merge op, it doesn't matter what the discarded value was.
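To make the "maskless OR" point concrete, here is a minimal scalar sketch, assuming an 8-lane vector and hypothetical helper names (`select_by_blend`, `zero_masked`, `select_by_or`); none of these are real intrinsics. It shows that zero-masking both sides with complementary masks and then combining them with a plain OR gives the same lanes as a single masked merge of two unmasked results.

```c
#include <assert.h>
#include <stdint.h>

#define N 8  /* hypothetical lane count, chosen only for the example */

typedef struct { uint32_t x[N]; } v8;

/* Returns 1 if lane i is enabled in mask k. */
static int active(uint8_t k, int i) {
    assert(i >= 0 && i < N);
    return (k >> i) & 1;
}

/* Variant 1: both sides computed unmasked, then one masked merge. */
static v8 select_by_blend(uint8_t k, v8 then_v, v8 else_v) {
    v8 r;
    for (int i = 0; i < N; i++)
        r.x[i] = active(k, i) ? then_v.x[i] : else_v.x[i];
    return r;
}

/* Model of a zero-masked op: inactive lanes become 0. */
static v8 zero_masked(uint8_t k, v8 v) {
    v8 r;
    for (int i = 0; i < N; i++)
        r.x[i] = active(k, i) ? v.x[i] : 0;
    return r;
}

/* Variant 2: each side zero-masked with complementary masks,
   then combined with an ordinary unmasked OR. */
static v8 select_by_or(uint8_t k, v8 then_v, v8 else_v) {
    v8 t = zero_masked(k, then_v);
    v8 e = zero_masked((uint8_t)~k, else_v);
    v8 r;
    for (int i = 0; i < N; i++)
        r.x[i] = t.x[i] | e.x[i];
    return r;
}
```

Either way there is still one combining step at the end; zeroing only changes which instruction performs it.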

inopinatus · on June 20, 2023

I think GP's point is that zero-masked ops break false dependencies on the destination register, and moreover that this becomes a more useful tool the more conditionally-executed-by-masking vectorized code you have in an algorithm body. It may also (caveat reader: I am now speculating) be a reason why AVX-512 came with so many damn registers, because they're super useful for intermediate/partial results.

Unfortunately the SysV ABI interferes with compilers allocating upper SIMD registers, since they're all call-clobbered. This motivates bigger functions: almost all my intentionally vectorized/vectorizable code is declared inline and very occasionally I've resorted, reluctantly, to inline asm. Whether the ABI design is actually a mistake, and then how/whether it might be remediated, remains a matter of opinion.

Digression:

The consequence of all this is there's often More Than One Way To Do It. No matter how much mechanical sympathy you might hope to innately possess, that still means punching lots of variants of your code into uiCA/IACA et al. to paint anything like a decent picture of the bottlenecks, as well as doing your damnedest to ensure that any benchmarking of loops/computation you care to perform during development actually corresponds to real execution. The holy grail, viz. writing C or another HLL that auto-vectorizes well on more than one compiler and more than one architecture (because you wanted to support NEON, too, right?), becomes a near-bottomless programmer time sink.

There are real benefits to be had, but given the additional time investment required to obtain them, it's little wonder that AVX-512 is shortchanged on intentional adoption, and that's even before Intel started disabling it on Alder Lake. In the long run, only greater strides in compiler auto-vectorization capabilities will fix this for everyday code.

brigade · on June 21, 2023

If it's a false dependency, then it doesn't matter what's in the inactive lanes and a simple unpredicated instruction will break the dependency just as well as a zero masking one. Which is exactly what you said the compiler already did.

AVX-512 has 32 registers because the Pentium core that Larrabee was built on was in-order. In a real sense, the P5 core dictated much of AVX-512's design.

There isn't a useful way to define a general ABI with callee-saved vector registers without saying something like "only bits [127:0] are saved".