Two other things that jumped out at me: VPCONFLICT is 10x as fast, compressstoreu is >10x slower. Those might be enough to warrant a Zen4-specific codepath in Highway.
I benchmarked it on Intel, and it was indeed quite fast, a good improvement over the scalar version. It will be interesting to try that on AMD.
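To make the compressstoreu point concrete, here is a rough sketch of what a CPU-specific path could look like. It assumes the slowdown applies to the memory-destination form of the compress instruction, so it compresses in a register and then does an ordinary full-width store. The function names (CompressStoreScalar, CompressThenStore, CompressStore16) are made up for illustration and are not Highway's API, and the dispatch here only checks for AVX-512F, not for Zen 4 specifically.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_X86_DISPATCH 1
#include <immintrin.h>
#endif

// Scalar reference: packs the lanes whose mask bit is set into `out`
// and returns how many lanes were written.
inline size_t CompressStoreScalar(const int32_t* in, uint16_t mask,
                                  int32_t* out) {
  size_t n = 0;
  for (int i = 0; i < 16; ++i) {
    if (mask & (1u << i)) out[n++] = in[i];
  }
  return n;
}

#ifdef SKETCH_X86_DISPATCH
// Hypothetical alternative to a compressing store: compress in a register,
// then store the whole vector. This writes all 16 lanes (unselected tail
// lanes become zero), so `out` must have room for 16 int32_t.
__attribute__((target("avx512f")))
inline size_t CompressThenStore(const int32_t* in, uint16_t mask,
                                int32_t* out) {
  const __m512i v = _mm512_loadu_si512(in);
  const __m512i packed = _mm512_maskz_compress_epi32(mask, v);
  _mm512_storeu_si512(out, packed);
  return static_cast<size_t>(__builtin_popcount(mask));
}
#endif

// Runtime dispatch: use the AVX-512 path when the CPU supports it,
// otherwise fall back to the scalar loop.
inline size_t CompressStore16(const int32_t* in, uint16_t mask, int32_t* out) {
#ifdef SKETCH_X86_DISPATCH
  if (__builtin_cpu_supports("avx512f")) {
    return CompressThenStore(in, mask, out);
  }
#endif
  return CompressStoreScalar(in, mask, out);
}
```

The catch is that the full-width store overwrites up to 16 lanes past the write position, so the caller needs padding at the end of the output buffer. Whether this actually wins on Zen 4 would need benchmarking.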