(For other readers:) This is what our Highway library does - wrapper functions a...

camel-cdr · 2025-04-26T13:42:52

I quite like highway.

As mentioned, last time I tried vqsort for RVV it was surprisingly slow.

I tried to replicate it yesterday, but noticed that vqsort is now disabled for RVV: https://github.com/google/highway/blob/400fbf20f2e40b984be12...

Does highway support sorting networks for non-128-bit vector registers?

When I tried to compile it for AVX512, the BaseCase seems to only use xmm registers: https://godbolt.org/z/qr9xoTGKn

janwas · 2025-04-26T13:56:38

:) Yes, vqsort recently tickled a bug in clang. I've seen a steady stream of issues, many caused by SLP or the seeming absence of CI. You might try re-enabling it on GCC.

Yes, the issue with the sorting network is that it is limited to 16x16 to reduce code explosion. With uint16_t, XMM registers are sufficient for the 8-column case; your Godbolt link does have some YMM for the 16-column case. After changing the key type to uint32_t, we see ZMM as expected.
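
To make the register-width arithmetic concrete, here is a small sketch. The helpers `RowBits` and `MinRegisterBits` are hypothetical names for illustration only, not part of Highway; the assumption is simply that one row of the network (one key per column) has to fit in a single register.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a Highway API): bits needed to hold one row of
// `columns` keys of type T in a single vector register.
template <typename T>
constexpr std::size_t RowBits(std::size_t columns) {
  return columns * sizeof(T) * 8;
}

// Hypothetical helper: width of the smallest x86 register class that can
// hold `bits` bits (128 = XMM, 256 = YMM, 512 = ZMM).
constexpr std::size_t MinRegisterBits(std::size_t bits) {
  return bits <= 128 ? 128 : (bits <= 256 ? 256 : 512);
}

// 8 columns of uint16_t fit in XMM.
static_assert(MinRegisterBits(RowBits<uint16_t>(8)) == 128, "XMM");
// 16 columns of uint16_t need YMM.
static_assert(MinRegisterBits(RowBits<uint16_t>(16)) == 256, "YMM");
// 16 columns of uint32_t need ZMM.
static_assert(MinRegisterBits(RowBits<uint32_t>(16)) == 512, "ZMM");
```

With at most 16 columns, 16-bit keys never need more than 256 bits per row, which is consistent with not seeing ZMM in the uint16_t build.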

camel-cdr · 2025-04-26T14:42:34

Btw, here is a VLA vector register sort: https://godbolt.org/z/Env64961q

It has a few more instructions than the VLS version, but the critical dependency chain is the same.

It's also slightly less optimal on x86, because it always uses lane-crossing permutes. For AVX512, 5 of the 15 permutations are vperm but could have been vshuf (assuming the compiler doesn't unroll and optimize the loop).

I wasn't able to figure out how to implement the multi-vector-register sort in a VLA way.
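
For readers who don't want to open the Godbolt link, here is a scalar sketch of the general idea of a length-agnostic single-register sort. It is a plain bitonic network and not necessarily the network used in the linked code; `Lanes`, `CompareExchange` and `SortRegister` are names made up for this illustration. Each layer pairs lane i with lane i ^ j and derives the min/max mask from the lane index, so nothing depends on a compile-time vector length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in for one vector register; the lane count is only known at
// run time, as it would be for a VLA target.
using Lanes = std::vector<uint32_t>;

// One compare-exchange layer of a bitonic network. In a SIMD version this
// would roughly be one permute, one min, one max and one masked select.
static Lanes CompareExchange(const Lanes& v, std::size_t k, std::size_t j) {
  const std::size_t n = v.size();
  Lanes out(n);
  for (std::size_t i = 0; i < n; ++i) {
    const uint32_t partner = v[i ^ j];           // permute: swap with lane i ^ j
    const bool ascending = (i & k) == 0;         // direction of this bitonic block
    const bool lower = (i & j) == 0;             // is i the lower lane of its pair?
    const bool keep_min = (ascending == lower);  // mask computed from lane index
    out[i] = keep_min ? std::min(v[i], partner) : std::max(v[i], partner);
  }
  return out;
}

// Sorts all lanes in ascending order. The layer schedule depends only on the
// lane count, which must be a power of two.
Lanes SortRegister(Lanes v) {
  const std::size_t n = v.size();
  assert(n != 0 && (n & (n - 1)) == 0);
  for (std::size_t k = 2; k <= n; k *= 2) {
    for (std::size_t j = k / 2; j >= 1; j /= 2) {
      v = CompareExchange(v, k, j);
    }
  }
  return v;
}
```

Because the permute index is always computed as i ^ j, a direct translation uses a general permute for every layer, even those where j is small enough that both lanes of a pair sit in the same 128-bit block.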

janwas · 2025-04-26T17:29:08

Nice work :) Clang x86 indeed unrolls, which is good. But setting the CC and AA mask constants looks fairly expensive compared to fixed-pattern shuffles.

Yes, the 2D aspect of the sorting network complicates things. Transposing is already harder to make VLA and fusing it with the other shuffles certainly doesn't help.