Branchless Lomuto Partitioning (orlp.net)
38 points by killcoder on Dec 6, 2023 | 21 comments


There was a recent post by Voultapher from the sort-research-rs project on Branchless Lomuto Partitioning

https://github.com/Voultapher/sort-research-rs/blob/main/wri...

Discussion here:

https://news.ycombinator.com/item?id=38528452

This post by orlp (creator of Pattern-defeating Quicksort and Glidesort) was linked to in the above post, and I found both to be interesting.


What can you actually sort branch-free? Doesn't the comparison have to be inlined, and tiny, not to mention itself branchless? Forget string keys.

IPv4 addresses in a routing table or something.

Pointers, by increasing or decreasing address. Useful if we want to compact the objects in memory.

It's useful to sort floating-point values in a certain way before adding them together, to avoid adding a low magnitude X to a big magnitude Y such that X disappears.
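To make that concrete, here is a minimal sketch (the function name `sum_sorted_by_magnitude` is mine, not from the thread): summing in increasing order of magnitude keeps small addends from being absorbed by a large running total.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Summing in increasing order of magnitude avoids adding a low-magnitude
// X to a big-magnitude Y such that X disappears into rounding.
double sum_sorted_by_magnitude(std::vector<double> v) {
    std::sort(v.begin(), v.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}
```

For example, with doubles, (1e16 + 1.0) + 1.0 rounds back to 1e16 because the spacing between doubles near 1e16 is 2, while summing the two 1.0s first yields 1e16 + 2.0 exactly.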


> Doesn't the comparison have to be inlined, and tiny, not to mention itself branchless?

Yes.

This generally means floats and integers, or small combinations thereof. Note that this is after projection of the comparison operator, so things can remain branchless when sorting e.g. strings by length.
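As a small hedged illustration of "projection of the comparison operator" (`sort_by_length` is a made-up name): sorting strings *by length* means each comparison boils down to a single integer compare, which compilers can emit branchlessly even though the elements themselves are strings.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// The comparator projects each string to an integer (its length),
// so the comparison itself is one branchless-friendly integer compare.
void sort_by_length(std::vector<std::string>& v) {
    std::sort(v.begin(), v.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() < b.size();
              });
}
```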

The inlining generally isn't an issue in languages like C++ and Rust.

> Forget string keys.

You can sort string keys (mostly) branchlessly with radix sorting, but yes, for comparison sorting you can forget it.
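To sketch the radix idea (`counting_pass` is a hypothetical name): a single stable counting-sort pass on one byte position is the building block that MSD radix string sort applies recursively per bucket, and the pass itself never compares one element against another, only does bucket arithmetic.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// One stable counting-sort pass on the byte at position `pos`.
// Strings shorter than pos+1 go into bucket 0 and thus sort first.
std::vector<std::string> counting_pass(const std::vector<std::string>& v,
                                       size_t pos) {
    std::array<size_t, 257> count{};  // bucket 0 = "string too short"
    auto key = [pos](const std::string& s) -> size_t {
        return pos < s.size() ? 1u + (unsigned char)s[pos] : 0u;
    };
    for (const auto& s : v) count[key(s)]++;
    // Prefix sums turn counts into bucket start offsets.
    size_t sum = 0;
    for (auto& c : count) { size_t t = c; c = sum; sum += t; }
    std::vector<std::string> out(v.size());
    for (const auto& s : v) out[count[key(s)]++] = s;
    return out;
}
```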


For string keys, I think you could do multi-key quicksort with almost no branches, but the partitioning becomes more troublesome because you need to separate elements that are <, =, and > your pivot.
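The <, =, > split being described is the Dutch national flag partition. A minimal sketch (`partition3` and the byte-key element type are my choices; this version is branchful and only illustrates the three-way bookkeeping, not how to make it branchless):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Three-way (Dutch national flag) partition around `pivot`.
// Invariant: v[0..lt) < pivot, v[lt..i) == pivot,
//            v[i..gt) unknown, v[gt..n) > pivot.
// Returns the [lt, gt) bounds of the "equal" region; multi-key quicksort
// recurses on < and >, and on the next character within ==.
std::pair<size_t, size_t> partition3(std::vector<unsigned char>& v,
                                     unsigned char pivot) {
    size_t lt = 0, i = 0, gt = v.size();
    while (i < gt) {
        if (v[i] < pivot)      std::swap(v[i++], v[lt++]);
        else if (v[i] > pivot) std::swap(v[i], v[--gt]);
        else                   ++i;
    }
    return {lt, gt};
}
```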


Author here if anyone has any questions.


Thanks for the article! I'm finding it hard to understand this part:

> That is, there is a single iterator scanning the array from left-to-right (j). If the element is found to belong in the left partition, it is swapped with the first element of the right partition (tracked by i), otherwise it is left where it is.

So you start left, the first element on the left partition belongs in the left partition, so you swap it into the right partition? And what about the element that was in the right partition, when do you check where that one belongs?


> So you start left, the first element on the left partition belongs in the left partition, so you swap it into the right partition?

Yes and no. You do temporarily swap it into the right partition, but then incrementing the pointers redefines where the right partition is: the swap with the first element of the right partition happens before both i and j are incremented. In essence this moves the entire right partition one step to the right, while placing the previously unknown element just before it.

Consider this example:

    l l l l r r r r r ? ? ? ?
            ^         ^
            i         j
Now if the first ? was an r element, you could simply increment j and be done with it. But suppose it was an l element, then you have this scenario:

    l l l l r r r r r l ? ? ?
            ^         ^
            i         j
Note that "the first element of the right partition" is equivalent to "the first element after the left partition". Does it now make more sense that swapping the first element of the right partition and the unknown element (v[i], v[j]) is the right thing to do? After our swap we have this:

            +- swap --+
            v         v
    l l l l l r r r r r ? ? ?
            ^         ^
            i         j
So now incrementing i and j both fixes our invariant:

    l l l l l r r r r r ? ? ?
              ^         ^
              i         j
> And what about the element that was in the right partition, when do you check where that one belongs?

That one still belongs in the right partition, which is why we increment both i and j.
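The invariant above can be captured in a few lines. As a hedged sketch (the function name and the int element type are mine, not from the article): doing the swap unconditionally is safe, because when v[j] already belongs on the right the swap merely exchanges two right-partition elements, so only the increment of i needs to depend on the comparison, and that is what the compiler can lower to a flag-setting instruction instead of a branch.

```cpp
#include <cstddef>
#include <vector>

// Branchless-friendly Lomuto partition sketch: the swap is unconditional
// (a harmless exchange of two right-partition elements when x >= pivot),
// and only the increment of i depends on the comparison.
size_t lomuto_partition(std::vector<int>& v, int pivot) {
    size_t i = 0;
    for (size_t j = 0; j < v.size(); ++j) {
        int x = v[j];
        v[j] = v[i];
        v[i] = x;
        i += (x < pivot);  // extend the left partition only for l elements
    }
    return i;  // elements < pivot occupy v[0..i)
}
```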


Ahh, thank you, this clarifies things! I was thinking of Quicksort, where j moves left until it meets i, and missed the fact that they both move in the same direction and i runs through the entire list. Thanks!

EDIT: I'm now on the desktop and see that your graphs clarify this much better, on mobile I had Dark Reader and they were invisible. You may want a white background.


> I was thinking of Quicksort, where j moves left until it meets i, and missed the fact that they both move in the same direction and i runs through the entire list.

You were thinking of Hoare partitioning. Quicksort can use either Hoare or Lomuto (or any other) partitioning algorithm.


Yes, now that you mention it, that does make sense, thank you.


Would it be viable to use SIMD to further optimize the algorithm when the predicate is simple enough? For example, one can imagine taking v[j] .. v[j+N-1] and splitting them into two partitions u[0] .. u[k] and u[k+1] .. u[N-1] (probably in SIMD registers), which then get written at v[i] or v[j] accordingly. The current assembly indeed looks very tight, but higher throughput at the expense of longer code is a common theme in SIMD, I guess.


SIMD partitioning is definitely a thing, but I think it is ill-suited for Lomuto-style partitioning. I think fulcrum (what crumsort uses) or out-of-place partitioning (in both cases the output buffers for < and >= are distinct) like glidesort does is the most amenable to SIMDization.

Then partitioning is 'simply' a vector comparison, two masked compressing stores (through shuffles or _mm_mask_compressstoreu_epi32) with one of the masks inverted, and counting how many elements were smaller with _mm256_movemask_epi8 and a popcnt.
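A scalar model of that step may help (the function name `compress_partition8` is hypothetical): lanes passing the comparison mask are stored contiguously to one output, the rest (inverted mask) to the other, and popcount of the mask says how many went left. On AVX-512 hardware the two loops over the mask become masked compress stores as described above.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of SIMD compress-based partitioning: one "vector" of 8
// lanes is compared against the pivot to build a mask, selected lanes
// are stored contiguously at dst_lo, the rest at dst_hi.
size_t compress_partition8(const int32_t* src, int32_t pivot,
                           int32_t* dst_lo, int32_t* dst_hi) {
    uint8_t mask = 0;
    for (int k = 0; k < 8; ++k)
        mask |= (src[k] < pivot) << k;        // the vector comparison
    size_t lo = 0, hi = 0;
    for (int k = 0; k < 8; ++k) {
        if (mask & (1u << k)) dst_lo[lo++] = src[k];  // compress store, mask
        else                  dst_hi[hi++] = src[k];  // compress store, ~mask
    }
    return lo;  // == popcount(mask)
}
```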

For an out-of-place partition you can interleave the loops of one going left-to-right and one going right-to-left to increase instruction-level parallelism.


You can port the algorithm from TFA pretty straightforwardly. For example, with AVX2 and u64: load 4 elements from i and from j, compare the 4 elements swapped leftward against the threshold, then use a lookup table of shuffles to compute shuffle(xs, lut[movemask(compare_result)]). Store this value on the left and advance the pointer by popcnt(movemask(compare_result)); store the raw value previously loaded on the right and advance the pointer by 4.
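The 16-entry lookup table of shuffles can be built once up front. A hedged sketch of its construction (`make_compress_lut` is my name): entry m lists the lane indices whose mask bit is set, in order, followed by the remaining lanes, so that shuffle(xs, lut[m]) compacts the "smaller than pivot" lanes to the front.

```cpp
#include <array>
#include <cstdint>

// Build the 16-entry shuffle LUT for a 4-lane compress: for each 4-bit
// comparison mask m, selected lanes come first, then the rest.
std::array<std::array<uint8_t, 4>, 16> make_compress_lut() {
    std::array<std::array<uint8_t, 4>, 16> lut{};
    for (uint8_t m = 0; m < 16; ++m) {
        int pos = 0;
        for (uint8_t lane = 0; lane < 4; ++lane)
            if (m & (1u << lane)) lut[m][pos++] = lane;     // selected lanes
        for (uint8_t lane = 0; lane < 4; ++lane)
            if (!(m & (1u << lane))) lut[m][pos++] = lane;  // remaining lanes
    }
    return lut;
}
```

In a real AVX2 port, each entry would be expanded to the byte- or dword-level indices the shuffle instruction expects.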

You face some additional trouble when the right region of already-partitioned elements is smaller than 4. I think this is solvable but I don't have a good proposal off the top of my head.

Edit: well, one solution is to look at the first 8 elements. If at least 4 belong on the right, you can process these 8 carefully and then process the input forwards without worrying about your loads overlapping. If fewer than 4 belong on the right, you can swap these 8 with the last 8, process those 8 carefully, and process the input backwards without worrying about your loads overlapping.

All of this is what we have to do anyway when targeting AVX2, even if filtering some elements and discarding others or partitioning to 2 output buffers, because there are no compressing instructions.


That's actually fair enough. I'm not certain how well the CPU will like the overlapping SIMD reads/writes on the left hand side.

If you have AVX512 available shuffle(xs, lut[movemask(compare_result)]) could also be two compressions with one reverse shuffle in between.


I don't know either! I've never done something like this. Usually I am filtering to the same buffer or one or two other buffers, so I would never immediately load some stuff I had just stored.


Is the assembly the inner loop or the whole function? The function looks like it could be restructured into a tail call - the trailing swap is quite like the loop body - but I'd be surprised if a compiler managed that transform.


The assembly is only for the inner loop, I did not include the preamble/postamble of the functions.

You can look at the full assembly here: https://cpp.godbolt.org/z/zzzTh47PG. But the full assembly isn't a fair comparison right now because I did not bother to convert both functions to have the same signature (Andrey's version assumes the pivot is in the array and selects it in the function).

Virtually all the time for any non-trivial input is spent in the inner loop though.


Lomuto partitioning is flawed though; when implementing Quicksort, you want the original Hoare one.

Has someone attempted branchless Hoare partitioning?


> Lomuto partitioning is flawed though; when implementing Quicksort, you want the original Hoare one.

Can you elaborate? It is 'flawed' in that it requires more moves than Hoare as I write in the conclusion, but as the benchmarks show for a variety of input sizes, types and distributions it is superior despite the extra data movement.

> Has someone attempted branchless Hoare partitioning?

I mention a paper and three implementations (one of which is my own) that implement some variety of branchless Hoare partitioning in one way or another in the section on Hoare partitioning.


Another problem is that if you have an array of identical values, Hoare contrived things so that his two partitioning pointers meet in the middle, ensuring O(n log n) behavior. Lomuto produces a degenerate partition, resulting in the quadratic worst case.


My algorithm pdqsort fixes this problem fundamentally, turning many identical values into an ideal O(n log k) case, where k is the number of unique values, even if it were to use Lomuto.



