> Lomuto partitioning is flawed though; when implementing Quicksort, you want the original Hoare one.
Can you elaborate? It is 'flawed' in that it requires more moves than Hoare as I write in the conclusion, but as the benchmarks show for a variety of input sizes, types and distributions it is superior despite the extra data movement.
> Has someone attempted branchless Hoare partitioning?
I mention a paper and three implementations (one of which is my own) that implement some variety of branchless Hoare partitioning in one way or another in the section on Hoare partitioning.
Another problem is that if you have an array of identical values, Hoare contrived things so that his two partitioning pointers meet in the middle, ensuring O(n log n) behavior. Lomuto produces a degenerate partition, resulting in the quadratic worst case.
My algorithm pdqsort fixes this problem fundamentally which turns many identical values into an ideal O(n log k) case where k is the number of unique values, even if it were to use Lomuto.
Has someone attempted branchless Hoare partitioning?