Neat algorithm, but it is largely comparing apples to oranges. std::sort is universal, ska_sort is not:
> Another problem is that I’m not sure what to do for data that I can’t sort. For example this algorithm can not sort a vector of std::sets
> Another problem is that right now there can only be one sorting behavior per type. You have to provide me with a sort key, and if you provide me with an integer, I will sort your data in increasing order. If you wanted it in decreasing order, there is currently no easy interface to do that.
Useful algorithm, but we already knew there are faster algorithms if we introduce additional requirements!
The O(n log n) lower bound is only for comparison based sorts. If you don't take a comparison function, but instead look at the internal structure, you can do better, but of course that depends on that internal structure and what order you want them in.
This is a classic. In fact it's close to the first algorithm ever patented - SyncSort, patented in 1971. Still being developed and sold, SyncSort remains the leading technology for big sorts.[1] When you need to sort a few terabytes, they have a product for that. It's a radix sort with self-adjusting bucket boundaries. So it can deal with keys that are not well distributed.
Can't test the code as there are numerous compilation errors with my version of gcc and libc++. Besides what others have written, the author seems to have taken only one measurement for each case. And more importantly there is no mention of randomization at all. If he sorted all input sizes using algorithm A and then all input sizes using algorithm B (instead of doing it in random order) it is possible that there is bias in the measurements (due to cache, branch prediction, etc), specially if he used the same input, just varying the size. I'm also worried about what he used to measure time, as he mentions that for a small number of elements it takes a few microseconds (gettimeofday(3) has a resolution of 1 microsecond, so the measured time is dangerously close to the clock resolution - clock_gettime(3) should be use instead if available, having a resolution of 1ns - or the C++ equivalent for nanosecond precision). Making the algorithm code available is not enough, it is just as important (perhaps even more) to make the benchmark code available. The "tests and benchmarks" code available on Github is a .hpp unfit for reproduction. I want to see the raw data used in the plots and the code to generate it, and I want to be able to run it myself. Otherwise there's no reason to trust it.
It is impossible for comparison-based sorts; sketch of proof: there are N! different arrangements. sorting is equivalent to figuring which one of those N! arrangements we are seeing. In the worst case, each binary comparison can reduce at most, 50% of the search space. Hence the worst case may need log_2(N!) comparisons, which is O(N log N)
However, it might be possible if comparisons are not essential - e.g. radix sort is (with some assumptions) O(N)
You only ever have to move every element once to get a (pre-known) permutation that represents the sorted list.
The problem is the information in a permutation (n! possible orderings) and one can only ever "throw away" half of them on every comparison, leading to log_2(n!) or nlog(n).
The STL sort uses operator<, so they can't implement it as radix sort without changing the interface. Maybe they could manage to special case it for some fundamental types, but most likely they'd need to add a new function, like for stable_sort.
This is correct. The C++ standard specifies that sort must have a big-O complexity of n log(n). This can be seen on pdf page 925 of this working draft from 2014 [1].
No, f = O(n log n) means that f doesn't grow significantly faster than n log n. That is true for f = n log n, but it's also true for f = n. The "or better" is implicit by using big-O.
Note that this wouldn't be true if the standard said that it has to be Θ(n log n).
STL does use hybrid sorting. It is usually implemented as an introsort, which begins with a quicksort and switches to heapsort based on the number of elements to be sorted.
I think the author just wanted to point out that "the" STL is not valid / accurate here; while the interface and behavior is defined, the implementation is afaik still vendor specific, e.g., the implementation in GCC may be different from the one provided by Microsoft.
Just to share since it is related, you can sort faster than a naive O(nlogn) algorithm whenever data is in the form of partitions where the partitions are sorted, but the contents of each partition are not. i.e. The largest element in each partition is smaller than the smallest in the next partition.
There the total number of elements are n and the average size of a partition is m. Then you need only apply any O(log(m)) algorithm to each partition to sort everything. That multiplies the time complexity by n/m. Then substitute n = m^c and the sort becomes O(n/c*log(n)) time. That gives you a factor of c speed up on what would otherwise be an O(nlogn) operation.
This trick is usable when you want to sort a B-Tree like structure where the data in each node is unsorted, but the nodes themselves are sorted. The file system hierarchy on a machine is like that. The default output of zfs list is also like that.
> This is a trivial function which is less complicated than the comparison operator that you would have to write for std::sort.
Unless it uses the same interface (aka call a comparator for each eval) it's apples to oranges. I also didn't see comparisons vs swaps enumerated, which matter for complex data structures or complex comparators.
Watching std:sort in sort sound- i can only wonder, why this second fine granular sorting pass, long after you pushed out the first sorting part out of cache?
> Another problem is that I’m not sure what to do for data that I can’t sort. For example this algorithm can not sort a vector of std::sets
> Another problem is that right now there can only be one sorting behavior per type. You have to provide me with a sort key, and if you provide me with an integer, I will sort your data in increasing order. If you wanted it in decreasing order, there is currently no easy interface to do that.
Useful algorithm, but we already knew there are faster algorithms if we introduce additional requirements!