It's true whenever the function call overhead for the comparison operation dominates. Inlining a trivial operation (often just one instruction) is one extreme.
At the other extreme, the L1 instruction cache (typically just 32 kB) gets constantly thrashed by numerous type-specialized functions: each one individually fast when microbenchmarked, but collectively slow.
That's why profiling is a must when high performance is required. Microbenchmarks can be very misleading.
Why would all template specializations be in L1?
Even assuming you actually have multiple specializations in the same binary, which is not a given, only the one currently in use (or about to be used) will be in L1. Instruction cache prefetching is extremely effective.