Keep in mind that most of the details you mentioned are for the developer of the library, not the end user/consumer of the library.
Ultimately most consumers of the library care about whether they can call std::sort(YourSpecialContainer.begin(), YourSpecialContainer.end()); in a consistent manner, with some assurance that the templates and interfaces add nearly zero abstraction overhead.
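A minimal sketch of what that consumer-facing contract looks like. "YourSpecialContainer" is just the hypothetical container from the comment, assumed here to wrap a contiguous buffer; the details are illustrative, not from any real library:

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch only: the container merely has to expose
// random-access iterators with the usual semantics.
class YourSpecialContainer {
public:
    using iterator = std::vector<int>::iterator;

    void push(int v) { data_.push_back(v); }
    iterator begin() { return data_.begin(); }
    iterator end()   { return data_.end(); }

private:
    std::vector<int> data_;  // assumed backing storage
};

int main() {
    YourSpecialContainer c;
    c.push(3); c.push(1); c.push(2);
    // The consumer-facing call from the comment above.
    std::sort(c.begin(), c.end());
}
```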
Nothing but profiling is a good performance metric, preferably under runtime conditions (CPU load, memory bandwidth, etc.) that closely mirror the production setting(s).
It's very easy to forget that things like memory bandwidth, inter-core/CPU links, etc. are all limited resources, and that on real systems those resources are shared.
That holds whenever the function call overhead of the comparison operation dominates. Inlining a trivial comparison (often just one instruction) is one extreme.
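A hedged illustration of that extreme: the comparator passed to std::sort is part of the instantiated template's type, so a trivial lambda typically inlines to a single compare, whereas C's qsort must go through a function pointer on every comparison:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort's comparator is called through a function pointer on every
// comparison, so the call itself usually cannot be inlined away.
int cmp_int(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sort_both(std::vector<int>& v, std::vector<int>& w) {
    // std::sort is instantiated for this specific comparator type, so the
    // trivial "a < b" typically collapses to a single compare instruction.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });

    // Same work, but with per-comparison indirect-call overhead.
    std::qsort(w.data(), w.size(), sizeof(int), cmp_int);
}
```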
At the other extreme, the L1 instruction cache (typically just 32 kB) gets constantly thrashed by numerous type-specialized functions, each individually fast when microbenchmarked, but collectively slow.
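A hypothetical sketch of how those specializations accumulate: every distinct element type instantiates its own copy of the sort, and all of those copies compete for the same small instruction cache:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Each element type gets its own std::sort instantiation, so a binary that
// sorts many different types carries many distinct sort bodies, all
// competing for the same ~32 kB of L1 instruction cache.
void sort_ints(std::vector<int>& v)            { std::sort(v.begin(), v.end()); }
void sort_doubles(std::vector<double>& v)      { std::sort(v.begin(), v.end()); }
void sort_strings(std::vector<std::string>& v) { std::sort(v.begin(), v.end()); }
// ...and one more body for every additional type the program sorts.
```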
That's why profiling is a must when high performance is required. Microbenchmarks can be very misleading.
Why would all the template specializations be in L1?
Even assuming you actually have multiple specializations in the same binary, which is not a given, only the one currently in use, or about to be used very soon, will be in L1. Instruction cache prefetching is extremely effective.