It's true whenever the function call overhead for the comparison operation dominates. Inlining a trivial operation (often just one instruction) is one extreme.
At the other extreme, the L1 instruction cache (typically just 32 kB) gets constantly thrashed by numerous type-specialized functions: each one individually fast when microbenchmarked, but collectively slow.
That's why profiling is a must when high performance is required. Microbenchmarks can be very misleading.
Why would all template specializations be in L1?
Even assuming you actually have multiple specializations in the same binary, which is not a given, only the one currently in use (or about to be used) will be in L1. Instruction cache prefetching is extremely effective.