Keep in mind that most of the details you mentioned are for the developer of the library, not the end user/consumer of the library.
Ultimately most consumers of the library care about whether they can call std::sort(YourSpecialContainer.begin(), YourSpecialContainer.end()); in a consistent manner, with some assurance that the templates and interfaces add nearly zero abstraction overhead.
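A minimal sketch of what that consumer-facing contract looks like. "YourSpecialContainer" is just the hypothetical container from the comment, assumed here to wrap a contiguous buffer; the details are illustrative, not from any real library:

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch only: the container merely has to expose
// random-access iterators with the usual semantics.
class YourSpecialContainer {
public:
    using iterator = std::vector<int>::iterator;

    void push(int v) { data_.push_back(v); }
    iterator begin() { return data_.begin(); }
    iterator end()   { return data_.end(); }

private:
    std::vector<int> data_;  // assumed backing storage
};

int main() {
    YourSpecialContainer c;
    c.push(3); c.push(1); c.push(2);
    // The consumer-facing call from the comment above.
    std::sort(c.begin(), c.end());
}
```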
Nothing but profiling is a good performance metric, preferably under runtime conditions (CPU load, memory bandwidth, etc.) that closely mirror the production setting(s).
It's very easy to forget that things like memory bandwidth, inter-core/CPU links, etc. are all limited resources, and that on real systems those resources are shared.
That holds whenever the function call overhead of the comparison operation dominates. Inlining a trivial comparison (often just one instruction) is one extreme.
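A hedged illustration of that extreme: the comparator passed to std::sort is part of the instantiated template's type, so a trivial lambda typically inlines to a single compare, whereas C's qsort must go through a function pointer on every comparison:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort's comparator is called through a function pointer on every
// comparison, so the call itself usually cannot be inlined away.
int cmp_int(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sort_both(std::vector<int>& v, std::vector<int>& w) {
    // std::sort is instantiated for this specific comparator type, so the
    // trivial "a < b" typically collapses to a single compare instruction.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });

    // Same work, but with per-comparison indirect-call overhead.
    std::qsort(w.data(), w.size(), sizeof(int), cmp_int);
}
```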
At the other extreme, the L1 instruction cache (typically just 32 kB) gets constantly thrashed by numerous type-specialized functions, each individually fast when microbenchmarked, but collectively slow.
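A hypothetical sketch of how those specializations accumulate: every distinct element type instantiates its own copy of the sort, and all of those copies compete for the same small instruction cache:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Each element type gets its own std::sort instantiation, so a binary that
// sorts many different types carries many distinct sort bodies, all
// competing for the same ~32 kB of L1 instruction cache.
void sort_ints(std::vector<int>& v)            { std::sort(v.begin(), v.end()); }
void sort_doubles(std::vector<double>& v)      { std::sort(v.begin(), v.end()); }
void sort_strings(std::vector<std::string>& v) { std::sort(v.begin(), v.end()); }
// ...and one more body for every additional type the program sorts.
```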
That's why profiling is a must when high performance is required. Microbenchmarks can be very misleading.
Why would all the template specializations be in L1?
Even assuming you actually have multiple specializations in the same binary, which is not a given, only the one currently in use, or about to be used very soon, will be in L1. Instruction cache prefetching is extremely effective.