16 years ago, I created a profiling macro system for a select set of critical functions. The profiling system could switch the measured data among #instruction, #cpu_clk, #branch_stall_cycles, #L1_miss, and #L2_miss for all the critical-path functions.
After analyzing the data, I found that branch stall cycles and data access stall cycles were causing huge delays in the critical code path.
I used the following tricks to get rid of the stall cycles (sketched in code after the list):
1) Use branch_likely to force gcc to make sure there is no taken branch at all in the critical execution path. (Saves 30+ CPU cycles per branch; there are a lot of branch stall cycles if one simply follows gcc's generated "optimized" code. MIPS CPU, 200 MHz.)
2) Use prefetch ahead of data structure accesses to get rid of uncached-data delays. (Saves ~50+ CPU cycles per data stall, and there are also a lot of them in the critical path.)
3) Use inline functions, etc., to get rid of call stalls in the critical path.
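A minimal sketch of what those three tricks look like in GCC-flavored C. The likely()/unlikely() macros, the table layout, and the function names are my own illustration, not the original code:

    #include <stddef.h>
    #include <stdint.h>

    /* 1) Branch hints: keep the common case on the fall-through path. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    struct fwd_entry {
        uint32_t dst_addr;
        uint16_t out_port;
        uint16_t flags;
    };

    /* 3) Force inlining so the call itself adds no stall in the hot path. */
    static inline __attribute__((always_inline))
    int check_header(const uint8_t *pkt, size_t len)
    {
        if (unlikely(len < 20))
            return -1;
        if (unlikely((pkt[0] >> 4) != 4))
            return -1;
        return 0;
    }

    int forward_packet(const uint8_t *pkt, size_t len,
                       const struct fwd_entry *table, uint32_t idx)
    {
        /* 2) Start the table entry's cache line on its way before it is
         * needed, hiding the DDR latency behind the header checks. */
        __builtin_prefetch(&table[idx], 0 /* read */, 1 /* low locality */);

        if (unlikely(check_header(pkt, len) != 0))
            return -1;

        /* By now the prefetched line has (hopefully) arrived. */
        return table[idx].out_port;
    }

The idea is that the error branches fall through in the common case, the table line is already in flight before it is dereferenced, and the header check costs no call/return.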
The system got a ~100x increase in overall throughput with those techniques, using pure C optimization on top of the standard -O2 build.
I think it might be possible to create a build system that can automatically collect the profiling data (branch stall cycles and data stall cycles) and use the branch-likely and prefetch instructions to auto-optimize the critical-path code.
Specifying which code paths / function call sequences are the real critical path probably still requires the programmer's touch.
As a result of placing the data prefetch code in the proper places, I didn't need cache locking or any kind of CPU affinity trick to generate optimized object code without stall cycles in the critical code path.
> I think it might be possible to create a build system that can automatically
> collect the profiling data (branch stall cycles and data stall cycles)
> and use the branch-likely and prefetch instructions to auto-optimize the
> critical-path code.
Recent versions of Clang and GCC actually support profile-guided optimization, using gcov-format files to trace executions. There are a number of optimizations the compiler can determine to be profitable once it knows how something typically executes. They don't use sampled or simulated metrics, though, and I'm not sure whether they add prefetch or non-temporal optimizations.
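For reference, a minimal GCC PGO round-trip looks roughly like this; the file and binary names are made up, and Clang's flow is analogous but merges its raw profile with llvm-profdata first:

    /* hot.c -- a toy program whose branch bias PGO can discover.
     *
     * Assumed GCC workflow (the flags are standard):
     *   gcc -O2 -fprofile-generate hot.c -o hot   # instrumented build
     *   ./hot                                     # representative run, writes .gcda
     *   gcc -O2 -fprofile-use hot.c -o hot        # rebuild using the profile
     */
    #include <stdio.h>

    int main(void)
    {
        long sum = 0;
        for (long i = 0; i < 100000000; i++) {
            /* Almost always true; the profile tells the compiler so,
             * without any __builtin_expect annotation in the source. */
            if (i % 1000 != 0)
                sum += i;
            else
                sum -= i;
        }
        printf("%ld\n", sum);
        return 0;
    }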
The ones you mentioned aren't exactly equivalent to cache partitioning (etc.) though. Partitioning allows explicit allocation and prioritization of a contended resource instead of only improving utilization for a single process. So, for example, if two threads/processes each have an 8 MiB working set and they run on the same CPU, they can easily step on each other's data in the cache. If you partition it, though, less frequently used data doesn't have to step on more frequently used data.
A lot of those optimizations would no longer yield any benefits[0]. CPU architecture has evolved a lot in 16 years, especially in branch/code prediction, to the point where a correctly predicted branch (without branch_likely) has almost no cost.
As a CPU architect, I can confirm that none of those except possibly 2) will yield significant benefits. Prefetch hints will only be useful when the particular code fragment is highly memory-bound, because most wide superscalar microarchitectures will easily hide L1/L2 miss latencies.
My qp trie code <http://dotat.at/prog/qp/> got a performance boost of about 20% from adding prefetch hints in the obvious places. The inner loop is fetch / compute / fetch / compute, chaining down into the trie. The next fetch will (usually) be at some small offset from a pointer we can get immediately after the preceding fetch, so prefetch the base pointer, then compute to work out the exact offset.
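Not the actual qp trie code, but the shape of that inner loop is roughly the following; the node layout and field names here are invented for illustration:

    #include <stddef.h>
    #include <stdint.h>

    struct qp_node {
        uint64_t bitmap;        /* which children exist */
        struct qp_node *twigs;  /* base pointer to the child array */
    };

    /* Pointer-chasing lookup: prefetch the base of the child array as soon
     * as we have it, then do the "compute" step (mask, popcount, index)
     * while that cache line is already in flight. */
    const struct qp_node *lookup(const struct qp_node *n,
                                 const uint8_t *key, size_t len)
    {
        size_t depth = 0;
        while (n->twigs != NULL && depth < len) {
            __builtin_prefetch(n->twigs, 0, 3);

            /* Compute which child we want; in the real code this comes
             * from key nibbles and a popcount over the bitmap. */
            uint64_t bit = 1ull << (key[depth] & 0x3f);
            if (!(n->bitmap & bit))
                return NULL;
            unsigned idx = __builtin_popcountll(n->bitmap & (bit - 1));

            n = &n->twigs[idx];  /* the next fetch lands on a prefetched line */
            depth++;
        }
        return n;
    }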
If a DDR stall is 50+ CPU cycles (probably a lot more with today's 2-3 GHz CPUs), I am not sure superscalar microarchitectures would help that much.
At least in my case of a networking packet-forwarding app, I had the profiling data to prove that it was an issue.
The app code is not that long, ~2000 lines after cleanup, but it has a lot of table lookups (DDR stalls) and branches for error-condition checks.
A MIPS is probably the exact opposite of modern (which actually means anything P6 and above) x86 CPUs in terms of performance characteristics. If I were to guess which member of the x86 family might actually benefit from such optimisation, it would be NetBurst (which itself has very different performance characteristics from every other x86 family that came before or after it).
I was trying to optimize a network app, with the goal of getting to 1 million pps. At that time, on a 200 MHz CPU, one cache miss was 50+ cycles, or 25% of the per-packet CPU budget; prefetch helped a lot in that case.
I've occasionally wondered how long it takes for highly hand-optimized C/C++ to be surpassed by optimizing compilers as CPUs advance, with the hand optimizations either making the compiler's job harder or resting on assumptions about the CPU architecture that are no longer valid.
That is, what is the shelf life of a very low-level CPU optimization for Intel hardware?
While it seemed to focus more on compiler optimizations and how to do some low-level benchmarking and optimizing, rather than addressing when those might become obsolete, it was really interesting and informative. Thanks!
What are you looking for that is not currently part of the profile guided optimization capabilities of current compilers? It has been the case for years that you can compile a program with profiling/tracing hooks, run it through a representative workload, and feed the profiler output back to the compiler to get a faster executable. That's why no Firefox you compile yourself will outperform the one you download from Mozilla, if you only compile it a single time.
You beat me to it. Partitioned caches have been explored for years to defeat covert timing channels. Other measures were used decades ago with varying success. I was excited at the title thinking Intel was finally on some bandwagon for suppressing timing channels given cloud market, security concerns, etc.
Another performance enhancement lol. That's good, too, but damnit I was hoping they mentioned timing channels. It will have to be thoroughly scrutinized before relied on for that but it's a start for sure. Hopefully, more than that. :)
> It’s curious that we have low cache hit rates, a lot of time stalled on cache/memory, and low bandwidth utilization.
That's typical for most workloads! Software is almost never compute or bandwidth bound in my experience, but instead spends most of its time waiting on memory in pointer chasing code. This is especially true for code written in managed languages like Java (since everything typically is boxed and allocated all over the heap).
You don't get high bandwidth utilization by pointer chasing unless you have many threads doing it and you switch threads while waiting on memory. That's true for GPUs, not for typical server workloads running on CPUs.
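A toy illustration of the difference: in the list walk below every load address depends on the previous load's result, so a single thread keeps only one miss in flight at a time, whereas the array walk lets the hardware overlap many misses (names and layout are arbitrary):

    #include <stdlib.h>

    struct node {
        struct node *next;
        long payload;
    };

    /* Dependent loads: each ->next must return before the following
     * access can even be issued, so the memory bus sits mostly idle. */
    long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n != NULL) {
            sum += n->payload;
            n = n->next;
        }
        return sum;
    }

    /* Independent, sequential loads: the prefetcher and out-of-order core
     * can keep many misses in flight, so this can approach DRAM bandwidth. */
    long sum_array(const long *a, size_t len)
    {
        long sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += a[i];
        return sum;
    }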
To me, putting a high-priority low-latency process and a low-priority high-throughput process on the same machine is a recipe for disaster: they will find some way to get entangled and hurt each other. Kudos to Google for pulling this off, but the article shows that you need expertise at all levels of the stack to attempt it. Most of us are better off simply buying more boxes :)
Or we can automate this in an OS/cluster manager and safely share it with everyone, like we do with so many other technologies that are difficult for most of us to develop :)
One can dream... Of course, having this available as open-source technology would be great, but for now it is a piece of proprietary tech inside Google.
The SPECint and SPECfp benchmark suites were never very tightly connected to real-world performance, but I wonder what these figures look like for SPECjbb, the java benchmark.
(If I recall correctly, SPECjbb spins up an app that looks a lot like "java pet store" and runs clients against it.)
Never use a single workload as a predictor of performance of the universe of all other workloads. The best benchmark is not a benchmark at all – it is your actual workload.
Regarding cpusets, I wonder how much contention is being created just by software assuming that the machine's total CPU count matches the number of CPUs in its affinity set. For example, C++ has hardware_concurrency now, Python has cpu_count, Java has fork-join, etc...
I'm not necessarily questioning the accuracy, just noting that these interfaces generally don't consider the affinities, and a lot of software assumes that the number of CPUs matches the concurrency actually available. If only half the CPUs are in the affinity set, the processes/threads could contend up to twice as much as expected. I guess depending on how the number is used it could improve throughput, though (e.g. when blocking on I/O).
Depends on the language used in the spec, but "CPUs available for scheduling" seems like the definition most software should use. However, I suspect most software is built using an interface that returns the total CPU count for the machine.
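On Linux the two numbers come from different calls and diverge as soon as a cpuset or taskset restricts the process; a minimal sketch (Linux/glibc-specific):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Total online CPUs in the machine -- what many runtimes report. */
        long online = sysconf(_SC_NPROCESSORS_ONLN);

        /* CPUs this process is actually allowed to be scheduled on. */
        cpu_set_t set;
        int schedulable = -1;
        if (sched_getaffinity(0, sizeof(set), &set) == 0)
            schedulable = CPU_COUNT(&set);

        printf("online: %ld, schedulable: %d\n", online, schedulable);
        return 0;
    }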
At the very least you should be aware that these counts usually count cores with hyperthreading twice - and hyperthreading does indeed provide opportunity for increased parallelism but it is noticeably worse than having another separate physical core.
Right. Currently runtime.NumCPU tries to be fancy by looking at the population count of the cpuset mask[1]. However in a hosted environment using containers there's no reason to believe that the cpuset will remain fixed over the life of the process. This can undercount the available CPUs, leaving you with a GOMAXPROCS that is too low.
Anecdotally, it's very often not bad (and in fact sometimes "good") to over-provision MAXPROCS. We have used as much as 3 to 6x the number of hyperthreaded cores with good results, depending on the workload. This could insulate you against some container changes.
Of all the sources outside of Google, I'd say XKCD is likely to be closer than most. Even if you ignore the fact that Randall (XKCD) spends a good amount of his time estimating things in a rigorous and scientific way that would mean his estimates are actually pretty good, he has the exact background that means people at Google would take his call if he picked up the phone to muse upon the question of how many machines Google has. They might not give specifics, but helping would be a pretty cool thing to do so they'd probably do it.
Do you imagine that the error bars would be so large as to significantly change the order of magnitude? I don't.
Anyway, it's an offhanded comment and its precision is almost completely irrelevant to the article. I don't fault the author for not spending too much time on it.