Why Intel Added Cache Partitioning (danluu.com)
213 points by dangerman on Oct 5, 2015 | 53 comments



16 years ago, I created a profiling macro system for a select set of critical functions. The profiling system could switch the measured counter among #instructions, #cpu_clk, #branch_stall_cycles, #L1_miss, and #L2_miss for all the critical-path functions.

After analyzing the data, I found that branch stall cycles and data access stall cycles were causing a huge number of delays in the critical code path.

I used the following tricks to get rid of the stall cycles.

1) Use branch_likely hints to force gcc to keep the critical execution path free of taken branches (saves 30+ CPU cycles per branch; there are a lot of branch stall cycles if one simply follows the gcc-generated "optimized" code; this was a 200MHz MIPS CPU). See the sketch after this list.

2) Use prefetch ahead of data structure accesses to get rid of uncached-data delays (saves ~50+ CPU cycles per data stall, and there are a lot of them in the critical path).

3) Use inline functions, etc. to get rid of call overhead in the critical path.
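
Roughly what 1) and 2) look like in C. This is a minimal sketch, not my original code; the struct, the forwarding table, and the length check are made up for illustration:

    /* Sketch of tricks 1) and 2); gcc-specific builtins. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    struct pkt { unsigned len; unsigned dst; unsigned char data[64]; };
    struct pkt_entry { unsigned out_port; };
    extern struct pkt_entry *table[256];   /* hypothetical forwarding table */

    int handle(struct pkt *p)
    {
        /* 2) kick off the table-entry load early so the memory access
           overlaps with the work below instead of stalling later */
        struct pkt_entry *e = table[p->dst & 0xff];
        __builtin_prefetch(e);

        /* 1) mark error checks unlikely so the hot path falls straight
           through with no taken branches */
        if (unlikely(p->len < 20))
            return -1;

        /* ... other per-packet work; by the time we touch *e it is
           (hopefully) already in cache ... */
        return (int)e->out_port;
    }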

The system got a ~100x increase in overall throughput with those techniques, using just pure C optimizations on top of the standard -O2 build.

I think it might be possible to create a build system that can automatically collect the profiling data (branch stall cycles and data stall cycles) and use branch-likely hints and prefetch instructions to auto-optimize the critical-path code.

Specifying which code paths / function call sequences are the real critical path would probably still require a programmer's touch.

As a result of putting data prefetches in the proper places, I didn't need cache locking or any kind of CPU affinity trick to generate optimized object code without stall cycles in the critical code path.


    > I think it might be possible to create a build system that can automatically 
    > collect the profiling data (branch stall cycles and data stall cycles) 
    > and use the branch likely and prefetch instructions to auto-optimized the 
    > critical path code.
Recent versions of Clang and GCC actually support profile-guided optimization, using gcov-format files to trace executions. There are a number of optimizations that the compiler can determine are profitable when it knows how something typically executes. They don't use sampled or simulated metrics though, and I'm not sure if they add prefetch or non-temporal optimizations.
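
For reference, the basic GCC flow looks roughly like this (Clang is similar, though it defaults to its own profile format); the file names and workload here are just placeholders:

    # rough sketch of a profile-guided build
    gcc -O2 -fprofile-generate -o app app.c     # instrumented build
    ./app < representative_workload             # run writes .gcda profile data
    gcc -O2 -fprofile-use -o app app.c          # rebuild using the profile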

The ones you mentioned aren't exactly equivalent to cache partitioning (etc.) though. Partitioning allows explicit allocation and prioritization of a contended resource instead of only improving utilization for a single process. So, for example, if two threads/processes each have an 8MiB working set and they run on the same CPU, they can easily step on each other's data in the cache. If you partition the cache, though, less frequently used data doesn't have to step on more frequently used data.


A lot of those optimizations would no longer yield any benefit[0]. CPU architectures have evolved a lot in 16 years, especially in branch prediction, to the point where a correctly predicted branch (without branch_likely) has almost no cost.

[0]: At least, this is true for x86 CPUs.


As a CPU architect, I can confirm that all those except possibly 2) will not yield significant benefits. Prefetching hints will only be useful when the particular code fragment is highly memory-bound because most wide superscalar microarchitectures will easily hide L1/L2 miss latencies.


My qp trie code <http://dotat.at/prog/qp/> got a performance boost of about 20% from adding prefetch hints in the obvious places. The inner loop is fetch / compute / fetch / compute, chaining down into the trie. The next fetch will (usually) be at some small offset from a pointer we can get immediately after the preceding fetch, so prefetch the base pointer, then compute to work out the exact offset.
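
The shape of the loop is roughly this (a simplified sketch, not the actual qp trie code; the node layout and the child_index helper are made up for illustration):

    struct node { unsigned bitmap; struct node *twigs; };

    /* hypothetical helper that picks the child slot from the bitmap */
    extern unsigned child_index(unsigned bitmap, const unsigned char *key);

    struct node *walk(struct node *n, const unsigned char *key)
    {
        while (n->twigs) {
            struct node *base = n->twigs;
            __builtin_prefetch(base);                  /* prefetch the base pointer now ... */
            unsigned i = child_index(n->bitmap, key);  /* ... compute the exact offset ... */
            n = &base[i];                              /* ... then do the dependent fetch */
        }
        return n;
    }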


If a DDR stall is 50+ CPU cycles (probably a lot more with today's 2-3GHz CPUs), I am not sure superscalar microarchitectures would help that much.

At least in my case of a network packet forwarding app, I had the profiling data to prove that it was an issue.

The app code is not that long, ~2000 lines after cleanup, but it has a lot of table lookups (DDR stalls) and branches for error condition checks.


Ditto for prefetch instructions:

https://lwn.net/Articles/444336/

A MIPS is probably the exact opposite to modern (which actually means anything P6 and above) x86 CPUs in terms of performance characteristics. If I were to guess what member of the x86 family might actually benefit from such optimisation, it would be NetBurst (which itself has very different performance characteristics from every other x86 family that came before or after it.)


I was trying to optimize a network app; the goal was to get to 1 million pps. On that 200MHz CPU the budget is only 200 cycles per packet, so one 50+ cycle cache miss eats 25% of it. Prefetch helped a lot in that case.


I've occasionally wondered how long it takes highly optimized C/C++ to be surpassed by optimizing compilers as CPUs advance, with the hand optimizations either making compiler optimization harder or targeting assumptions about CPU architecture that are no longer valid.

That is, what is the shelf life of a very low-level CPU optimization for Intel hardware?


Well, while not strictly on topic, there was a recent talk on micro-optimisations.

https://www.youtube.com/watch?v=nXaxk27zwlk


While it covered more of the compiler optimizations and how to do some low-level benchmarking and optimizing, and didn't really address when those might become obsolete, it was really interesting and informative. Thanks!



So... profile-guided optimization lol.


Yeah. Nothing ground-breaking here.

Would still love to have a great open-source tool for this.


You mean like oprofile, dtrace, system tap, cachegrind, etc.?


Ah, I see where you may have got confused.

I meant regarding the automatic application of rules based on profiling.


What are you looking for that is not currently part of the profile guided optimization capabilities of current compilers? It has been the case for years that you can compile a program with profiling/tracing hooks, run it through a representative workload, and feed the profiler output back to the compiler to get a faster executable. That's why no Firefox you compile yourself will outperform the one you download from Mozilla, if you only compile it a single time.


In my experience, none of them change the source code itself.


Feature or bug?


Feature, obviously.


This also helps prevent cache side-channel attacks like the one that led to stealing RSA private keys by probing the L3 miss times.

https://eprint.iacr.org/2015/898.pdf


You beat me to it. Partitioned caches have been explored for years to defeat covert timing channels. Other measures were used decades ago with varying success. I was excited at the title thinking Intel was finally on some bandwagon for suppressing timing channels given cloud market, security concerns, etc.

Another performance enhancement lol. That's good, too, but damnit I was hoping they'd mention timing channels. It will have to be thoroughly scrutinized before being relied on for that, but it's a start for sure. Hopefully, more than that. :)


I find it really weird that the article says:

> It’s curious that we have low cache hit rates, a lot of time stalled on cache/memory, and low bandwidth utilization.

That's typical for most workloads! Software is almost never compute or bandwidth bound in my experience, but instead spends most of its time waiting on memory in pointer chasing code. This is especially true for code written in managed languages like Java (since everything typically is boxed and allocated all over the heap).


It's curious that there's low bandwidth utilization despite the low hit rates / the CPU being stalled waiting for memory. Don't you think?

Perhaps random access of small data causes frequent waits without utilizing the bandwidth in a way that block copies would.


You don't get high bandwidth utilization by pointer chasing unless you have many threads doing it and you switch threads while waiting on memory. That's true for GPUs, not for typical server workloads running on CPUs.
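
A toy illustration of the difference (my own sketch, not from the article): the pointer chase below is limited by memory latency, while the streaming sum can keep the memory bus busy.

    struct node { struct node *next; char pad[56]; };

    /* latency-bound: each load depends on the previous one, so only
       one cache miss is outstanding at a time */
    struct node *chase(struct node *n, long steps)
    {
        while (steps--)
            n = n->next;
        return n;
    }

    /* bandwidth-bound: the loads are independent, so the hardware can
       have many misses in flight and actually use the memory bus */
    long stream(const long *a, long len)
    {
        long sum = 0;
        for (long i = 0; i < len; i++)
            sum += a[i];
        return sum;
    }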


To me, putting a high-priority low-latency process and a low-priority high-throughput process on the same machine is a recipe for disaster - they will find some way to get entangled and hurt each other. Kudos to Google for pulling this off, but the article shows that you need expertise at all levels of the stack to attempt this. Most of us are better off simply buying more boxes :)


Or we can automate this in an OS/cluster manager and safely share it with everyone, like we do with so many other technologies that are difficult for most of us to develop :)


One can dream... Of course, having this available as an open-source technology would be great, but for now it is a piece of proprietary tech inside Google.


The SPECint and SPECfp benchmark suites were never very tightly connected to real-world performance, but I wonder what these figures look like for SPECjbb, the java benchmark.

(If I recall correctly, SPECjbb spins up an app that looks a lot like "java pet store" and runs clients against it.)


What I've heard is that you should mostly pay attention to the gcc compilation part of SPEC if you want a good indication of real world performance.


Never use a single workload as a predictor of performance of the universe of all other workloads. The best benchmark is not a benchmark at all – it is your actual workload.


Regarding cpusets, I wonder how much contention is being created just by software assuming the total number of CPUs matches the number of CPUs in its affinity set. For example, C++ has hardware_concurrency now, Python has cpu_count, Java has fork/join, etc...


Golang also has a totally inaccurate routine for counting the CPUs in the machine.


I'm not necessarily questioning the accuracy, just that these routines generally don't consider affinities, and a lot of software assumes that the number of CPUs matches the concurrency available. If only half the CPUs are in the affinity set, the processes/threads could contend roughly twice as much as necessary. I guess depending on how the number is used it could improve throughput though (e.g. when blocking on I/O).


Depends on the language used in the spec, but "CPUs available for scheduling" seems like the definition most software should use. However, I suspect most software is built using an interface that returns the total CPU count for the machine.
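
On Linux the two numbers are easy to compare; a quick sketch using sched_getaffinity and glibc's CPU_COUNT (just an illustration, not what any particular runtime actually does):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* what most "CPU count" APIs roughly report */
        long online = sysconf(_SC_NPROCESSORS_ONLN);

        /* what this process is actually allowed to run on */
        cpu_set_t set;
        CPU_ZERO(&set);
        sched_getaffinity(0, sizeof(set), &set);   /* pid 0 = calling process */
        int allowed = CPU_COUNT(&set);

        printf("online CPUs: %ld, CPUs in affinity set: %d\n", online, allowed);
        return 0;
    }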


The language is typically vague, if there is any. POSIX also notably never had anything to say in regard to cpu affinity.


What's the accurate way?



At the very least you should be aware that these counts usually count a hyperthreaded core twice - hyperthreading does indeed provide an opportunity for increased parallelism, but it is noticeably worse than having another separate physical core.


Probably something like hwloc which can be configured to show you only the cpus you're allowed to use.

BTW the same problem also happens for determining available memory in containers.


To calculate the number of processors available or the optimal number of threads/processes?


Well specifically it would be cool if it counted the optimal number to set GOMAXPROCS to, since I think that's like the main use of runtime.NumCPU().


Right. Currently runtime.NumCPU tries to be fancy by looking at the population count of the cpuset mask[1]. However in a hosted environment using containers there's no reason to believe that the cpuset will remain fixed over the life of the process. This can undercount the available CPUs, leaving you with a GOMAXPROCS that is too low.

1: https://code.google.com/p/go/source/browse/src/pkg/runtime/t...


Anecdotally, it's very often not bad (and in fact sometimes "good") to over-provision GOMAXPROCS. We have used as much as 3 to 6x the number of hyperthreaded cores with good results, depending on the workload. This could insulate you against some container changes.


Seems like if they did some other metric, it could overcount the number of CPUs instead, right?

What about changing GOMAXPROCS once per minute in a goroutine that calls NumCPU()?


Summary: because it makes your computer faster.


Geez, tough crowd here. How about this:

Summary: because Google asked for it.


Keep digging.


> Xkcd estimated that Google owns 2 million machines

Xkcd doesn't sound like a very convincing source. And at the very least this estimate should have some error bars.


Of all the sources outside of Google, I'd say XKCD is likely to be closer than most. Even if you ignore the fact that Randall (XKCD) spends a good amount of his time estimating things in a rigorous and scientific way, which means his estimates are actually pretty good, he has exactly the background that means people at Google would take his call if he picked up the phone to muse on the question of how many machines Google has. They might not give specifics, but helping would be a pretty cool thing to do, so they'd probably do it.


Such a situation did happen: Randall answered how many punch cards it would take to store all of the data in Google's datacenters (https://what-if.xkcd.com/63/). In response, Google sent him punch cards (http://blog.ted.com/using-serious-math-to-answer-weird-quest...).


Do you imagine that the error bars would be so large as to significantly change the order of magnitude? I don't.

Anyway, it's an offhanded comment and its precision is almost completely irrelevant to the article. I don't fault the author for not spending too much time on it.


Let's just say that when Xkcd made that estimate, it seemed pretty good to a lot of people.



