16 years ago, I created a profiling macro system for a select set of critical functions. The profiling system could switch the measured data among #instruction, #cpu_clk, #branch_stall_cycles, #L1_miss, and #L2_miss for all the critical-path functions.
After analyzing the data, I found that branch stall cycles and data access stall cycles were causing huge delays in the critical code path.
I used the following tricks to get rid of the stall cycles (sketched in code after the list):
1) Use branch_likely to force gcc to make sure there is no taken branch at all in the critical execution path. (Saves 30+ CPU cycles per branch; there are a lot of branch stall cycles if one simply follows gcc's generated "optimized" code. MIPS CPU, 200 MHz.)
2) Use prefetch ahead of data structure accesses to get rid of uncached-data delays. (Saves ~50+ CPU cycles per data stall, and there are also a lot of them in the critical path.)
3) Use inline functions, etc., to get rid of call stalls in the critical path.
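A minimal sketch of what those three tricks look like in GCC-flavored C. The likely()/unlikely() macros, the table layout, and the function names are my own illustration, not the original code:

    #include <stddef.h>
    #include <stdint.h>

    /* 1) Branch hints: keep the common case on the fall-through path. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    struct fwd_entry {
        uint32_t dst_addr;
        uint16_t out_port;
        uint16_t flags;
    };

    /* 3) Force inlining so the call itself adds no stall in the hot path. */
    static inline __attribute__((always_inline))
    int check_header(const uint8_t *pkt, size_t len)
    {
        if (unlikely(len < 20))
            return -1;
        if (unlikely((pkt[0] >> 4) != 4))
            return -1;
        return 0;
    }

    int forward_packet(const uint8_t *pkt, size_t len,
                       const struct fwd_entry *table, uint32_t idx)
    {
        /* 2) Start the table entry's cache line on its way before it is
         * needed, hiding the DDR latency behind the header checks. */
        __builtin_prefetch(&table[idx], 0 /* read */, 1 /* low locality */);

        if (unlikely(check_header(pkt, len) != 0))
            return -1;

        /* By now the prefetched line has (hopefully) arrived. */
        return table[idx].out_port;
    }

The idea is that the error branches fall through in the common case, the table line is already in flight before it is dereferenced, and the header check costs no call/return.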
The system got a ~100x increase in overall throughput with those techniques, using pure C optimization on top of the standard -O2 build.
I think it might be possible to create a build system that can automatically collect the profiling data (branch stall cycles and data stall cycles) and use the branch-likely and prefetch instructions to auto-optimize the critical-path code.
Specifying which code paths / function call sequences are the real critical path probably still requires the programmer's touch.
As a result of placing the data prefetch code in the proper places, I didn't need cache locking or any kind of CPU affinity trick to generate optimized object code without stall cycles in the critical code path.
> I think it might be possible to create a build system that can automatically
> collect the profiling data (branch stall cycles and data stall cycles)
> and use the branch-likely and prefetch instructions to auto-optimize the
> critical-path code.
Recent versions of Clang and GCC actually support profile-guided optimization, using gcov-format files to trace executions. There are a number of optimizations the compiler can determine to be profitable once it knows how something typically executes. They don't use sampled or simulated metrics, though, and I'm not sure whether they add prefetch or non-temporal optimizations.
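For reference, a minimal GCC PGO round-trip looks roughly like this; the file and binary names are made up, and Clang's flow is analogous but merges its raw profile with llvm-profdata first:

    /* hot.c -- a toy program whose branch bias PGO can discover.
     *
     * Assumed GCC workflow (the flags are standard):
     *   gcc -O2 -fprofile-generate hot.c -o hot   # instrumented build
     *   ./hot                                     # representative run, writes .gcda
     *   gcc -O2 -fprofile-use hot.c -o hot        # rebuild using the profile
     */
    #include <stdio.h>

    int main(void)
    {
        long sum = 0;
        for (long i = 0; i < 100000000; i++) {
            /* Almost always true; the profile tells the compiler so,
             * without any __builtin_expect annotation in the source. */
            if (i % 1000 != 0)
                sum += i;
            else
                sum -= i;
        }
        printf("%ld\n", sum);
        return 0;
    }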
The ones you mentioned aren't exactly equivalent to cache partitioning (etc.) though. Partitioning allows explicit allocation and prioritization of a contended resource instead of only improving utilization for a single process. So, for example, if two threads/processes each have an 8 MiB working set and they run on the same CPU, they can easily step on each other's data in the cache. If you partition it, though, less frequently used data doesn't have to step on more frequently used data.
A lot of those optimizations would no longer yield any benefits[0]. CPU architecture has evolved a lot in 16 years, especially in branch/code prediction, to the point where a correctly predicted branch (without branch_likely) has almost no cost.
As a CPU architect, I can confirm that none of those except possibly 2) will yield significant benefits. Prefetch hints will only be useful when the particular code fragment is highly memory-bound, because most wide superscalar microarchitectures will easily hide L1/L2 miss latencies.
My qp trie code <http://dotat.at/prog/qp/> got a performance boost of about 20% from adding prefetch hints in the obvious places. The inner loop is fetch / compute / fetch / compute, chaining down into the trie. The next fetch will (usually) be at some small offset from a pointer we can get immediately after the preceding fetch, so prefetch the base pointer, then compute to work out the exact offset.
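Not the actual qp trie code, but the shape of that inner loop is roughly the following; the node layout and field names here are invented for illustration:

    #include <stddef.h>
    #include <stdint.h>

    struct qp_node {
        uint64_t bitmap;        /* which children exist */
        struct qp_node *twigs;  /* base pointer to the child array */
    };

    /* Pointer-chasing lookup: prefetch the base of the child array as soon
     * as we have it, then do the "compute" step (mask, popcount, index)
     * while that cache line is already in flight. */
    const struct qp_node *lookup(const struct qp_node *n,
                                 const uint8_t *key, size_t len)
    {
        size_t depth = 0;
        while (n->twigs != NULL && depth < len) {
            __builtin_prefetch(n->twigs, 0, 3);

            /* Compute which child we want; in the real code this comes
             * from key nibbles and a popcount over the bitmap. */
            uint64_t bit = 1ull << (key[depth] & 0x3f);
            if (!(n->bitmap & bit))
                return NULL;
            unsigned idx = __builtin_popcountll(n->bitmap & (bit - 1));

            n = &n->twigs[idx];  /* the next fetch lands on a prefetched line */
            depth++;
        }
        return n;
    }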
If a DDR stall is 50+ CPU cycles (probably a lot more with today's 2-3 GHz CPUs), I am not sure superscalar microarchitectures would help that much.
At least in my case of a networking packet-forwarding app, I had the profiling data to prove that it was an issue.
The app code is not that long, ~2000 lines after cleanup, but it has a lot of table lookups (DDR stalls) and branches for error-condition checks.
A MIPS is probably the exact opposite of modern (which actually means anything P6 and above) x86 CPUs in terms of performance characteristics. If I were to guess which member of the x86 family might actually benefit from such optimisation, it would be NetBurst (which itself has very different performance characteristics from every other x86 family that came before or after it).
I was trying to optimize a network app, with the goal of getting to 1 million pps. At that time, on a 200 MHz CPU, one cache miss was 50+ cycles, or 25% of the per-packet CPU budget; prefetch helped a lot in that case.
I've occasionally wondered how long it takes for highly hand-optimized C/C++ to be surpassed by optimizing compilers as CPUs advance, with the hand optimizations either making the compiler's job harder or resting on assumptions about the CPU architecture that are no longer valid.
That is, what is the shelf life of a very low-level CPU optimization for Intel hardware?
While it seemed to focus more on compiler optimizations and how to do some low-level benchmarking and optimizing, rather than addressing when those might become obsolete, it was really interesting and informative. Thanks!
What are you looking for that is not currently part of the profile guided optimization capabilities of current compilers? It has been the case for years that you can compile a program with profiling/tracing hooks, run it through a representative workload, and feed the profiler output back to the compiler to get a faster executable. That's why no Firefox you compile yourself will outperform the one you download from Mozilla, if you only compile it a single time.
You beat me to it. Partitioned caches have been explored for years to defeat covert timing channels. Other measures were used decades ago with varying success. I was excited at the title thinking Intel was finally on some bandwagon for suppressing timing channels given cloud market, security concerns, etc.
Another performance enhancement lol. That's good, too, but damnit I was hoping they mentioned timing channels. It will have to be thoroughly scrutinized before relied on for that but it's a start for sure. Hopefully, more than that. :)
> It’s curious that we have low cache hit rates, a lot of time stalled on cache/memory, and low bandwidth utilization.
That's typical for most workloads! Software is almost never compute or bandwidth bound in my experience, but instead spends most of its time waiting on memory in pointer chasing code. This is especially true for code written in managed languages like Java (since everything typically is boxed and allocated all over the heap).
You don't get high bandwidth utilization by pointer chasing unless you have many threads doing it and you switch threads while waiting on memory. That's true for GPUs, not for typical server workloads running on CPUs.
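A toy illustration of the difference: in the list walk below every load address depends on the previous load's result, so a single thread keeps only one miss in flight at a time, whereas the array walk lets the hardware overlap many misses (names and layout are arbitrary):

    #include <stdlib.h>

    struct node {
        struct node *next;
        long payload;
    };

    /* Dependent loads: each ->next must return before the following
     * access can even be issued, so the memory bus sits mostly idle. */
    long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n != NULL) {
            sum += n->payload;
            n = n->next;
        }
        return sum;
    }

    /* Independent, sequential loads: the prefetcher and out-of-order core
     * can keep many misses in flight, so this can approach DRAM bandwidth. */
    long sum_array(const long *a, size_t len)
    {
        long sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += a[i];
        return sum;
    }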
To me, putting a high-priority low-latency process and a low-priority high-throughput process on the same machine is a recipe for disaster: they will find some way to get entangled and hurt each other. Kudos to Google for pulling this off, but the article shows that you need expertise at all levels of the stack to attempt it. Most of us are better off simply buying more boxes :)
Or we can automate this in an OS/cluster manager and safely share it with everyone, like we do with so many other technologies that are difficult for most of us to develop :)
One can dream... Of course, having this available as open-source technology would be great, but for now it is a piece of proprietary tech inside Google.
The SPECint and SPECfp benchmark suites were never very tightly connected to real-world performance, but I wonder what these figures look like for SPECjbb, the java benchmark.
(If I recall correctly, SPECjbb spins up an app that looks a lot like "java pet store" and runs clients against it.)
Never use a single workload as a predictor of performance of the universe of all other workloads. The best benchmark is not a benchmark at all – it is your actual workload.
Regarding cpusets, I wonder how much contention is being created just by software assuming that the machine's total CPU count matches the number of CPUs in its affinity set. For example, C++ has hardware_concurrency now, Python has cpu_count, Java has fork-join, etc...
I'm not necessarily questioning the accuracy, just noting that these interfaces generally don't consider the affinities, and a lot of software assumes that the number of CPUs matches the concurrency actually available. If only half the CPUs are in the affinity set, the processes/threads could contend up to twice as much as expected. I guess depending on how the number is used it could improve throughput, though (e.g. when blocking on I/O).
Depends on the language used in the spec, but "CPUs available for scheduling" seems like the definition most software should use. However, I suspect most software is built using an interface that returns the total CPU count for the machine.
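On Linux the two numbers come from different calls and diverge as soon as a cpuset or taskset restricts the process; a minimal sketch (Linux/glibc-specific):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Total online CPUs in the machine -- what many runtimes report. */
        long online = sysconf(_SC_NPROCESSORS_ONLN);

        /* CPUs this process is actually allowed to be scheduled on. */
        cpu_set_t set;
        int schedulable = -1;
        if (sched_getaffinity(0, sizeof(set), &set) == 0)
            schedulable = CPU_COUNT(&set);

        printf("online: %ld, schedulable: %d\n", online, schedulable);
        return 0;
    }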
At the very least you should be aware that these counts usually count cores with hyperthreading twice - and hyperthreading does indeed provide opportunity for increased parallelism but it is noticeably worse than having another separate physical core.
Right. Currently runtime.NumCPU tries to be fancy by looking at the population count of the cpuset mask[1]. However in a hosted environment using containers there's no reason to believe that the cpuset will remain fixed over the life of the process. This can undercount the available CPUs, leaving you with a GOMAXPROCS that is too low.
Anecdotally, it's very often not bad (and in fact sometimes "good") to over-provision MAXPROCS. We have used as much as 3 to 6x the number of hyperthreaded cores with good results, depending on the workload. This could insulate you against some container changes.
Of all the sources outside of Google, I'd say XKCD is likely to be closer than most. Even if you ignore the fact that Randall (XKCD) spends a good amount of his time estimating things in a rigorous and scientific way that would mean his estimates are actually pretty good, he has the exact background that means people at Google would take his call if he picked up the phone to muse upon the question of how many machines Google has. They might not give specifics, but helping would be a pretty cool thing to do so they'd probably do it.
Do you imagine that the error bars would be so large as to significantly change the order of magnitude? I don't.
Anyway, it's an offhanded comment and its precision is almost completely irrelevant to the article. I don't fault the author for not spending too much time on it.