A cache miss is not a cache miss (larshagencpp.github.io)
126 points by ingve on May 1, 2016 | 39 comments



That's a nice demonstration. I would even go a bit further with the conclusion/lesson: Cache misses are not evil.

When data isn't in a register and isn't in the L1 cache, it takes a while to fetch it from the outer cache levels, or about 200 clock cycles if it comes all the way from main memory. We measure that event as a cache miss. But modern x86 processors will go to great lengths to execute the rest of the program while waiting for the data to arrive. A cache miss only really slows the program down if there aren't enough nearby instructions that can be executed while waiting for the missing value to arrive.

You could likely write a program that triggers a cache miss every 30 clock cycles but runs at the same speed as a program without cache misses. In a different program, a cache miss every 30 clock cycles can mean a slowdown by two orders of magnitude. Cache misses are only a useful metric to give us an idea where to look, not to show actual problems.
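
To make the two extremes concrete, here is a minimal sketch (my illustration, not from the article; sizes and constants are made up). Both loops incur the same number of cache misses, but in the first the next address never depends on the previous load, so an out-of-order core can keep many misses in flight, while in the second each load must wait for the one before it:

    // Two loops over the same data: independent misses can overlap,
    // dependent (pointer-chasing) misses serialize.
    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    std::uint64_t sum_independent(const std::vector<std::uint32_t>& idx,
                                  const std::vector<std::uint32_t>& data) {
        std::uint64_t sum = 0;
        for (std::uint32_t i : idx)   // next index is known up front,
            sum += data[i];           // so the misses overlap freely
        return sum;
    }

    std::uint64_t sum_dependent(const std::vector<std::uint32_t>& next) {
        std::uint64_t sum = 0;
        std::uint32_t i = 0;
        for (std::size_t n = 0; n < next.size(); ++n) {
            i = next[i];              // next index depends on this load,
            sum += i;                 // so the misses serialize
        }
        return sum;
    }

    int main() {
        constexpr std::size_t N = 1 << 22;  // ~4M elements, larger than any cache
        std::vector<std::uint32_t> idx(N), data(N);
        std::iota(idx.begin(), idx.end(), 0u);
        std::shuffle(idx.begin(), idx.end(), std::mt19937{42});
        std::iota(data.begin(), data.end(), 0u);
        // idx is a random permutation, so it doubles as the "next
        // pointer" table for the dependent loop.
        volatile std::uint64_t sink = sum_independent(idx, data);
        sink = sum_dependent(idx);
        (void)sink;
    }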


Cache misses are not evil

...unless they're instruction cache misses, which can literally cause the CPU to run out of instructions to execute.


There is only so much work to go around before cache misses turn into waiting games, even with branch prediction and deep pipelining.

What is your suggested alternative metric?


Worth noting that modern processors aren't just pipelined; each core is composed of multiple pipelined units operating in parallel. Using the Tomasulo algorithm and its derivatives, modern processors can execute many later instructions before instructions that appear earlier in the program have completed. The biggest issue is a branch that depends on a slow operation, which forces you to throw out tons of work if you mispredict.


Could you define 'slow operations'? I am trying to picture a CPU having to throw out work, where it is basically trying to get from one cache miss (failure) to the next cache miss (failure) as fast as it can.


I think what wyager meant is that on a mispredicted branch you have a CPU pipeline already filled with data and partial results (scheduled loads, read/write register conflicts, etc.). Now that pipelines are close to 30 stages deep, flushing all those partial results because your `if` went down a different branch than usual can get costly. In practice, after a branch misprediction you'll have to wait at least [pipeline_stages] cycles until you see the next instruction complete. But if the scheduled instruction was something complex from SSE / AES-NI / etc., you may have to wait even longer.
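
The textbook way to see that flush cost (my sketch, not the parent's code) is to run the same filter loop over unsorted and then sorted data; the work is identical, but on unsorted data the branch mispredicts about half the time:

    // Same work, very different branch behavior. Note that with
    // optimizations enabled the compiler may emit a branchless cmov
    // here and hide the effect entirely.
    #include <algorithm>
    #include <cstdint>
    #include <random>
    #include <vector>

    std::uint64_t sum_big_values(const std::vector<std::uint32_t>& v) {
        std::uint64_t sum = 0;
        for (std::uint32_t x : v)
            if (x >= 128)   // unpredictable on random data
                sum += x;
        return sum;
    }

    int main() {
        std::vector<std::uint32_t> v(1 << 20);
        std::mt19937 gen{1};
        std::uniform_int_distribution<std::uint32_t> d(0, 255);
        for (auto& x : v) x = d(gen);
        volatile std::uint64_t unsorted = sum_big_values(v); // ~50% mispredictions
        std::sort(v.begin(), v.end());
        volatile std::uint64_t sorted = sum_big_values(v);   // near-perfect prediction
        (void)unsorted; (void)sorted;
    }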


The real thing we care about is "wasted processor cycles", as well as their source. We measure some sources (branch mispredictions, instruction cache thrashing, etc.) and some potential sources (e.g. cache misses). What we lack is a metric for how bad each instance is. Not every branch misprediction has the same cost. With cache misses the cost can be nearly zero or very high. It would be nice to be able to measure (or simulate, or estimate) the real magnitude of each problem.

As long as we don't have that, cache misses are a useful metric on their own, as long as one is aware of their caveats. Like most things, cache misses come in various shades of grey.


So you suggest cache misses corrected by an instruction-workload bias (which should again be biased by how "hot" the instructions remain)?

EvilOfCacheMiss = (TimeOfMemoryFetchInCycles - CyclesSpentDoingInstructions) / TimeOfMemoryFetchInCycles


More like TimeOfMemoryFetchInCycles - CyclesSpentDoingInstructions

But yeah, sounds like a useful metric to me.
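
To put made-up numbers on the normalized version: if a memory fetch takes 200 cycles and the core finds 120 cycles of independent work to do in the meantime, then

    EvilOfCacheMiss = (200 - 120) / 200 = 0.4

while a pure pointer chase with no overlappable work scores (200 - 0) / 200 = 1.0 and a fully hidden miss scores 0. The un-normalized difference is just the raw count of wasted cycles, 80 here.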


This discussion reminded me of a pretty cool paper that derives a formula for the average cost of a cache miss ("An Analysis of the Effects of Miss Clustering on the Cost of a Cache Miss").

In particular, see Equation 4: http://researcher.ibm.com/files/us-viji/miss-cluster.pdf


Good article. A few years ago I wrote a toy program which demonstrates cost of cache misses (or the power of hardware prefetchers). https://github.com/WojciechMula/toys/tree/master/cache_test


> toy program which demonstrates cost of cache misses

never realized 'or' is a keyword... :)


It reminds me of this article, where adding prefetching to attempt to tell the CPU where to go next for a linked-list operation actually had a slightly detrimental effect on performance: https://lwn.net/Articles/444336/


Cache impact of linked lists can be mitigated via cdr coding, originally developed to save RAM: https://en.wikipedia.org/wiki/CDR_coding


Very cool. Herb Sutter gave a talk [1] a couple years ago in which he advocated being mindful of the CPU prefetcher when dealing with large vectors, esp. vectors of pointers. Essentially as long as memory is being accessed in a predictable manner the pre-fetcher can keep the processor fed.

As an aside: because instructions are accessed mostly sequentially (branches excepted), I would intuit that data misses are much more common than instruction misses. If this is incorrect, I'd be interested in an explanation or further reading.

[1] https://channel9.msdn.com/Events/Build/2014/2-661
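
A sketch of the contrast (my example, not Sutter's): iterating a vector of values walks memory at a fixed stride the prefetcher can follow, while a vector of pointers adds a hop to wherever each element happens to live on the heap:

    #include <memory>
    #include <vector>

    // Sequential, fixed-stride access: prefetcher-friendly.
    long sum_values(const std::vector<int>& v) {
        long s = 0;
        for (int x : v) s += x;
        return s;
    }

    // One extra indirection per element, to scattered heap locations.
    long sum_pointers(const std::vector<std::unique_ptr<int>>& v) {
        long s = 0;
        for (const auto& p : v) s += *p;
        return s;
    }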


If you want to measure cache miss effects, you really should rewrite this code in C, so you know exactly what gets allocated in memory with what layout.

Without more details on data layout, you could be seeing the side-effect of aggressive pointer prefetching, which may only be possible with one layout but not another. Or there could be a fixed stride for some of the data allowing prefetchers to kick in. It's hard to tell and deserves some more experiments to isolate what is going on.


The only type used in the demos is vector<pair<int,int>>. Vectors are guaranteed to be contiguous in memory. I'm pretty sure pairs are too, but that's easy to test:

    #include "stdio.h"
    #include <utility>
    printf("%zu\n", sizeof(std::pair<int, int>));

    8
So it's the obvious layout you would see in C with an array of struct { int x; int y }.

The cling interpreter is good for this kind of thing.


20.2.2 Pairs ¶ 1 (ISO/IEC 14882:2003):

  The library provides a template for heterogeneous pairs of values. The library also provides a matching
  function template to simplify their construction.

      template <class T1, class T2>
      struct pair {
        typedef T1 first_type;
        typedef T2 second_type;
      
        T1 first;
        T2 second;
        pair();
        pair(const T1& x, const T2& y);
        template<class U, class V> pair(const pair<U, V> &p);
      };
      
tl;dr: It looks like the C++03 standard requires it to be equivalent to the plain C version in terms of layout, although that could involve padding between first and second if e.g. T1 = short and T2 = int, just like in the equivalent C code.


Is the comparison really fair? On a 64-bit system, the `std::unique_ptr<int>` takes 8 bytes, while a linked-list node takes at least 24 bytes (as it must hold previous and next pointers). The larger size means fewer values can be in the cache at any time, and that may be the real reason for the slowdown.
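
A quick check of the sizes involved (assuming a typical 64-bit ABI; the real std::list node type is an implementation detail, so the ListNode below only mimics its likely layout):

    #include <cstdio>
    #include <memory>

    // Hypothetical stand-in for a doubly-linked list node holding an int.
    struct ListNode {
        ListNode* prev;
        ListNode* next;
        int value;
    };

    int main() {
        std::printf("%zu\n", sizeof(std::unique_ptr<int>)); // 8 on typical 64-bit
        std::printf("%zu\n", sizeof(ListNode));             // 24, padding included
    }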


This is why the problem was simplified to using vector<pair<int,int>> for all the experiments, so there is no difference in data size.


Sets a good example to MEASURE things rather than just relying on "old wives' tales" of programming that everyone believes because everyone believes them. So many times, I've asked someone why they used some weird algorithm, why they wrote their own linked list, or why they unrolled that loop, and it's always "But it's faster!" Well, did you measure it to actually see if it's faster?

One nit: Please, please, stop perpetuating the use of the non-word "performant." I cringe every time I hear it (now even in person at work!) Using it just makes you sound dumb, and clearly the author is not dumb.


>> stop perpetuating the use of the non-word "performant." I cringe every time I hear it... Using it just makes you sound dumb, and clearly the author is not dumb.

You cringe at the use of a word that is now common in our vernacular? The fact that it's not listed in most dictionaries does not mean the word is not valid. Dictionaries aren't written by all-knowing gods who dictate our languages to us. They are sourced by people, and we frequently opt to add new words to the English language. The more we see people continuing to make use of the word "performant" to mean what we all mean when we use it, the more likely it becomes that we will simply officially adopt it into our dictionaries.

If you understand the meaning behind a "non-word", then the word is already a part of the language. Language is fluid, and the appearance of new words is normal. You can grip onto your dictionary like a bible, while looking down your nose at others who are flexible when change comes knocking. It doesn't make us look dumb; it just makes you appear pretentious.

Aside: "performant" is borrowed from French, which also has "efficace" as a direct translation for "efficient". So there is room for an additional word. ;)


Don't read Shakespeare. He made up gobs of useful words.


Or committed a lot of the vernacular to print, when that was something that wasn't done.


Changed to "high performance" :) I don't have a strong opinion myself, as English is not my primary language, but I appreciate the input.


Cool, and sorry for venting my pet annoyance on HN.


Complaining that ubiquitous jargon is a "non-word" makes you look... something.


Performant is totally a word!

In French. What do you have against French?


>Please, please, stop perpetuating the use of the non-word "performant."

Seriously it makes me cringe every time someone says it.


I think this is the reasoning behind preferring SOA over AOS.


I view SoA vs AoS (with SoA frequently being more performant) as a choice to maximise the efficiency of the cache, while this article is more about how reducing data dependency makes caches less important. Two different means to solve the same problem.


AOS == Array of Structures

SOA == Structure of Arrays


Oops, meant to reply to the sibling comment confusing the TLAs with web domain definitions.


How are Service-Oriented Architectures and... Architecture-Oriented Services(?) relevant to cache misses? There's more locality in the former, because each service is more specific?


Structure of arrays vs. array of structures, i.e. having one struct that contains several arrays as its members, vs. having one array that contains several structs as its elements. The former keeps like data together, so when you need to, for example, iterate over the position vectors of a bunch of objects in a 3D graphics program, all the position vectors for all the objects are tightly packed in the same array, increasing cache efficiency.
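
A sketch of the two layouts for that example (field names made up):

    #include <vector>

    // Array of structures: each object's fields sit together, so a loop
    // over positions also drags velocity data through the cache.
    struct ObjectAoS {
        float pos[3];
        float vel[3];
    };
    // std::vector<ObjectAoS> objects;

    // Structure of arrays: like data is packed together, so a loop over
    // positions touches only position data and uses each cache line fully.
    struct ObjectsSoA {
        std::vector<float> pos_x, pos_y, pos_z;
        std::vector<float> vel_x, vel_y, vel_z;
    };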


I've always called those "parallel arrays" --- and they were around long before Service Oriented Architectures: https://en.wikipedia.org/wiki/Parallel_array


The real trouble is the inability of chips to identify what Haskell could completely avoid: doing the same algorithm on the same input repeatedly. Hashing the input, skipping the work and saving the output, that would be a real optimization. Also, tools that give programmers data-oriented visual feedback on the OO clutter costs they impart on their work would help.

A flag that guarantees side-effect freedom for a set of operations and sub-operations would be great for the processor.
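
In software that idea is just memoization; a minimal sketch (the function and its names are illustrative, and this is only safe because the function is side-effect free):

    #include <cstdint>
    #include <unordered_map>

    // An expensive pure function, stand-in for "the same algorithm".
    std::uint64_t expensive(std::uint64_t x) {
        std::uint64_t r = x;
        for (int i = 0; i < 1000000; ++i)
            r = r * 6364136223846793005ULL + 1442695040888963407ULL;
        return r;
    }

    // Hash the input, skip the work and reuse the saved output when
    // the same input shows up again.
    std::uint64_t memoized(std::uint64_t x) {
        static std::unordered_map<std::uint64_t, std::uint64_t> cache;
        auto it = cache.find(x);
        if (it != cache.end()) return it->second;
        return cache[x] = expensive(x);
    }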


Hashing is expensive.


Not as expensive as a cache miss.




