Hacker News
Call/Return Microbenchmarking and Call Frame Allocation (coloradocollege.edu)
21 points by luu on Nov 29, 2014 | 13 comments



Everyone knows that Python's performance is generally pretty bad, but that seems almost comical to me. How can it take more than 400 cycles just to do a procedure call and return?

There are all the instructions in the interpreter's main loop (which probably contains a bunch of call/returns of its own...), plus the cycles needed to read the instructions it's interpreting (including the manual management of the VM's call stack), then throw in a cache miss or two, and 400 cycles starts looking pretty good for an interpreted language.
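
To make that concrete, here's a minimal sketch of a bytecode dispatch loop in C (a made-up toy VM, not CPython's actual loop; the opcodes and frame layout are invented for illustration). Every VM-level call pays for an opcode fetch and decode, frame bookkeeping, and a branch back to the top of the loop before the callee's first bytecode even runs, and each of those steps is many machine instructions and potential cache misses:

  #include <stddef.h>
  #include <stdint.h>

  /* Toy VM: opcodes and frame layout are invented for illustration;
     nothing here is CPython's actual machinery. */
  enum { OP_CALL, OP_RETURN, OP_HALT };

  typedef struct { const uint8_t *return_pc; } Frame;

  uint64_t run(const uint8_t *code, Frame *frames, size_t max_depth) {
    const uint8_t *pc = code;
    size_t depth = 0;
    uint64_t dispatched = 0;
    for (;;) {
      dispatched++;
      switch (*pc++) {                         /* fetch + decode one opcode */
        case OP_CALL:
          if (depth == max_depth) return dispatched;
          frames[depth++].return_pc = pc + 1;  /* push the VM's call stack */
          pc = code + pc[0];                   /* jump to the callee's bytecode */
          break;
        case OP_RETURN:
          pc = frames[--depth].return_pc;      /* pop the VM's call stack */
          break;
        default:                               /* OP_HALT */
          return dispatched;
      }
    }
  }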

I'm of the opinion that if call/return time is taking up a significant portion of the total running time of a program, then the problem is with the program's structure and nothing is going to solve it better than fixing the code itself (e.g. don't break it up into so many little pieces that most of the functions don't do much more than call others; but don't inline too much, since that'll increase cache misses.) All the clever trickery with stack/heap allocation to bring down the overhead is interesting, but it can't ever beat the zero cycles of not doing a call at all. As the saying goes, "the fastest way to do something is to not do it at all."


> then the problem is with the program's structure

...or with the choice of programming language. Some styles go better with some programming languages than others. CPython is possibly the worst offender in terms of function call overhead, amplified further by its lack of inlining / macro capabilities.

PyPy makes it better, but were they ever fighting an uphill battle!

Python, in retrospect, seems to have been designed almost as a snarky challenge thrown at someone: "You think you are so smart, do you? Let me see how you make this go fast. No, I won't elide tail calls."


Why would inlining too much increase cache misses? I've never heard that.


Anything that increases executable size can cause more instruction fetches from memory, if only to bring in all those bonus instructions.

I'd say this is a pretty minor effect in every scenario I've dealt with. Haskell, in particular, lets you inline very easily (and so many people do if performance is important), and I've only heard of one time where that had a negative consequence, and that was mostly about a missing optimization that you had to fix in a non-removing-inlining kind of way (it had to do with polymorphic strings leading to code inlined at every single string literal, which is extremely silly).


Yep.

Note, however, that, as a special case, inlining of a function only called in one place "should" always be faster.


Although the numbers given seem close to correct on Haswell and Sandy Bridge (about 4 cycles of overhead per function call), it's almost impossible to know whether this was by luck or by good design. It's always dangerous to presume that the code you write in C has any bearing on the actual assembly executed by the processor.

Lest one think this is an idle warning, when I tried to verify his results on my machine I found that even when I prevented the inlining of the functions, "gcc-4.9 -O3" persistently optimized away the second function call by generating code like this:

  f2:
    callq  400700 <tree1>
    mov    %rax,%rdi
    jmpq   400700 <tree1>

It called the function as expected the first time, reusing the first argument in %rdi that it was called with. It moved the return value from that first call from %rax to %rdi as expected. But then rather than issuing a second call and creating a new stack frame, the compiler realized that since it was just returning the value without further processing, it could "tail call optimize" by jumping directly into the called function and letting it return directly to the top level.

So unless you carefully verify the generated assembly, you might end up measuring only half the calls you would expect. In this case it might not matter much (jmp and call are about equal in cost), but it's no longer actually measuring call/return overhead as the title says. Using '-O1' gave me code closer to what one would expect.


I modified his approach to get something that compiles to assembly closer to what one would expect regardless of the optimization level, and put my code up here: https://gist.github.com/nkurz/d64f5b4ded4e19e17aae
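
(For anyone who doesn't want to open the gist: COMPILER_NO_INLINE is a wrapper that keeps the compiler from inlining the function; with GCC/Clang that would be something like the sketch below, though the exact definition is in the gist.)

  /* Assumed definition -- check the gist for the real one. */
  #if defined(__GNUC__) || defined(__clang__)
  #define COMPILER_NO_INLINE __attribute__((noinline))
  #else
  #define COMPILER_NO_INLINE
  #endif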

From source code like this:

  COMPILER_NO_INLINE uint64_t f3( uint64_t x ) {
    return f2( f2( x ) ) + 1;
  }
'gcc -O3' took 3.67 cycles per call from assembly like this on both Haswell and Sandy Bridge:

  0000000000400720 <f2>:
  400720:       e8 db ff ff ff          callq  400700 <f1>
  400725:       48 89 c7                mov    %rax,%rdi
  400728:       e8 d3 ff ff ff          callq  400700 <f1>
  40072d:       48 83 c0 01             add    $0x1,%rax
  400731:       c3                      retq
'gcc -Os' was significantly slower, at 4.25 cycles per call on Haswell, 4.71 on Sandy Bridge:

  00000000004006f3 <f2>:
  4006f3:       e8 ea ff ff ff          callq  4006e2 <f1>
  4006f8:       48 89 c7                mov    %rax,%rdi
  4006fb:       e8 e2 ff ff ff          callq  4006e2 <f1>
  400700:       48 ff c0                inc    %rax
  400703:       c3                      retq
Since 'inc' shouldn't be any slower than 'add', I think this slowdown is due either to alignment or to the greater density of calls in the code. Maybe this is a case where the denser code needs to use the legacy decoder instead of the µop cache?

'icc -O3' produced strange code that ran slightly slower, at 4.5 cycles per call on Haswell, and 4.01 cycles on Sandy Bridge:

  0000000000400c90 <f2>:
  400c90:       56                      push   %rsi
  400c91:       e8 1a 00 00 00          callq  400cb0 <f1>
  400c96:       48 89 c7                mov    %rax,%rdi
  400c99:       e8 12 00 00 00          callq  400cb0 <f1>
  400c9e:       48 ff c0                inc    %rax
  400ca1:       59                      pop    %rcx
  400ca2:       c3                      retq
Can anyone tell me what it's doing with those extra pushes and pops? I think they are just noise, but maybe they help with debugging or stack alignment?

Clang also produces odd code with extra pushes and pops, but somehow achieves the same speed as GCC on Haswell despite this (3.67 cycles). Speed on Sandy Bridge was slow at 4.5 cycles:

  00000000004005c0 <f2>:
  4005c0:       50                      push   %rax
  4005c1:       e8 da ff ff ff          callq  4005a0 <f1>
  4005c6:       48 89 c7                mov    %rax,%rdi
  4005c9:       e8 d2 ff ff ff          callq  4005a0 <f1>
  4005ce:       48 ff c0                inc    %rax
  4005d1:       5a                      pop    %rdx
  4005d2:       c3                      retq


Sort of OT, but I've been writing some malloc-less C lately, where most/all memory is on the stack, and I've been wishing for a way to control the space with a function argument, for example:

    void foo(int x) {
      char c[x]; 
      // . . .
    }
I'm pretty sure there is no way to do that in C, but I'm wondering if a bit of (Intel) assembly could give me something reusable from a C context. Any ideas?


Doesn't `alloca` do exactly what you want? http://man7.org/linux/man-pages/man3/alloca.3.html


Yes, and C99 now supports variable length arrays anyway. The storage gets allocated (deallocated) as control enters (exits) the scope. Jumping into the scope is, I think, not allowed; I don't know whether that gets checked at compile time or results in a run-time error. alloca is officially not portable, but it's present on the platforms I care about. It's living dangerously, sure, but quite effective nonetheless.
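
A minimal sketch of both options (names are just for illustration; <alloca.h> is where glibc keeps alloca, other platforms may differ):

  #include <alloca.h>   /* alloca(): non-standard but widely available */
  #include <string.h>

  void with_vla(int x) {
    char c[x];                    /* C99 VLA: released when this block exits */
    memset(c, 0, (size_t)x);
    /* ... */
  }

  void with_alloca(int x) {
    char *c = alloca((size_t)x);  /* released only when the function returns */
    memset(c, 0, (size_t)x);
    /* ... */
  }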

Would have loved a portable way to query how much stack space the process has left. The problem is that some architectures need not even have a stack.

@dbaupp Hey thanks! And nice to know.


It's a compile time error, e.g.

  void foo(int x) {
    goto label;
    {
        char y[x];
  label:
        y[0] = 'a';
    }
  }

  alloca-goto.c: In function ‘foo’:
  alloca-goto.c:2:5: error: jump into scope of identifier with variably modified type
       goto label;
       ^
  alloca-goto.c:5:1: note: label ‘label’ defined here
   label:
   ^
  alloca-goto.c:4:14: note: ‘y’ declared here
           char y[x];
                ^


out of curiosity i tried this with javascript. seems like chrome is a lot faster than firefox. ~6 times with f25, but f25 is a bit too coarse. jsperf (which isn't really the right tool for this job anyway) identified me as a spammer while i was adjusting the test cases though. with lower f's the difference would be even higher.

i tried it out in the browser, sum of fN(N) from [0..1000[ yields a ~2-3 times difference.


correction: it was fM(N) where N=[0..1000[ and constant M=20 (i think; 20 didn't take too long for 1000 iterations). results were added up and printed at the end and outside the benchmarking loop.



