
I modified his approach to get something that compiles to assembly closer to what one would expect regardless of the optimization level, and put my code up here: https://gist.github.com/nkurz/d64f5b4ded4e19e17aae

From source code like this:

  COMPILER_NO_INLINE uint64_t f3( uint64_t x ) {
    return f2( f2( x ) ) + 1;
  }
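For context, here's a minimal sketch of what the rest of the chain presumably looks like (the gist has the real version; f1's body below is just a placeholder guess, and COMPILER_NO_INLINE is assumed to expand to a no-inline attribute such as GCC/Clang's __attribute__((noinline))):

  #include <stdint.h>

  #define COMPILER_NO_INLINE __attribute__((noinline))   /* assumed definition */

  COMPILER_NO_INLINE uint64_t f1( uint64_t x ) {
    return x + 1;                  /* placeholder body; the real f1 is in the gist */
  }

  COMPILER_NO_INLINE uint64_t f2( uint64_t x ) {
    return f1( f1( x ) ) + 1;      /* matches the f2 disassembly below */
  }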
'gcc -O3' produced assembly like this, which took 3.67 cycles per call on both Haswell and Sandy Bridge:

  0000000000400720 <f2>:
  400720:       e8 db ff ff ff          callq  400700 <f1>
  400725:       48 89 c7                mov    %rax,%rdi
  400728:       e8 d3 ff ff ff          callq  400700 <f1>
  40072d:       48 83 c0 01             add    $0x1,%rax
  400731:       c3                      retq
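For reference, cycles-per-call numbers like these are typically measured with something along these lines (a sketch, not necessarily how the gist's harness does it; rdtsc counts reference cycles, so the numbers only line up with core cycles if the clock is pinned at nominal frequency):

  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>                  /* __rdtsc() on GCC/Clang */

  extern uint64_t f3( uint64_t x );       /* function under test */

  int main( void ) {
    const uint64_t iters = 100000000;
    uint64_t x = 0;
    uint64_t start = __rdtsc();
    for ( uint64_t i = 0; i < iters; i++ ) {
      x = f3( x );                        /* serial dependence keeps iterations ordered */
    }
    uint64_t stop = __rdtsc();
    /* This is cycles per f3() invocation; divide further by the number of
       nested calls per f3() if a per-call figure is wanted. */
    printf( "%.2f cycles per f3() call (x=%llu)\n",
            (double)( stop - start ) / iters, (unsigned long long)x );
    return 0;
  }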
'gcc -Os' was significantly slower, at 4.25 cycles per call on Haswell, 4.71 on Sandy Bridge:

  00000000004006f3 <f2>:
  4006f3:       e8 ea ff ff ff          callq  4006e2 <f1>
  4006f8:       48 89 c7                mov    %rax,%rdi
  4006fb:       e8 e2 ff ff ff          callq  4006e2 <f1>
  400700:       48 ff c0                inc    %rax
  400703:       c3                      retq
Since 'inc' shouldn't be any slower than 'add', I think this slowdown is either due to alignment or the greater density of calls in the code. Maybe this is a case where the denser code needs to use the legacy decoder instead of the µop cache?
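If it is alignment, one way to test that (hedged, since I haven't verified it recovers the -O3 numbers) would be to keep the -Os code generation but pad the function entry points back out, either with GCC/Clang's aligned attribute on the functions or with -falign-functions on the command line, and re-measure:

  /* Hypothetical experiment: same -Os code, but force each benchmarked
     function onto a 32-byte boundary (the µop cache on these chips tracks
     32-byte windows of code), then re-run the measurement. */
  COMPILER_NO_INLINE __attribute__((aligned(32))) uint64_t f2( uint64_t x ) {
    return f1( f1( x ) ) + 1;
  }
  /* Or, without touching the source:  gcc -Os -falign-functions=32 ... */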

'icc -O3' produced strange code that ran slightly slower, at 4.5 cycles per call on Haswell, and 4.01 cycles on Sandy Bridge:

  0000000000400c90 <f2>:
  400c90:       56                      push   %rsi
  400c91:       e8 1a 00 00 00          callq  400cb0 <f1>
  400c96:       48 89 c7                mov    %rax,%rdi
  400c99:       e8 12 00 00 00          callq  400cb0 <f1>
  400c9e:       48 ff c0                inc    %rax
  400ca1:       59                      pop    %rcx
  400ca2:       c3                      retq
Can anyone tell me what it's doing with those extra pushes and pops? I think they are just noise, but maybe they help with debugging or stack alignment?
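On the stack-alignment guess: the SysV x86-64 ABI wants %rsp 16-byte aligned at every call instruction, and on entry the pushed return address leaves it off by 8, so one push of any scratch register (paired with a pop of any scratch register before the ret) restores alignment for the nested calls. A tiny example that tends to reproduce the pattern with gcc or clang at -O2 (names here are made up):

  #include <stdint.h>

  uint64_t callee( uint64_t x );   /* extern, so the call can't be optimized away */

  /* A non-leaf function like this usually gets either a push/pop pair or a
     sub $8,%rsp / add $8,%rsp pair whose only job is to keep %rsp 16-byte
     aligned at the inner call site. */
  uint64_t wrapper( uint64_t x ) {
    return callee( x ) + 1;        /* the +1 prevents a tail call */
  }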

Clang also produces odd code with extra pushes and pops, but somehow achieves the same speed as GCC on Haswell despite this (3.67 cycles). On Sandy Bridge it was slower, at 4.5 cycles:

  00000000004005c0 <f2>:
  4005c0:       50                      push   %rax
  4005c1:       e8 da ff ff ff          callq  4005a0 <f1>
  4005c6:       48 89 c7                mov    %rax,%rdi
  4005c9:       e8 d2 ff ff ff          callq  4005a0 <f1>
  4005ce:       48 ff c0                inc    %rax
  4005d1:       5a                      pop    %rdx
  4005d2:       c3                      retq

