I modified his approach to get something that would compile to assembly closer to what one would expect regardless of the optimization level, and put my code up here: https://gist.github.com/nkurz/d64f5b4ded4e19e17aae
From source code like this:
COMPILER_NO_INLINE uint64_t f3( uint64_t x ) {
    return f2( f2( x ) ) + 1;
}
'gcc -O3' took 3.67 cycles per call from assembly like this on both Haswell and Sandy Bridge:
'gcc -Os' was significantly slower, at 4.25 cycles per call on Haswell, 4.71 on Sandy Bridge:
Since 'inc' shouldn't be any slower than 'add', I think this slowdown is due to either alignment or the greater density of calls in the code. Maybe this is a case where the denser code needs to use the legacy decoder instead of the µop cache?
'icc -O3' produced strange code that ran slightly slower, at 4.5 cycles per call on Haswell, and 4.01 cycles on Sandy Bridge:
Can anyone tell me what it's doing with those extra pushes and pops? I think they are just noise, but maybe they help with debugging or stack alignment?
Clang also produces odd code with extra pushes and pops, but somehow achieves the same speed as GCC on Haswell despite this (3.67 cycles). It was slower on Sandy Bridge, at 4.5 cycles:
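For anyone who wants to poke at this without pulling the gist, here is a rough sketch of the kind of harness involved. It is not the code from the gist. The definition of COMPILER_NO_INLINE, the leaf version of f2, and the helper ns_per_f3_call are all my assumptions for illustration, and it reports wall-clock nanoseconds rather than cycles, so you would have to multiply by your clock frequency (with turbo disabled) to get numbers comparable to the ones above.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std= modes */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumption: the gist's macro expands to something like this on GCC, Clang, and ICC. */
#define COMPILER_NO_INLINE __attribute__((noinline))

/* Hypothetical leaf function; the real f2 in the gist may be defined differently. */
COMPILER_NO_INLINE uint64_t f2(uint64_t x) {
    return x + 1;
}

/* Same shape as the function quoted above. */
COMPILER_NO_INLINE uint64_t f3(uint64_t x) {
    return f2(f2(x)) + 1;
}

/* Hypothetical helper: average wall-clock nanoseconds per call of f3.
 * The final value is written to *result_out so the loop cannot be discarded. */
double ns_per_f3_call(uint64_t iters, uint64_t *result_out) {
    struct timespec start, end;
    uint64_t x = 0;

    assert(iters > 0);  /* avoid dividing by zero below */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iters; i++) {
        x = f3(x);  /* feed the result back in, so each call depends on the last */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    *result_out = x;
    double ns = (double)(end.tv_sec - start.tv_sec) * 1e9
              + (double)(end.tv_nsec - start.tv_nsec);
    return ns / (double)iters;
}
```

Calling ns_per_f3_call with a large iteration count (say 100 million) and printing the result is enough to compare 'gcc -O3' against 'gcc -Os'. Note that chaining the result through x measures call latency rather than throughput, which may or may not match what the gist measures, and it is worth checking the generated assembly to confirm the calls were actually kept.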