CMOV a Bad Idea on Out-of-Order CPUs

raymondh · on July 15, 2013

And this from "Intel® 64 and IA-32 Architectures Optimization Reference Manual":

Assembly/Compiler Coding Rule 2. (M impact, ML generality): Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible.

* Do not do this for predictable branches.

* Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch).

* In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine.

* When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
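
As a rough illustration of the transformation the rule describes, here is a minimal C sketch. The function names are made up for this example, and whether a compiler actually emits a branch, SETcc, or CMOV for either form depends on the compiler, flags, and target.

```c
#include <stdint.h>

/* Branchy form: the compiler may emit a conditional jump here,
 * which is cheap when the outcome is predictable. */
uint32_t max_branchy(uint32_t a, uint32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless form: the comparison result (the 0/1 value SETcc would
 * produce) is widened into an all-zeros or all-ones mask, and the mask
 * selects one operand. Both operands are always computed, which is the
 * "executing both paths" overhead the manual warns about. */
uint32_t max_branchless(uint32_t a, uint32_t b)
{
    uint32_t take_a = (uint32_t)(a > b);  /* 0 or 1 */
    uint32_t mask = 0u - take_a;          /* 0x00000000 or 0xFFFFFFFF */
    return (a & mask) | (b & ~mask);
}
```

The result of `max_branchless` depends on both `a` and `b` every time, which is the control-dependence-for-data-dependence trade mentioned in the third bullet.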

rayiner · on July 15, 2013

This is one of my favorite Linus posts, because it gives you a great deal of insight into how a modern out-of-order pipeline works in the context of one specific instruction. It's unfortunate RWT has such a shitty search feature, but it's worth perusing their forums to read Linus's criticisms of IA-64. He points out that the extensive predication in IA-64 makes an out-of-order implementation much more complicated.

brigade · on July 15, 2013

I never got why cmov was 2 µops (and thus 2 cycle latency) on Intel CPUs. On AMD (and modern ARM), it's 1 µop with 1 cycle latency and can be issued to any ALU. Which makes it a win for a single conditional mov in pretty much anything short of microbenchmarks with 100% predictable branches, as in Linus's test case.

Also setcc is abysmally stupid in leaving the high 3/7 bytes of the register unmodified - what were Intel's engineers smoking?

rayiner · on July 15, 2013

A lot of features of Intel CPUs can be explained by the fact that the Pentium Pro (and basically every Intel CPU after that until, I believe, Sandy Bridge) uses a basic architecture that supports reading only two input operands for each instruction in a single cycle. CMOV has to read the flags register, the source register, and the old value of the destination register.

See: http://newsgroups.derkeiler.com/Archive/Comp/comp.arch/2013-....
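
To make the operand count concrete, here is a tiny C model of what a register-to-register CMOV computes. The name `cmov_model` is hypothetical and this only models the architectural result, not how any particular core splits the work into µops.

```c
#include <stdint.h>

/* Model of "CMOVcc dst, src": the result needs three inputs,
 * the condition (from flags), the source, and the old destination
 * value, because the destination keeps its old value when the
 * condition is false. */
uint64_t cmov_model(uint64_t dst_old, uint64_t src, int cond)
{
    return cond ? src : dst_old;
}
```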

pbsd · on July 15, 2013

> Also setcc is abysmally stupid in leaving the high 3/7 bytes of the register unmodified - what were Intel's engineers smoking?

It makes sense in context. When the instruction was introduced, it made sense to have 8-bit flags instead of overwriting a whole register (remember, you only had 7 general-purpose registers, and code was usually being optimized for size at that point). You can also do SETcc on the upper 8 bits of a 16-bit register, e.g. SETNZ AH.

It doesn't make sense now, true, but I've never seen it as a large annoyance; you can avoid using SETcc in favor of full-register instructions (excluding CMOV) most of the time anyway.
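
For anyone who hasn't run into this, here is a small C model of the partial-register behavior being discussed. The function names are invented for illustration; the second function models the common idiom of zeroing the full register before the SETcc so that no stale upper bytes remain.

```c
#include <stdint.h>

/* Model of SETcc writing an 8-bit low register (e.g. AL):
 * only bits 0..7 change, the upper 7 bytes of the 64-bit
 * register keep whatever they held before. */
uint64_t setcc_low8(uint64_t reg, int cond)
{
    return (reg & ~(uint64_t)0xFF) | (uint64_t)(cond != 0);
}

/* Model of SETcc writing a high-byte register (e.g. AH):
 * only bits 8..15 change. */
uint64_t setcc_high8(uint64_t reg, int cond)
{
    return (reg & ~(uint64_t)0xFF00) | ((uint64_t)(cond != 0) << 8);
}

/* Model of zeroing the register first and then doing SETcc,
 * which yields a clean 0 or 1 in the full register. */
uint64_t zero_then_setcc(int cond)
{
    return setcc_low8(0, cond);
}
```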

cwzwarich · on July 15, 2013

Intel has traditionally limited the number of inputs to a uop to 2, although in their more recent microarchitectures with macro-op fusion the fused uops can take 3 inputs. This is a tradeoff in the design of the frontend, since having fewer dependencies per uop simplifies the design and improves area, timing and power.

zurn · on July 15, 2013

Note that Intel added CMOV when they were already on out-of-order CPUs.

The link doesn't show what the context of the discussion is, but its intended application is the case where Linus says it does work:

" - if you KNOW the branch is totally unpredictable, cmov is often good for performance."

You see these kinds of data-dependent branches in, e.g., compression code, where the unpredictability is inherent.
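
A minimal sketch of that kind of data-dependent branch, assuming a made-up task (counting bytes below a threshold) rather than any real compression codebase. On random input the `if` in the first loop is close to a coin flip; the second loop accumulates the comparison result instead. A compiler may well turn either loop into the other form, or vectorize both.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy: the branch outcome depends on the data itself, so on
 * random-looking input the predictor has little to work with. */
size_t count_below_branchy(const uint8_t *data, size_t n, uint8_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < threshold)
            count++;
    }
    return count;
}

/* Branchless: add the 0/1 comparison result directly, so there is
 * no data-dependent control flow inside the loop body. */
size_t count_below_branchless(const uint8_t *data, size_t n, uint8_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(data[i] < threshold);
    return count;
}
```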

pjdc · on July 15, 2013

Context: http://thread.gmane.org/gmane.linux.kernel/480224/focus=4803...

nullc · on July 15, 2013

P4 micro-optimization. ... what the heck are people doing here that this is relevant to their interests???? :P

CMOV on current CPUs is quite fine when it doesn't make a dependency mess, when the alternative is a poorly predicted branch, and when the dual execution is cheap / can be hidden.
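
A back-of-the-envelope version of that tradeoff, as a hedged C sketch. The function names and the simple "miss rate times penalty" model are assumptions for illustration; real numbers vary by microarchitecture and this ignores the dependency-chain effects mentioned above.

```c
/* Expected cycles lost per execution of a branch under a very
 * simple model: probability of a mispredict times its penalty. */
double expected_branch_cost(double miss_rate, double miss_penalty_cycles)
{
    return miss_rate * miss_penalty_cycles;
}

/* Returns 1 if the branchless form looks cheaper under this model,
 * i.e. its extra computation costs less than the expected
 * misprediction cost of the branch it replaces. */
int prefer_branchless(double miss_rate, double miss_penalty_cycles,
                      double extra_cycles)
{
    return extra_cycles < expected_branch_cost(miss_rate, miss_penalty_cycles);
}
```

With an assumed 15-cycle penalty and 2 extra cycles for the branchless form, `prefer_branchless(0.5, 15.0, 2.0)` says yes for a coin-flip branch, while `prefer_branchless(0.01, 15.0, 2.0)` says no for a branch that is predicted correctly 99% of the time.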

oofabz · on July 15, 2013

The article was written in 2007, when Intel's latest CPU was the Core 2 Duo. Lots of people still had P4s back then.