And this from "Intel® 64 and IA-32 Architectures Optimization Reference Manual":
Assembly/Compiler Coding Rule 2.
(M impact, ML generality): Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible.
* Do not do this for predictable branches.
* Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch).
* In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine.
* When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
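To make the rule concrete, here is a rough C sketch of the two shapes it is talking about. The function names are made up for illustration, and whether a compiler actually emits a jump, CMOV, or SETcc for any of these depends on the compiler, flags, and target, so check the generated assembly rather than trusting the source form.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: typically compiled to a compare plus conditional jump,
   which is cheap when the comparison is predictable. */
int32_t max_branchy(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless form: both inputs are already available and the result is
   selected with a mask. Compilers often lower this (or a plain ternary)
   to CMOV on x86, but that is not guaranteed. */
int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t take_a = -(int32_t)(a > b);   /* all ones if a > b, else 0 */
    return (a & take_a) | (b & ~take_a);
}

/* SETcc-style idiom: turn the comparison into a 0/1 value and add it,
   instead of branching around an increment. */
size_t count_above(const int32_t *v, size_t n, int32_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] > threshold);
    return count;
}
```

Note the trade the manual warns about: the branchless versions always pay for the select and make the result data-dependent on the comparison, so they only help when the branch they replace is genuinely hard to predict.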
This is one of my favorite Linus posts, because it gives you a great deal of insight into how a modern out-of-order pipeline works in the context of one specific instruction. It's unfortunate RWT has such a shitty search feature, but it's worth perusing their forums to read Linus's criticisms of IA-64. He points out that the extensive predication in IA-64 makes an out-of-order implementation much more complicated.
I never got why cmov was 2 µops (and thus 2 cycle latency) on Intel CPUs. On AMD (and modern ARM), it's 1 µop with 1 cycle latency and can be issued to any ALU. Which makes it a win for a single conditional mov in pretty much anything short of microbenchmarks with 100% predictable branches, as in Linus's test case.
Also setcc is abysmally stupid in leaving the high 3/7 bytes of the register unmodified - what were Intel's engineers smoking?
A lot of features of Intel CPUs can be explained by the fact that the Pentium Pro (and basically every Intel CPU after that until, I believe, Sandy Bridge) uses a basic architecture that supports reading only two input operands per instruction in a single cycle. CMOV has to read three: the flags register, the source register, and the old value of the destination register.
> Also setcc is abysmally stupid in leaving the high 3/7 bytes of the register unmodified - what were Intel's engineers smoking?
It makes sense in context. When the instruction was introduced, 8-bit flags were preferable to overwriting a whole register (remember you only have 7 general-purpose registers, and code was usually being optimized for size at that point). You can also do SETcc of the upper 8 bits of a 16-bit register, e.g. SETNZ AH.
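For anyone who hasn't run into the partial-write behavior, here is a small C model of it. This only simulates the register semantics with masks; the function names are hypothetical and a 32-bit register is assumed for brevity.

```c
#include <stdint.h>

/* Model of SETcc writing the low byte (e.g. AL): bits 8..31 keep
   whatever the register held before the instruction. */
uint32_t setcc_low8(uint32_t reg, int cond)
{
    return (reg & 0xFFFFFF00u) | (cond ? 1u : 0u);
}

/* Model of SETcc writing the second byte (e.g. AH): only bits 8..15
   change. */
uint32_t setcc_high8(uint32_t reg, int cond)
{
    return (reg & 0xFFFF00FFu) | ((cond ? 1u : 0u) << 8);
}

/* Usual workaround: start from a zeroed register so the stale upper
   bytes cannot leak into the result. */
uint32_t setcc_zeroed(int cond)
{
    return setcc_low8(0u, cond);
}
```

In real assembly the zeroing (for example `xor eax, eax`) has to come before the flag-setting instruction, since XOR itself overwrites the flags; the alternative is a MOVZX after the SETcc.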
It doesn't make sense now, true, but I've never seen it as a large annoyance; you can avoid using SETcc in favor of full-register instructions (excluding CMOV) most of the time anyway.
Intel has traditionally limited the number of inputs to a uop to 2, although in their more recent microarchitectures with macro-op fusion the fused uops can take 3 inputs. This is a tradeoff in the design of the frontend, since having fewer dependencies per uop simplifies the design and improves area, timing and power.
P4 micro-optimization. ... what the heck are people doing here that this is relevant to their interests???? :P
CMOV on current CPUs is quite fine when it doesn't make a dependency mess, when the alternative is a poorly predicted branch, and when the dual execution is cheap or can be hidden.
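A back-of-the-envelope version of that trade-off, as a C sketch. The function names and every number you feed it (misprediction rate, penalty in cycles, extra cycles for the branchless form) are assumptions you would have to measure on your own hardware; this is just the break-even arithmetic, not a model of any particular CPU.

```c
/* Expected extra cycles per execution of a branch:
   misprediction rate times the penalty for one mispredict. */
double expected_branch_cost(double mispredict_rate, double penalty_cycles)
{
    return mispredict_rate * penalty_cycles;
}

/* Returns 1 when the branchless form looks cheaper, i.e. when the extra
   work of computing both sides and selecting costs less than the
   expected misprediction cost of the branch it replaces. */
int prefer_cmov(double mispredict_rate, double penalty_cycles,
                double extra_cmov_cycles)
{
    return extra_cmov_cycles <
           expected_branch_cost(mispredict_rate, penalty_cycles);
}
```

With an assumed 15-cycle penalty and 2 extra cycles for the branchless path, a coin-flip branch (rate 0.5) comes out in favor of CMOV, while a branch mispredicted 1% of the time does not. It ignores the dependency-chain effect, which is exactly the part that tends to bite in practice.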