
In that example, CPU B's adder can also be clocked twice as fast. If so, it's getting twice the work done and using twice the power (ignoring cache misses and the like for the moment). If it's clocked the same as A, its performance and power usage will be almost the same as A's.

Roughly speaking, power used is proportional to transistors switching per unit time. Performance should track that pretty closely too, depending on the efficiency of the design. At some level, you should be able to look at any instruction and find a corresponding number of transistors that need to switch for it to execute.
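To put a rough formula on that: dynamic power goes roughly as switching activity x capacitance x voltage^2 x clock frequency. Here's a tiny back-of-envelope sketch of the adder comparison; every number in it is a made-up placeholder for illustration, not real chip data:

    # Rough dynamic-power model: P ~ activity * capacitance * V^2 * f.
    # All values below are made-up placeholders, purely for illustration.
    def dynamic_power(switching_activity, capacitance_f, voltage_v, clock_hz):
        return switching_activity * capacitance_f * voltage_v ** 2 * clock_hz

    # CPU A's adder vs. CPU B's adder clocked twice as fast, everything else equal:
    a = dynamic_power(0.2, 1e-9, 1.0, 2e9)
    b = dynamic_power(0.2, 1e-9, 1.0, 4e9)
    print(b / a)  # -> 2.0: twice the clock means twice the switches per second, so twice the power

Clocked the same, a and b come out identical, which is the "almost the same as A" case.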

Deep pipelining keeps more silicon active at any given time, increasing both performance and power consumption. Because of cache misses and the like, efficiency will drop somewhat, and doubling the stages doesn't quite double the switches per unit time either, for various reasons. Since the extra pipeline registers cost relatively little die area compared to the throughput they buy, deeper pipelines = worse performance per watt but better performance per dollar (not sure how well that'll hold in ridiculous cases like Prescott).
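To make the perf-per-watt vs. perf-per-dollar trade-off concrete, here's a toy model; every constant and the pipeline_stats helper are assumptions I made up for illustration, not data from any real design. Frequency rises with depth until latch overhead dominates, IPC falls as flush penalties grow with depth, power tracks switching plus latch overhead, and area grows only a little from the extra pipeline registers:

    # Toy model of pipeline depth vs. performance, power, and area.
    # Every constant here is an illustrative assumption, not measured chip data.
    def pipeline_stats(depth,
                       logic_delay=10.0,      # total combinational delay, arbitrary units
                       latch_delay=0.2,       # per-stage latch/flop overhead
                       base_ipc=2.0,          # IPC with no pipeline hazards
                       mispredict_cost=0.05,  # IPC loss per stage of flush penalty
                       latch_power=0.05,      # extra switching per stage of latches
                       base_area=100.0,
                       latch_area=2.0):
        freq = 1.0 / (logic_delay / depth + latch_delay)
        ipc = base_ipc / (1.0 + mispredict_cost * depth)  # deeper pipe -> costlier flushes
        perf = freq * ipc
        power = freq * (1.0 + latch_power * depth)        # more latches switching per cycle
        area = base_area + latch_area * depth
        return perf, perf / power, perf / area

    for depth in (7, 14, 28):
        perf, per_watt, per_area = pipeline_stats(depth)
        print(f"depth={depth:2d}  perf={perf:.2f}  perf/watt={per_watt:.2f}  perf/area={per_area:.4f}")

With these made-up numbers, perf/watt falls steadily as the pipeline deepens, while perf/area improves from 7 to 14 stages and then turns back down by 28, which is roughly the Prescott caveat.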

From what I've heard, Bulldozer has only one more pipeline stage than Haswell (15 vs. 14, don't quote me on that) - not nearly enough to account for the differences we see between them.

What I'm noting is that there are many, many more factors at play than just pipelining. In the case of Bulldozer, I've been hearing quite a bit about minor parts that they found needed more work, most notably branch prediction. It sounds like they've got lots of things that will improve performance with no power or die size downsides. The number I saw bandied about for Steamroller was a 30% performance increase. I have some trouble believing it's quite that big, but if they pull it off, that will be an amazing chip for being 32nm. It hints to me that the macroscale architecture is A-OK, and they just screwed up some small but important things.


