Hacker News | daemonwrangler's comments

Sort of. The physical register file has more registers than what's specified in the ISA in order to support out-of-order execution. Hardware maps the small number of ISA registers to the larger number of physical registers. All sorts of complex stuff happens in the hardware to make it all work out right (e.g., bypass logic between pipeline stages to make sure dependent instructions are able to use just produced data before it gets stored in the register file). All to give the hardware greater scheduling flexibility.

So yes, there are more registers than you'd think by looking at the ISA, but they aren't available to the compiler, which can limit the kinds of optimizations it can make. I think when AMD introduced x86-64, they only increased ISA registers from 8 to 16. RISC ISAs at the time were offering 32-64 (and some also had larger physical register files to support OoO). Granted these days there's also all of the vector registers for SSE/AVX, but you'll need to have vectorizable code to leverage those.

All that being said, I don't think cache is a replacement for the register file.
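To make the renaming concrete, here's a toy sketch (sizes, names, and the allocation policy are all invented for illustration) of how a register alias table maps a small ISA namespace onto a larger physical register file, so back-to-back writes to the same ISA register don't create false dependences:

```python
# Toy register renamer. Real hardware does this with content-addressable
# tables and handles freeing registers at retirement; none of that is here.
NUM_PHYS = 8                  # physical registers (more than the ISA exposes)
free = list(range(NUM_PHYS))  # free list of physical registers
rat = {}                      # register alias table: ISA reg -> physical reg

def rename_dest(isa_reg):
    phys = free.pop(0)        # every write gets a fresh physical register
    rat[isa_reg] = phys       # later readers of isa_reg see this mapping
    return phys

def rename_src(isa_reg):
    return rat[isa_reg]       # reads use the most recent mapping

# Two writes to "rax" land in distinct physical registers, removing the
# write-after-write hazard between them.
p0 = rename_dest("rax")
p1 = rename_dest("rax")
print(p0, p1, rename_src("rax"))  # 0 1 1
```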


Also simpler to decode. If you look at how much chip area goes to the frontend decoder in an x86 chip, that's a significant difference.


I don't think simpler to decode necessitates fewer instructions. If all your instructions put the bits that select registers in the same place, use the same way of specifying constants instead of registers, the same way of flagging that they operate on floating-point values, etc., then using 8 bits to select the instruction gives you 256 different instructions with limited added complexity.

Of course, it is unlikely that you can make all 256 instructions use the exact same format (some operations will have no use for 3 registers, for instance), but if you keep things as consistent as possible, decoding becomes easier. The price you pay is wasted instruction space, for example because the format still lets you encode writing the result of an operation to a register that is hard-wired to contain zero.
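As a toy example (the field layout here is invented, loosely RISC-like), with a fixed 32-bit format the decoder is just a handful of constant shifts and masks that are the same wires for every opcode:

```python
# Every instruction is one 32-bit word; register fields are always in the
# same bit positions, so extraction needs no length or prefix parsing.
def decode(word):
    return {
        "opcode": (word >> 24) & 0xFF,  # 8 bits -> up to 256 instructions
        "rd":     (word >> 19) & 0x1F,  # destination register, always here
        "rs1":    (word >> 14) & 0x1F,  # first source, always here
        "rs2":    (word >> 9)  & 0x1F,  # second source, always here
    }

word = (0x12 << 24) | (3 << 19) | (4 << 14) | (5 << 9)
print(decode(word))  # {'opcode': 18, 'rd': 3, 'rs1': 4, 'rs2': 5}
```

Contrast that with x86, where the decoder must first determine instruction length through prefixes and ModRM bytes before it even knows where the operand fields are.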

And x86 isn't that CISC-y. There were processors that basically could do a simple number to string conversion in one instruction, and there _are_ processors with an instruction that does Unicode conversions (http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/d...)


Can you use memory locations as operands in most instructions in the RISC ISAs these days? I always liked that RISC tended to have explicit Load and Store instructions to bring memory values into registers vs. being able to specify a register and a memory address as inputs to an instruction as you can in x86. Decoupling slow memory reads from the op that uses their values gives the CPU more flexibility in scheduling to try and hide the latency of memory ops.
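Here's a deliberately crude in-order timeline (all latencies and counts invented) showing the scheduling point: a separate load can be hoisted above independent work so the memory latency overlaps with it, while a fused memory operand starts its read only when the combined instruction issues. (On modern out-of-order x86 the fused form is cracked into µops anyway, so in practice the difference is mostly decode complexity, but the toy model shows why the decoupling helps a simple pipeline.)

```python
MEM_LATENCY = 12      # cycles until a load's value is ready (made up)
INDEPENDENT_OPS = 8   # instructions that don't need the loaded value

# Split load/use: issue the load at cycle 0, fill the wait with the
# independent instructions, then issue the dependent op.
def split_load_use():
    load_done = MEM_LATENCY                 # load issued at cycle 0
    independent_done = 1 + INDEPENDENT_OPS  # issued on cycles 1..8
    use_issued = max(load_done, independent_done)
    return use_issued + 1                   # the dependent op itself

# Fused memory operand: the read can't start until the combined
# instruction issues, after the independent work, so nothing overlaps.
def fused_mem_operand():
    start = INDEPENDENT_OPS                 # cycles 0..7: independent work
    return start + MEM_LATENCY + 1

print(split_load_use(), fused_mem_operand())  # 13 21
```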


Because it's easy to calculate? Too bad it's also utterly meaningless.


Something else to keep in mind is that you can get significant power savings when you lower the clock rate. So if you measure total power consumed to run a calculation, it may actually be more efficient to run on a fast CPU, finish quickly, and then drop into a low power state than it would be to run it on a low performance CPU for significantly longer.
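Back-of-the-envelope version of the race-to-idle argument, with entirely made-up numbers:

```python
# Energy = power * time, summed over the active and idle phases of a
# fixed observation window. All figures are illustrative, not measurements.
work = 1e9                              # "units" of computation to finish
fast = dict(rate=2e9,   active_w=10.0)  # fast core: high power, short burst
slow = dict(rate=0.5e9, active_w=4.0)   # slow core: low power, long burst
idle_w = 0.5                            # low-power state draw
window = 4.0                            # seconds we observe both systems

def energy(cpu):
    t_active = work / cpu["rate"]
    t_idle = window - t_active
    return cpu["active_w"] * t_active + idle_w * t_idle

print(energy(fast))  # 10*0.5 + 0.5*3.5 = 6.75 J
print(energy(slow))  # 4*2.0  + 0.5*2.0 = 9.0 J
```

Despite drawing 2.5x the active power, the fast core uses less total energy over the window because it spends most of it idling.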


This is the sort of factor that people forgot to include when testing SSDs for power/performance metrics in the early days, when they first came within reach of the average home user. An SSD (especially some of the older models) can pull more power than a good spinning-metal drive when running at full force, but what some people didn't factor in was that the SSDs did more in a given time, especially with latency-sensitive workloads; so to do the same work as the traditional drive the SSD needed to run at full pelt for far less time, meaning quite a saving in power.

Another thing modern CPUs do, as well as slowing down under light load, is turn parts of themselves almost completely off when not needed. Any CPU could potentially do these things, though; it isn't a difference between CISC and RISC designs.


I can't find a good reference now, but supposedly the i7 has a set of transistors that calculates whether its workload would execute faster on more cores or fewer, and can park cores to save heat and let the electricity be focused on the unparked cores.

Intel's marketing material in 2008 mentioned the number of transistors doing the load calculations was about equal to the number of transistors in a 486. So you have a 486 constantly determining thread scheduling load, they claimed.


You misunderstood. The CPU doesn't get to decide how many cores are used; the operating system's scheduler does. The CPU just tries to keep an accurate running estimate of its power consumption and uses that to predict whether it has enough headroom to boost the clock speed above the nominal full speed. If some cores are temporarily idled by the OS, then that frees up a lot of power and allows the remaining cores to have their clock speed boosted further.
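A crude model of that power-budget logic (all numbers invented, and the linear watts-per-GHz model is a big simplification; real chips have nonlinear frequency/voltage curves and per-SKU fused limits):

```python
# Share a fixed package power budget across active cores; fewer active
# cores leaves more headroom per core, so clocks can boost higher.
TDP = 65.0                 # watts, whole-package budget
BASE = 3.0                 # GHz, guaranteed all-core clock
MAX_BOOST = 4.2            # GHz, single-core ceiling
W_PER_GHZ_PER_CORE = 4.0   # assumed linear power model

def boosted_clock(active_cores):
    per_core_budget = TDP / active_cores
    f = per_core_budget / W_PER_GHZ_PER_CORE
    return max(BASE, min(MAX_BOOST, f))

for n in (1, 2, 4, 8):
    print(n, boosted_clock(n))  # 1 and 2 active -> 4.2; 8 active -> 3.0
```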


Intel's marketing materials helped me misunderstand. Unless the OS is leveraging that logic when it calculates which CPU to park.

Does any OS know that unparked CPU clock speeds might increase when they park a CPU?


The operating systems have plenty of knowledge about how CPU power management works. They are hampered somewhat by how things like Turbo Boost are implemented in a backwards-compatible way through ACPI P-states that can't directly convey this information, but it's still pretty straightforward for an OS to support even more complicated schemes like ARM's big.LITTLE.

The real problem is that the OS seldom has enough information about the software workload to know whether it is better run on all cores, or just a few at higher clocks. It falls to application developers to not spawn more worker threads than are necessary.
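For instance, in Python that just means sizing the pool to the machine instead of hard-coding a large count (the cap of 8 here is an arbitrary illustration, not a recommendation):

```python
# Size a worker pool to the hardware so surplus cores can stay parked.
import os
from concurrent.futures import ThreadPoolExecutor

workers = min(8, os.cpu_count() or 1)  # cap is arbitrary; tune per workload
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(pow, range(4), range(4)))
print(results)  # [1, 1, 4, 27]
```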


That may work under a synthetic workload where you know the beginning and end of the "heavy" load.

But I don't know if it holds up in real-life scenarios, in particular on multitasking platforms.

