> Has anyone quantified the penalty in using register renaming to allow for a split register file in high-performance implementations while allowing compatible low-power implementations with a single register file? Maybe the transistor savings are dwarfed by other concerns in a modern hardware implementation, but given RISC-V's academic goals, economy for FPGA implementation seems useful.
I'm not sure I fully get your point. A low-power implementation would have one physical register for each register in the instruction set (i.e. for RISC-V it's already split). If you have a unified file in the instruction set, then you're going to have fewer GPRs available (OK for RISC-V, as 16 is enough really), and you either need a unified file in your OoO implementation or you need to make integer->floating-point forwarding an exception (at which point you'd be better off just splitting 16/16).
It's rare to have one function that uses both a lot of integer operations and a lot of fp operations. My understanding is that there are two advantages to having split gp and fp register files: (1) integer/address calculations and fp operations don't contend with each other for register file read ports, and (2) you get twice as many visible (architectural) registers without taking up more bits in the instruction stream.
My understanding is that the designers of both POWER and AArch64 did a bunch of research before settling on 32 for the number of gp registers, and there are some workloads where 32 are really needed, so having a fixed split is suboptimal for some workloads. AArch64 isn't that old, and I'm not aware of major improvements in compiler register allocation algorithms since it was designed. Are you contending that ARM made a mistake in their analysis when expanding from 16 to 31/32 gp registers for AArch64?
So, my assumptions are (1) the optimal split between fp and gp depends on the workload, (2) splitting them has a cost in die area for low-power implementations, and (3) high-performance chips can use the register renaming hardware to get the extra register file read ports of split register files and effectively tailor the gp/fp split to the workload. You wouldn't need to make integer-fp forwarding an exception, though the forwarding might have latency similar to an L1 cache read. Presumably, the register renaming hardware would keep a single bit (or a small 2-3 bit saturating counter) per register to track its most recent usage, so that loaded values are likely to land in the optimal register file.
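To make the saturating-counter idea concrete, here is a minimal software sketch of it. Everything in it is an assumption for illustration: the class name `SteeringPredictor`, the 2-bit counter width, the "count up on fp use, down on integer use" policy, and the midpoint threshold are hypothetical and not taken from any real renamer design.

```python
class SteeringPredictor:
    """Toy model: one saturating counter per architectural register.

    Hypothetical policy: the counter moves up on each fp use and down on
    each integer use; a load targeting that register is steered to the fp
    file when the counter is in the upper half of its range.
    """

    def __init__(self, num_regs: int = 32, bits: int = 2) -> None:
        self.max_count = (1 << bits) - 1   # 3 for a 2-bit counter
        self.threshold = 1 << (bits - 1)   # 2 for a 2-bit counter
        self.counters = [0] * num_regs     # 0 means "strongly integer"

    def record_use(self, reg: int, is_fp: bool) -> None:
        """Update the counter for `reg` after an instruction reads it."""
        count = self.counters[reg]
        if is_fp:
            self.counters[reg] = min(count + 1, self.max_count)
        else:
            self.counters[reg] = max(count - 1, 0)

    def predict_fp(self, reg: int) -> bool:
        """Guess which physical file a load into `reg` should target."""
        return self.counters[reg] >= self.threshold
```

With a counter wider than one bit, a single stray integer use of a register that is normally used for fp does not immediately flip the prediction, which is the usual reason for preferring a saturating counter over a single "last use" bit.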
It just seems that if you're going for a brand new green field architecture design, you can get rid of the cost of a split register file for low-power implementations and get all of the benefits of a split register file in high-performance designs that are going to have register renaming hardware anyway.
Maybe I'm missing something with regard to your comment about integer-fp forwarding needing to be an exception instead of incurring slight latency.
I guess my comment about 16 registers being enough was more in the context of low power. In benchmarks I've seen, going from 8->16 is a big boost, but going from 16->32 is generally only a few %. In ARM64, 32 is definitely the right choice, but in low power 16 is probably going to be best, because with 32 your register file ends up taking up >50% of the die area, and having 32 prevents you from fitting instructions into 16 bits (three 5-bit register fields use 15 of your 16 bits; with 16 regs the three fields take 12 bits and you have 4 bits left over).
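The encoding arithmetic is easy to check. This is just a sketch of the bit counting, assuming a fixed-width instruction with three register fields and nothing else competing for space; the function names are made up for the example.

```python
def register_field_bits(num_registers: int) -> int:
    """Bits needed for one field that can name any of `num_registers`."""
    return (num_registers - 1).bit_length()


def bits_left_for_opcode(num_registers: int,
                         instruction_bits: int = 16,
                         operands: int = 3) -> int:
    """Bits remaining after encoding `operands` register fields."""
    return instruction_bits - operands * register_field_bits(num_registers)


for regs in (8, 16, 32):
    print(regs, "registers:", bits_left_for_opcode(regs), "bits left")
```

For 32 registers this leaves a single bit for the opcode in a 16-bit instruction, which is why three-operand 16-bit encodings are impractical there; 16 registers leave 4 bits.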
On the register file split I think you're generally right; it just comes down to what the penalties of a unified file actually look like. I think the loads are a cause of trouble: really you need two different types of loads rather than a predictor (low-power implementations could ignore that single bit). Beyond that, it's a matter of register file read ports: you are going to need a lot more of them, or fewer backends, or you get bottlenecked on read ports (remember, not everything gets forwarded; many operands come from instructions that have already retired).
One thing that's also relevant to the discussion: these days, when you add FP to a CPU you almost always add some kind of vector ops as well, and since vector registers are not the same length as the GPRs you can't have a unified file. If I were doing greenfield I would probably go for a split file, but then say low-power implementations shouldn't implement FP/vector at all (which is what RISC-V has done). Unfortunately, although this idea is popular with hardware designers, it is not popular with software developers, who are used to being able to drop a 'float' or 'double' into their C code and have it work (or, worse, may use a library that includes floats in its internals).