What happened to the dedicated "times three" multiplier in later machines? Did some form of it stick around? Did they change tactics to something that made it obsolete?
You can observe the evolution of multipliers in the MIPS line, which I've been studying, so I happen to know.
The R3000A had the same Radix-8 (aka 3 bits per cycle) integer multiplier in 1989.
The R4000 had two Radix-8 multipliers in 1991. One for integer, one for floating point.
The R4400 was an upgraded R4000 in 1992. The integer multiplier was kept at Radix-8, but the FPU was upgraded to a Radix-256 design (aka 8 bits per cycle).
In parallel, MIPS spent a lot of time creating a low power design for the target market of "Windows NT laptops". The result was the R4200, released in 1993. MIPS published quite a bit of information about the various power saving optimisations [1], [2]. Instead of separate integer and floating point data paths, they created a unified data path that did both, allowing them to use the same Radix-8 multiplier for everything. They even unified the multiplier unit into the main adder, rather than using a separate adder like earlier designs.
In 1995, MIPS released the R4300i (aka the CPU found in the Nintendo 64). It was an evolution of the R4200, keeping the unified float/integer datapath. But it gained the Radix-256 multiplier from the R4400 design, so both integer and float instructions complete at 8 bits per cycle.
As far as I can tell, this Radix-256 multiplier doesn't use any fancy tricks. It's just an array of eight 64-bit wide carry-save adders, feeding into a regular carry-lookahead adder.
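For concreteness, here's a minimal Python model of that structure as I read it; the function names and framing are mine, not from any MIPS documentation. Eight carry-save adders each fold in one partial product per cycle, and one ordinary carry-propagate add resolves the result at the end:

```python
MASK64 = (1 << 64) - 1

def csa(a, b, c):
    """One carry-save adder: reduces three operands to a (sum, carry) pair."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK64, cy & MASK64

def multiply_8bits_per_cycle(x, y):
    """Multiply unsigned values (mod 2^64), retiring 8 multiplier bits per cycle."""
    acc_s, acc_c = 0, 0  # the running product, kept in carry-save form
    shift = 0
    while y:
        # This cycle's eight partial products, one per multiplier bit,
        # each folded in by one of the eight carry-save adders.
        for bit in range(8):
            if (y >> bit) & 1:
                acc_s, acc_c = csa(acc_s, acc_c, (x << (shift + bit)) & MASK64)
        y >>= 8
        shift += 8
    # One regular carry-propagate (e.g. carry-lookahead) add at the end.
    return (acc_s + acc_c) & MASK64

assert multiply_8bits_per_cycle(123456789, 987654321) == 123456789 * 987654321
```

(Real hardware keeps the full 128-bit HI/LO result; this toy only keeps the low 64 bits.)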
In 1996, MIPS released the R10000. Transistors were now cheap enough that they could implement a full 52-bit adder array for their floating point data path, allowing them to issue one double precision multiplication every cycle (though it's pipelined with a 2 cycle latency). I assume it's just 52 stacked adders, though it seems like they probably need to be doing something fancier with carries by the time it's that big.
Most modern CPUs have ended up at the same point. Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.
Yeah, you are right... I've misunderstood radix-8 multiplication, missed that this post was only talking about a small part of the Pentium's multiplier, and jumped to conclusions... And annoyingly, Hacker News doesn't allow you to edit comments after a few hours.
On the R3000/R4000/R4200, the 3-bits-per-cycle multipliers do use radix-8 multiplication, but they don't have a dedicated 3x multiplier. Instead, the 3x result is latched during the first cycle (by adding (x << 1) + x). For the remaining cycles it can do a 3-bit multiplication with nothing more than a bit of Booth recoding logic and a single 64-bit-wide adder.
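Here's a minimal Python sketch of that scheme, assuming I'm reading it right (the recoder boundaries and names are my own illustration):

```python
def booth_radix8_digits(y, nbits=64):
    """Booth-recode an unsigned value into radix-8 digits in -4..+4.
    Digit i covers bits 3i..3i+2 of y, plus the top bit of the previous group."""
    digits, prev = [], 0
    for i in range(0, nbits, 3):
        b0, b1, b2 = (y >> i) & 1, (y >> (i + 1)) & 1, (y >> (i + 2)) & 1
        digits.append(b0 + 2 * b1 - 4 * b2 + prev)
        prev = b2
    return digits

def multiply_radix8(x, y):
    """Retire 3 bits of y per iteration, with one addition per iteration."""
    x3 = (x << 1) + x  # the 3x multiple, computed once and "latched"
    multiples = {0: 0, 1: x, 2: x << 1, 3: x3, 4: x << 2}
    acc = 0
    for i, d in enumerate(booth_radix8_digits(y)):
        pp = multiples[abs(d)] << (3 * i)  # selected by muxes/shifts, no adder
        acc += pp if d >= 0 else -pp       # the single wide addition per cycle
    return acc

assert multiply_radix8(0xDEADBEEF, 0xCAFEBABE) == 0xDEADBEEF * 0xCAFEBABE
```

The point is that every Booth digit lands in -4..+4, so 1x, 2x and 4x are just shifts, negation is complement-plus-carry, and 3x is the only multiple that ever needs a real addition.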
Then MIPS entirely abandoned this radix-8 encoding for the 8-bits-per-cycle multiplier in the R4400 and R4300, replacing it with a simple array of binary carry-save adders. Probably because an array of base-2 adders is just much simpler. (Or at least that's what I think I can see on the R4300's die shot; I'm going to need to go back and take a much closer look at the multiplier.)
Anything I say about radix-256 in my first comment is probably nonsense; it's not radix-256 simply because it can do 8 bits in one cycle.
What I missed is that there is nothing limiting you to one radix-8 addition per cycle (like the early MIPS designs); you can combine the radix-8 encoding with an array of adders, and you only need a third of the adders that a base-2 multiplier would need. The entire point of using the radix-8 encoding is that there is only one carry every 3 bits.
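A toy demonstration of that count, reusing booth_radix8_digits() from the sketch above (here `d * x` stands in for a mux over the precomputed ±{0, x, 2x, 3x, 4x} multiples):

```python
x, y = 0x0123456789ABCDEF, 0xFEDCBA9876543210
pps = [d * x << (3 * i) for i, d in enumerate(booth_radix8_digits(y))]

print(len(pps))           # 22 partial products, vs 64 rows for a base-2 array
assert sum(pps) == x * y  # the array of adders just computes this sum
```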
You are probably right. This trick with the dedicated 3x multiplier is probably still used today.
> Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.
How do these manage to achieve the 3-cycle latency? If it has to go through a pipeline of 8 (or 64?) adders in a row (I don't know if it's actually that many! but I assume more than 3), how do you still manage to get just a 3-cycle latency, assuming a single adder has a latency of 1?
Also, given the trickery needed for the radix-8 multiply, where you get lucky with most values from 1-7 except 3 and 5, which use this specialized circuit, how many such specialized circuits do you need for a radix-256 multiply?
Bad assumption. A single adder is only forced to take a full clock cycle if you register its output. Otherwise, data can flow through several stages in a single cycle, as (worst-case) propagation delay allows, before that set of stages hits a register that forces clock alignment.
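As a back-of-envelope illustration (every number below is an assumption for the sake of the sketch, not a measurement of any real CPU):

```python
import math

rows = 22                 # partial products for a 64-bit radix-8 Booth multiply
levels, n = 0, rows
while n > 2:              # each carry-save level reduces n rows to ceil(2n/3)
    n = math.ceil(2 * n / 3)
    levels += 1
print(levels)             # 7 CSA levels before the final carry-propagate add

csa_delay_ps = 100        # assumed propagation delay per CSA level
cycle_ps = 300            # assumed clock period
print(levels * csa_delay_ps / cycle_ps)  # ~2.3 cycles of combinational delay
```

Put a register after every two or three levels and you get a 3-stage pipeline: several adder levels flow through within each clock, which is how that many levels of logic can still present as a 3-cycle latency.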
It's not even adder plus register that takes a clock cycle. Chains of additions are so common that it's pretty much essential for any high-performance CPU to run at the maximum clock speed that lets them proceed at one per clock. So within the clock cycle you need the bypass path from the ALU pipe-stage output back to a mux selecting either the bypass (the previous instruction's result) or a register, then the adder, then the register.
Depending on the design there might be some other logic that requires a longer path in a single pipe stage, but you wouldn't want it to be too much longer.
Based on modern ISA design I suspect that a 64 bit barrel shifter -- with its 6 layers of muxes -- takes slightly longer than a 64 bit add, and everyone is taking the adder input from after the 2nd layer of shifter muxes, giving you single-cycle instructions to do `a + (b << {0,1,2,3})` either disguised as an LEA (x86, Arm) or as a specific `sh3add` etc instruction (RISC-V).
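A sketch of those semantics in Python (`sh_add` is my own name for the idea, not a real mnemonic):

```python
MASK64 = (1 << 64) - 1

def sh_add(a, b, k):
    """Models x86 LEA's scaled index / AArch64 ADD with LSL / RISC-V shNadd."""
    assert 0 <= k <= 3  # only four shift amounts: two mux layers, not six
    return (a + (b << k)) & MASK64

# Typical use: the address of table[i] for 8-byte entries, in one instruction.
addr = sh_add(0x1000, 5, 3)  # 0x1000 + 5*8 = 0x1028
```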
None of these combined shift-and-add instructions need a full barrel shifter, though, do they? Usually they’re selecting from 2-4 possible shift amounts, not 64 of them.
Arm64 has fast 128-bit loads. Not just with NEON, but with regular integer instructions, you can quickly load 128 bits into a pair of 64-bit registers.
So it kind of makes sense to support a fast shift by four. Though it's more likely they just profiled a bunch of code and decided fast shifts by four were worth budgeting for.
This is brilliant enough to be included in Shirriff's books/printouts. I knew the R4400 was a big leap in FP, but had never heard of the R4200. I would have liked to see code compile on it.
Was there a difference in node size?
Like the PowerPC 603e that became the G3, it would have been great to see 4 of these R4200s as a quad-processor server.
As speed is paramount, there are many more things we will not learn about until the architecture is vintage.
It's sad that speculative execution became a vulnerability.
It's an interesting performance comparison: the R4200 is well under half the size and actually has a higher IPC than the R4400 on integer code... when it hits the L1 cache (and both have 16KB L1i caches).
The early R4400s and the R4200 were on the same node (I don't have a die area for the 0.6 μm R4400, but the 0.3 μm R4400 was 132 mm² and the 0.35 μm R4300 was 45 mm²).
But the R4400 could hit much higher clock speeds: it had an eight-stage pipeline, took three cycles to take any branch (the delay slot, then two flushed instructions), and its 64-bit shifter took 2 cycles. The R4200 was designed for much lower clock speeds; it's more or less a 5-stage "classic RISC" pipeline. Branches take 1 cycle (the delay slot) and 64-bit shifts happen in 1 cycle.
The other problem is when it doesn't hit cache. The R4400 had an on-chip controller and on-chip tags for an external 1MB secondary cache (aka L2), while the R4200 only had half the data cache (8KB vs 16KB) and no support at all for a secondary cache (not enough space for the IO pins). AFAIK, it doesn't even broadcast cache flushes on the bus, so attempting to bolt on an external cache controller would be problematic.
(Cache misses are especially painful on the N64, because the R4300 uses a cut-down 32-bit bus, and there is a lot of latency and bus contention on the way to main memory.)
And to kill your dreams of putting four of them in a single server, they removed all multi-processor support.
That last one feels semi-deliberate to me, like they didn't want their new low-power, low-cost laptop chip to compete with their more expensive chips. I don't think it would have taken that much effort to implement the full cache coherency protocol.
Because I do think the R4200 could have competed on certain memory-bound workloads (at least in cost-per-performance metrics).