
> Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.

How do these manage to achieve the 3 cycle latency? If it has to go through a pipeline of 8 (or 64?) adders in a row (I don't know if it's actually that many! but I assume more than 3), how do you still manage to get just a 3 cycle latency, assuming a single adder has a latency of 1?

Also, given the trickery needed for the radix-8 multiply, where you get lucky with most values from 1-7 except 3 and 5, which use this specialized circuit, how many such specialized circuits do you need for a radix-256 multiply?



> assuming a single adder has a latency of 1?

Bad assumption. A single adder is only forced to take a full clock cycle if you register its output. Otherwise, data can flow through several stages in a single cycle, as (worst-case) propagation delay allows, before that set of stages hits a register that forces clock alignment.
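To make that concrete, here is a rough sketch of how the cycle count falls out of propagation delay rather than stage count. The function name `pipeline_latency` and all the delay numbers are made up for illustration, and the model ignores register setup time, clock-to-Q delay and skew:

```python
def pipeline_latency(stage_delays_ps, clock_period_ps):
    """Count the clock cycles a chain of combinational stages needs.

    Stages are packed greedily: a register boundary is placed only when
    adding the next stage would exceed the clock period.
    Illustrative model only; real timing closure is more involved.
    """
    cycles = 1
    elapsed = 0
    for delay in stage_delays_ps:
        if delay > clock_period_ps:
            raise ValueError("a single stage does not fit in one clock period")
        if elapsed + delay > clock_period_ps:
            # Next stage would miss timing: insert a register here.
            cycles += 1
            elapsed = 0
        elapsed += delay
    return cycles


# Hypothetical numbers: six 40 ps stages fit in one 250 ps cycle.
print(pipeline_latency([40] * 6, 250))   # 1
# Eight 100 ps stages need a register after every second stage.
print(pipeline_latency([100] * 8, 250))  # 4
```

So the number of adders in the chain only sets the latency once you know how many of them fit between two registers.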


It's not even adder plus register that takes a clock cycle. Chains of additions are so common that it's pretty much essential for any high performance CPU to run at the maximum clock speed that allows them to proceed at one per clock, so within the clock cycle you need the bypass path from the ALU pipe stage output back to a mux selecting either the bypass (previous instruction's result) or a register, then the adder, then the register.

Depending on the design there might be some other logic that requires a longer path in a single pipe stage, but you wouldn't want it to be too much longer.

Based on modern ISA design I suspect that a 64 bit barrel shifter -- with its 6 layers of muxes -- takes slightly longer than a 64 bit add, and everyone is taking the adder input from after the 2nd layer of shifter muxes, giving you single-cycle instructions to do `a + (b << {0,1,2,3})` either disguised as an LEA (x86, Arm) or as a specific `sh3add` etc instruction (RISC-V).
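A behavioural sketch of what I mean, assuming a plain logarithmic barrel shifter where layer k shifts by 2**k when bit k of the shift amount is set. The names `barrel_shift_left` and `shift_add` are mine, and this models the data flow only, not any particular core's implementation:

```python
MASK64 = (1 << 64) - 1


def barrel_shift_left(value, amount, layers=6):
    """Shift a 64-bit value left through up to six mux layers.

    Layer k either passes its input through or shifts it by 2**k,
    selected by bit k of the shift amount.
    """
    result = value & MASK64
    for k in range(layers):
        if (amount >> k) & 1:
            result = (result << (1 << k)) & MASK64
    return result


def shift_add(a, b, amount):
    """Compute a + (b << amount) for amount in 0..3.

    Only the first two mux layers are used, mimicking an adder that
    taps the shifter output after layer 2 (an assumption, see above).
    """
    if not 0 <= amount <= 3:
        raise ValueError("shift amount must be in 0..3")
    return (a + barrel_shift_left(b, amount, layers=2)) & MASK64


print(barrel_shift_left(3, 5))  # 96
print(shift_add(10, 3, 3))      # 34
```

The full shift walks all six layers; the shift-and-add form only ever needs the first two, which is why it could plausibly fit in the same cycle as the add.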


None of these combined shift-and-add instructions need a full barrel shifter, though, do they? Usually they’re selecting from 2-4 possible shift amounts, not 64 of them.


My suggestion was that all adds go through the first two (of six, on 64 bit) layers of the barrel shifter, not the whole barrel shifter.


Mostly. ARM actually has instructions that allow a full 64-bit shift then add.

But I checked the Cortex A78 optimisation manual. They take 1 cycle if the shift is 4 or less and 2 cycles in other cases.


0-4 shift not 0-3? That is a little bit weird.


Arm64 has fast 128-bit loads. Not just with NEON, but with regular integer instructions, you can quickly load 128 bits into a pair of 64-bit registers.

So it kind of makes sense to support a fast shift by four. Though it's more likely they just profiled a bunch of code and decided fast shifts by four were worth budgeting for.



