> Massive pipelined arrays of adders that can execute one 64-bit multiplication ...

addaon · 2025-03-03T15:21:49 1741015309

> assuming a single adder has a latency of 1?

Bad assumption. A single adder is only forced to take a full clock cycle if you register its output. Otherwise, data can flow through several stages in a single cycle, as (worst-case) propagation delay allows, before that set of stages hits a register that forces clock alignment.
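As a rough illustration of that point, here is a minimal sketch of the timing budget. All delay numbers are hypothetical and chosen only to make the arithmetic easy; real values depend on the process and the adder design. The function name `max_chained_stages` is made up for this example.

```python
def max_chained_stages(clock_period_ps, stage_delay_ps, register_overhead_ps):
    """How many combinational stages fit between two registers.

    register_overhead_ps stands in for setup time plus clock-to-output
    delay of the register (an assumed lump figure, not a measured one).
    """
    usable_ps = clock_period_ps - register_overhead_ps
    if usable_ps < stage_delay_ps:
        return 0  # not even one stage fits in the cycle
    return usable_ps // stage_delay_ps


# Hypothetical: 1000 ps clock, 150 ps worst-case per adder, 100 ps register overhead.
print(max_chained_stages(1000, 150, 100))  # 900 // 150 = 6
```

With these made-up numbers, six adders could sit back to back before a register is needed, so "one adder = one cycle" only holds if the designer chooses to register every adder output.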

brucehoult · 2025-03-04T23:40:29 1741131629

It's not even adder plus register that takes a clock cycle. Chains of additions are so common that it's pretty much essential for any high-performance CPU to run at the maximum clock speed that allows them to proceed at one per clock. So within the clock cycle you need the bypass path from the ALU pipe stage output back to a mux selecting either the bypass (previous instruction's result) or a register, then the adder, then the register.

Depending on the design there might be some other logic that requires a longer path in a single pipe stage, but you wouldn't want it to be too much longer.

Based on modern ISA design I suspect that a 64-bit barrel shifter -- with its 6 layers of muxes -- takes slightly longer than a 64-bit add, and everyone is taking the adder input from after the 2nd layer of shifter muxes, giving you single-cycle instructions to do `a + (b << {0,1,2,3})`, either disguised as an LEA (x86, Arm) or as a specific instruction such as `sh3add` (RISC-V).
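For readers who want the semantics spelled out, here is a minimal software model of `a + (b << k)` with 64-bit wraparound. The names `shift_add` and `element_address` are hypothetical helpers for this sketch, not anything from an ISA manual.

```python
MASK64 = (1 << 64) - 1  # keep results in 64 bits, like a 64-bit register


def shift_add(a, b, k):
    """Model of a + (b << k) for k in {0, 1, 2, 3}, wrapping at 64 bits."""
    if k not in (0, 1, 2, 3):
        raise ValueError("only shift amounts 0..3 are modelled here")
    return (a + (b << k)) & MASK64


def element_address(base, index, log2_size):
    """Typical use: address of element `index` in an array of 2**log2_size-byte items."""
    return shift_add(base, index, log2_size)


# Address of element 5 in an array of 8-byte values starting at 0x1000.
print(hex(element_address(0x1000, 5, 3)))  # 0x1028
```

With `k = 3` this corresponds to what a RISC-V `sh3add` computes; array indexing is the common reason such instructions exist.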

addaon · 2025-03-05T01:17:01 1741137421

None of these combined shift-and-add instructions need a full barrel shifter, though, do they? Usually they’re selecting from 2-4 possible shift amounts, not 64 of them.

brucehoult · 2025-03-05T22:43:31 1741214611

My suggestion was that all adds go through the first two (of six, on 64 bit) layers of the barrel shifter, not the whole barrel shifter.
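A minimal sketch of that idea, modelling the shifter as a chain of mux layers. It assumes layer `i` shifts by `2**i` and that the small-shift layers come first; a real design could order or structure the layers differently. The function names are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def barrel_shift_left(x, amount, layers=6):
    """Logarithmic left shifter: layer i either passes x through or shifts it by 2**i."""
    for i in range(layers):
        if (amount >> i) & 1:  # mux select bit for layer i
            x = (x << (1 << i)) & MASK64
    return x


def add_with_small_shift(a, b, amount):
    """Adder fed from the tap after the first two layers, so only shifts 0..3 apply."""
    return (a + barrel_shift_left(b, amount, layers=2)) & MASK64


print(barrel_shift_left(1, 37) == 1 << 37)  # True: all six layers give shifts 0..63
print(add_with_small_shift(10, 3, 2))       # 10 + (3 << 2) = 22
```

In this model a plain add is just the case `amount = 0`, where both mux layers pass the operand through unchanged.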

phire · 2025-03-05T01:48:55 1741139335

Mostly. ARM actually has instructions that allow a shift by any amount (0 to 63 on 64-bit registers) and then an add.

But I checked the Cortex A78 optimisation manual. They take 1 cycle if the shift is 4 or less and 2 cycles in other cases.

brucehoult · 2025-03-05T22:45:16 1741214716

0-4 shift not 0-3? That is a little bit weird.

phire · 2025-03-08T02:50:09 1741402209

Arm64 has fast 128-bit loads. Not just with NEON, but with regular integer instructions, you can quickly load 128 bits into a pair of 64-bit registers.

So it kind of makes sense to support a fast shift by four. Though, it's more likely they just profiled a bunch of code and decided that a fast shift by four was worth budgeting for.
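The connection, spelled out: a 128-bit element is 16 bytes, and 16 is `2**4`, so indexing an array of such elements scales the index by a shift of four. A minimal sketch, with `pair_address` as a hypothetical helper name:

```python
def pair_address(base, index):
    """Byte address of the index-th 128-bit (16-byte) element: base + index * 16."""
    return base + (index << 4)  # shift by four is the same as multiplying by 16


print(hex(pair_address(0x2000, 3)))  # 0x2030
```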