I think that packed SIMD is better in almost every respect, and vector SIMD is worse.
With vector SIMD you don't know the register size beforehand, so you have to maintain and increment counters, adding extra instructions that reduce overall performance. With packed SIMD you can issue several loads immediately, with no dependencies between them; if you look at code examples, the x86 code is denser and uses a sequence of unrolled SIMD instructions with no bookkeeping at all. The RISC-V loop, by contrast, has 4 SIMD instructions plus 4 counter-handling instructions per iteration, i.e. it wastes 50% of its instruction issue bandwidth, and you cannot load the next block until the counter has been incremented.
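To make the contrast concrete, here is roughly what the two loop shapes look like in C intrinsics - a minimal sketch, assuming SSE2 on the x86 side and the RVV 1.0 intrinsics on the RISC-V side; the function names are made up, and the x86 version assumes n is a multiple of 16 so I can skip tail handling:

    #include <stdint.h>
    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2 */

    /* Packed SIMD: the register is known to be 128 bits, so all four
       load offsets are compile-time immediates and the four loads have
       no dependencies on each other. */
    void add1_sse2(int32_t *dst, const int32_t *src, size_t n)
    {
        const __m128i one = _mm_set1_epi32(1);
        for (size_t i = 0; i < n; i += 16) {
            __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
            __m128i b = _mm_loadu_si128((const __m128i *)(src + i + 4));
            __m128i c = _mm_loadu_si128((const __m128i *)(src + i + 8));
            __m128i d = _mm_loadu_si128((const __m128i *)(src + i + 12));
            _mm_storeu_si128((__m128i *)(dst + i),      _mm_add_epi32(a, one));
            _mm_storeu_si128((__m128i *)(dst + i + 4),  _mm_add_epi32(b, one));
            _mm_storeu_si128((__m128i *)(dst + i + 8),  _mm_add_epi32(c, one));
            _mm_storeu_si128((__m128i *)(dst + i + 12), _mm_add_epi32(d, one));
        }
    }

versus the strip-mined vector version, where every iteration asks the hardware for a vl and bumps the pointers by that dynamic amount:

    #include <stdint.h>
    #include <stddef.h>
    #include <riscv_vector.h>   /* RVV 1.0 intrinsics */

    /* Vector SIMD: register width is unknown at compile time, so the
       loop carries vl/pointer updates between iterations. */
    void add1_rvv(int32_t *dst, const int32_t *src, size_t n)
    {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m1(n);
            vint32m1_t v = __riscv_vle32_v_i32m1(src, vl);
            __riscv_vse32_v_i32m1(dst, __riscv_vadd_vx_i32m1(v, 1, vl), vl);
            src += vl; dst += vl; n -= vl;
        }
    }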
The article mentions that you have to recompile packed SIMD code when a new architecture comes out. Is that really a problem? Open source software is recompiled every week anyway. You should just describe your operations in a high-level language that gets compiled to assembly for all supported architectures.
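For instance (a sketch - whether the compiler actually auto-vectorizes this is of course target-, compiler- and flag-dependent), the same plain C loop can be built once per target:

    #include <stdint.h>
    #include <stddef.h>

    /* One portable description; the compiler picks the SIMD flavour:
       gcc -O3 -march=x86-64-v3  -> AVX2 packed SIMD
       gcc -O3 -march=rv64gcv    -> RVV vector code */
    void add1(int32_t *dst, const int32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] + 1;
    }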
So, in conclusion, it seems that vector SIMD is optimized for hand-written assembly and closed-source software, while packed SIMD is made for open-source software and compilers and is more efficient. Why the RISC-V community prefers the vector architecture, I don't understand.
> Open source software is recompiled every week anyway.
Despite potentially having been compiled recently, anything from most Linux package managers, and any precompiled downloadable executables (even ones built from open-source code), still targets the 20-year-old SSE2 baseline, wasting the majority of the SIMD resources available on modern (or just not-extremely-ancient) CPUs - unless you're looking at the 0.001% of software that bothers with dynamic dispatch; and for that approach just recompiling isn't enough, you also need to extend the dispatched target set.
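For reference, a dynamic-dispatch skeleton looks something like this (the __builtin_cpu_supports builtin is GCC/Clang-specific; the per-target add1_* functions are hypothetical and would be compiled separately with different -march flags):

    #include <stdint.h>
    #include <stddef.h>

    void add1_sse2(int32_t *dst, const int32_t *src, size_t n);
    void add1_avx2(int32_t *dst, const int32_t *src, size_t n);

    /* Resolved once at startup. Note that merely recompiling changes
       nothing here: supporting a new ISA extension means adding a new
       entry to this dispatch logic. */
    void (*add1)(int32_t *, const int32_t *, size_t);

    void add1_init(void)
    {
        add1 = __builtin_cpu_supports("avx2") ? add1_avx2 : add1_sse2;
    }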
RISC-V RVV's LMUL means that you get unrolling for free, as each instruction can operate over up to 8 registers per operand - essentially "hardware 8x unrolling" - thereby making the scalar overhead insignificant. (Probably a minor nightmare from the silicon POV, but perhaps not in a particularly limiting way: double-pumping has been done by x86 many times, so LMUL=2 is simple enough, and at LMUL=4 and LMUL=8 you can afford to decode/split into uops at 1 instr/cycle.)
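In intrinsics terms (again a sketch assuming the RVV 1.0 C API, reusing the made-up add1 example from above), the LMUL=8 variant is the m1 loop with m8 substituted in - one load/add/store per up-to-8-registers' worth of data, with the same single set of counter updates:

    #include <stdint.h>
    #include <stddef.h>
    #include <riscv_vector.h>

    /* LMUL=8: each instruction operates on a group of 8 vector
       registers, so the scalar bookkeeping per iteration is amortized
       over 8x the data - "hardware unrolling". */
    void add1_rvv_m8(int32_t *dst, const int32_t *src, size_t n)
    {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m8(n);
            vint32m8_t v = __riscv_vle32_v_i32m8(src, vl);
            __riscv_vse32_v_i32m8(dst, __riscv_vadd_vx_i32m8(v, 1, vl), vl);
            src += vl; dst += vl; n -= vl;
        }
    }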
ARM SVE can encode adding a multiple of VL in load/store instructions, allowing manual unrolling without having to actually compute the intermediate addresses. (Hardware-wise that's an extremely tiny amount of overhead, as it's trivially mappable to an immediate offset at decode time.) And there's an instruction to bump a variable by a multiple of VL.
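In C that's roughly the following (ACLE SVE intrinsics; a sketch that assumes n is a multiple of 2*VL to skip the predicated tail). The src + i + vl addresses are what the compiler can fold into the [xN, #1, mul vl] addressing mode, and the i += 2 * vl bump is what maps onto the incw/addvl-style instructions:

    #include <stdint.h>
    #include <stddef.h>
    #include <arm_sve.h>

    /* Manually 2x-unrolled SVE loop without ever materializing the
       intermediate offsets as separate scalar additions. */
    void add1_sve(int32_t *dst, const int32_t *src, size_t n)
    {
        const svbool_t all = svptrue_b32();
        const uint64_t vl = svcntw();   /* 32-bit lanes per vector */
        for (size_t i = 0; i < n; i += 2 * vl) {
            svint32_t a = svld1_s32(all, src + i);
            svint32_t b = svld1_s32(all, src + i + vl);
            svst1_s32(all, dst + i,      svadd_n_s32_x(all, a, 1));
            svst1_s32(all, dst + i + vl, svadd_n_s32_x(all, b, 1));
        }
    }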
And you need to bump pointers in any SIMD regardless; the only difference is whether the bump size is an immediate, or a dynamic value, and if you control the ISA you can balance between the two as necessary. The packed SIMD approach isn't "free" either - you need hardware support for immediate offsets in SIMD load/store instrs.
Even in a hypothetical bad vector ISA without any applicable free offsetting in loads/stores, where unrolling requires explicit address arithmetic, you can avoid having a dependency between unrolled iterations by precomputing "vlen*2", "vlen*3", "vlen*4", ... outside of the loop and adding those as necessary, instead of carrying a strict dependency chain.
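A sketch of that trick, using RVV intrinsics purely as a stand-in for "some vector ISA" (the technique itself is generic): read vlmax once, precompute its multiples, and address all four loads off one base so they don't serialize on pointer bumps:

    #include <stdint.h>
    #include <stddef.h>
    #include <riscv_vector.h>

    void add1_unrolled(int32_t *dst, const int32_t *src, size_t n)
    {
        const size_t vl  = __riscv_vsetvlmax_e32m1();   /* computed once */
        const size_t vl2 = 2 * vl, vl3 = 3 * vl, vl4 = 4 * vl;
        size_t i = 0;
        /* 4x-unrolled body: the four loads depend only on i, not on
           each other, so they can issue back to back. */
        for (; i + vl4 <= n; i += vl4) {
            vint32m1_t a = __riscv_vle32_v_i32m1(src + i,       vl);
            vint32m1_t b = __riscv_vle32_v_i32m1(src + i + vl,  vl);
            vint32m1_t c = __riscv_vle32_v_i32m1(src + i + vl2, vl);
            vint32m1_t d = __riscv_vle32_v_i32m1(src + i + vl3, vl);
            __riscv_vse32_v_i32m1(dst + i,       __riscv_vadd_vx_i32m1(a, 1, vl), vl);
            __riscv_vse32_v_i32m1(dst + i + vl,  __riscv_vadd_vx_i32m1(b, 1, vl), vl);
            __riscv_vse32_v_i32m1(dst + i + vl2, __riscv_vadd_vx_i32m1(c, 1, vl), vl);
            __riscv_vse32_v_i32m1(dst + i + vl3, __riscv_vadd_vx_i32m1(d, 1, vl), vl);
        }
        /* ordinary strip-mined tail for the remainder */
        while (i < n) {
            size_t t = __riscv_vsetvl_e32m1(n - i);
            vint32m1_t v = __riscv_vle32_v_i32m1(src + i, t);
            __riscv_vse32_v_i32m1(dst + i, __riscv_vadd_vx_i32m1(v, 1, t), t);
            i += t;
        }
    }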
Those 4 counter instructions have no dependencies though, so they'll likely all be issued and executed in parallel in 1 cycle, surely? Probably the branch as well, in fact.
The load instruction has a dependency on the counter increment, whereas with packed SIMD one can issue several loads without waiting. Also, the extra counter instructions still consume CPU resources (unless there is some dedicated hardware for this purpose).
Makes sense - writing or updating software is easier than designing or updating hardware. To illustrate: anyone can write software, but not everyone has access to chip manufacturing fabs.