I think that packed SIMD is better in almost every respect, and vector SIMD is worse.
With vector SIMD you don't know the register size beforehand, so you have to maintain and increment counters, adding extra instructions that reduce overall performance. With packed SIMD you can issue several loads immediately, with no dependencies between them; if you look at code examples, the x86 code is denser and uses a sequence of unrolled SIMD instructions with no bookkeeping at all. The RISC-V loop, by contrast, has 4 SIMD instructions plus 4 counter-handling instructions per iteration, i.e. it wastes 50% of its instruction issue bandwidth, and you cannot load the next block until the counter has been incremented.
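To make the contrast concrete, here is roughly what the two loop shapes look like in C intrinsics - a minimal sketch, assuming SSE2 on the x86 side and the RVV 1.0 intrinsics on the RISC-V side; the function names are made up, and the x86 version assumes n is a multiple of 16 so I can skip tail handling:

    #include <stdint.h>
    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2 */

    /* Packed SIMD: the register is known to be 128 bits, so all four
       load offsets are compile-time immediates and the four loads have
       no dependencies on each other. */
    void add1_sse2(int32_t *dst, const int32_t *src, size_t n)
    {
        const __m128i one = _mm_set1_epi32(1);
        for (size_t i = 0; i < n; i += 16) {
            __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
            __m128i b = _mm_loadu_si128((const __m128i *)(src + i + 4));
            __m128i c = _mm_loadu_si128((const __m128i *)(src + i + 8));
            __m128i d = _mm_loadu_si128((const __m128i *)(src + i + 12));
            _mm_storeu_si128((__m128i *)(dst + i),      _mm_add_epi32(a, one));
            _mm_storeu_si128((__m128i *)(dst + i + 4),  _mm_add_epi32(b, one));
            _mm_storeu_si128((__m128i *)(dst + i + 8),  _mm_add_epi32(c, one));
            _mm_storeu_si128((__m128i *)(dst + i + 12), _mm_add_epi32(d, one));
        }
    }

versus the strip-mined vector version, where every iteration asks the hardware for a vl and bumps the pointers by that dynamic amount:

    #include <stdint.h>
    #include <stddef.h>
    #include <riscv_vector.h>   /* RVV 1.0 intrinsics */

    /* Vector SIMD: register width is unknown at compile time, so the
       loop carries vl/pointer updates between iterations. */
    void add1_rvv(int32_t *dst, const int32_t *src, size_t n)
    {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m1(n);
            vint32m1_t v = __riscv_vle32_v_i32m1(src, vl);
            __riscv_vse32_v_i32m1(dst, __riscv_vadd_vx_i32m1(v, 1, vl), vl);
            src += vl; dst += vl; n -= vl;
        }
    }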
The article mentions that you have to recompile packed SIMD code when a new architecture comes out. Is that really a problem? Open source software is recompiled every week anyway. You should just describe your operations in a high-level language that gets compiled to assembly for all supported architectures.
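For instance (a sketch - whether the compiler actually auto-vectorizes this is of course target-, compiler- and flag-dependent), the same plain C loop can be built once per target:

    #include <stdint.h>
    #include <stddef.h>

    /* One portable description; the compiler picks the SIMD flavour:
       gcc -O3 -march=x86-64-v3  -> AVX2 packed SIMD
       gcc -O3 -march=rv64gcv    -> RVV vector code */
    void add1(int32_t *dst, const int32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] + 1;
    }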
So, in conclusion, it seems that vector SIMD is optimized for hand-written assembly and closed-source software, while packed SIMD is made for open-source software and compilers and is more efficient. Why the RISC-V community prefers the vector architecture, I don't understand.
> Open source software is recompiled every week anyway.
Despite potentially having been compiled recently, anything from most Linux package managers, and any precompiled downloadable executables (even ones built from open-source code), still targets the 20-year-old SSE2 baseline, wasting the majority of the SIMD resources available on modern (or just not-extremely-ancient) CPUs - unless you're looking at the 0.001% of software that bothers with dynamic dispatch; and for that approach just recompiling isn't enough, you also need to extend the dispatched target set.
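For reference, a dynamic-dispatch skeleton looks something like this (the __builtin_cpu_supports builtin is GCC/Clang-specific; the per-target add1_* functions are hypothetical and would be compiled separately with different -march flags):

    #include <stdint.h>
    #include <stddef.h>

    void add1_sse2(int32_t *dst, const int32_t *src, size_t n);
    void add1_avx2(int32_t *dst, const int32_t *src, size_t n);

    /* Resolved once at startup. Note that merely recompiling changes
       nothing here: supporting a new ISA extension means adding a new
       entry to this dispatch logic. */
    void (*add1)(int32_t *, const int32_t *, size_t);

    void add1_init(void)
    {
        add1 = __builtin_cpu_supports("avx2") ? add1_avx2 : add1_sse2;
    }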
RISC-V RVV's LMUL means that you get unrolling for free, as each instruction can operate over up to 8 registers per operand - essentially "hardware 8x unrolling" - thereby making the scalar overhead insignificant. (Probably a minor nightmare from the silicon POV, but perhaps not in a particularly limiting way: double-pumping has been done by x86 many times, so LMUL=2 is simple enough, and at LMUL=4 and LMUL=8 you can afford to decode/split into uops at 1 instr/cycle.)
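In intrinsics terms (again a sketch assuming the RVV 1.0 C API, reusing the made-up add1 example from above), the LMUL=8 variant is the m1 loop with m8 substituted in - one load/add/store per up-to-8-registers' worth of data, with the same single set of counter updates:

    #include <stdint.h>
    #include <stddef.h>
    #include <riscv_vector.h>

    /* LMUL=8: each instruction operates on a group of 8 vector
       registers, so the scalar bookkeeping per iteration is amortized
       over 8x the data - "hardware unrolling". */
    void add1_rvv_m8(int32_t *dst, const int32_t *src, size_t n)
    {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m8(n);
            vint32m8_t v = __riscv_vle32_v_i32m8(src, vl);
            __riscv_vse32_v_i32m8(dst, __riscv_vadd_vx_i32m8(v, 1, vl), vl);
            src += vl; dst += vl; n -= vl;
        }
    }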
ARM SVE can encode adding a multiple of VL in load/store instructions, allowing manual unrolling without having to actually compute the intermediate addresses. (Hardware-wise that's an extremely tiny amount of overhead, as it's trivially mappable to an immediate offset at decode time.) And there's an instruction to bump a variable by a multiple of VL.
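In C that's roughly the following (ACLE SVE intrinsics; a sketch that assumes n is a multiple of 2*VL to skip the predicated tail). The src + i + vl addresses are what the compiler can fold into the [xN, #1, mul vl] addressing mode, and the i += 2 * vl bump is what maps onto the incw/addvl-style instructions:

    #include <stdint.h>
    #include <stddef.h>
    #include <arm_sve.h>

    /* Manually 2x-unrolled SVE loop without ever materializing the
       intermediate offsets as separate scalar additions. */
    void add1_sve(int32_t *dst, const int32_t *src, size_t n)
    {
        const svbool_t all = svptrue_b32();
        const uint64_t vl = svcntw();   /* 32-bit lanes per vector */
        for (size_t i = 0; i < n; i += 2 * vl) {
            svint32_t a = svld1_s32(all, src + i);
            svint32_t b = svld1_s32(all, src + i + vl);
            svst1_s32(all, dst + i,      svadd_n_s32_x(all, a, 1));
            svst1_s32(all, dst + i + vl, svadd_n_s32_x(all, b, 1));
        }
    }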
And you need to bump pointers in any SIMD regardless; the only difference is whether the bump size is an immediate, or a dynamic value, and if you control the ISA you can balance between the two as necessary. The packed SIMD approach isn't "free" either - you need hardware support for immediate offsets in SIMD load/store instrs.
Even in a hypothetical bad vector ISA without any applicable free offsetting in loads/stores, where unrolling requires explicit address arithmetic, you can avoid having a dependency between unrolled iterations by precomputing "vlen*2", "vlen*3", "vlen*4", ... outside of the loop and adding those as necessary, instead of carrying a strict dependency chain.
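A sketch of that trick, using RVV intrinsics purely as a stand-in for "some vector ISA" (the technique itself is generic): read vlmax once, precompute its multiples, and address all four loads off one base so they don't serialize on pointer bumps:

    #include <stdint.h>
    #include <stddef.h>
    #include <riscv_vector.h>

    void add1_unrolled(int32_t *dst, const int32_t *src, size_t n)
    {
        const size_t vl  = __riscv_vsetvlmax_e32m1();   /* computed once */
        const size_t vl2 = 2 * vl, vl3 = 3 * vl, vl4 = 4 * vl;
        size_t i = 0;
        /* 4x-unrolled body: the four loads depend only on i, not on
           each other, so they can issue back to back. */
        for (; i + vl4 <= n; i += vl4) {
            vint32m1_t a = __riscv_vle32_v_i32m1(src + i,       vl);
            vint32m1_t b = __riscv_vle32_v_i32m1(src + i + vl,  vl);
            vint32m1_t c = __riscv_vle32_v_i32m1(src + i + vl2, vl);
            vint32m1_t d = __riscv_vle32_v_i32m1(src + i + vl3, vl);
            __riscv_vse32_v_i32m1(dst + i,       __riscv_vadd_vx_i32m1(a, 1, vl), vl);
            __riscv_vse32_v_i32m1(dst + i + vl,  __riscv_vadd_vx_i32m1(b, 1, vl), vl);
            __riscv_vse32_v_i32m1(dst + i + vl2, __riscv_vadd_vx_i32m1(c, 1, vl), vl);
            __riscv_vse32_v_i32m1(dst + i + vl3, __riscv_vadd_vx_i32m1(d, 1, vl), vl);
        }
        /* ordinary strip-mined tail for the remainder */
        while (i < n) {
            size_t t = __riscv_vsetvl_e32m1(n - i);
            vint32m1_t v = __riscv_vle32_v_i32m1(src + i, t);
            __riscv_vse32_v_i32m1(dst + i, __riscv_vadd_vx_i32m1(v, 1, t), t);
            i += t;
        }
    }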
Those 4 counter instructions have no dependencies though, so they'll likely all be issued and executed in parallel in 1 cycle, surely? Probably the branch as well, in fact.
The load instruction has a dependency on the counter increment, whereas with packed SIMD one can issue several loads without waiting. Also, the extra counter instructions still consume CPU resources (unless there is some dedicated hardware for this purpose).
Makes sense - writing or updating software is easier than designing or updating hardware. To illustrate: anyone can write software, but not everyone has access to chip manufacturing fabs.