Hacker News

Thank you! I've been waiting for a viable RVV board for a long time. Just ordered the OrangePi RV2.

This unblocks me to properly work on optimizing software for vector support. OoO execution and even wider RVV registers will then automatically speed things up, without even a recompile.

Yes, I know I could use qemu, but it's not the same. I feel like this is what unblocks me on the software side.



> OoO execution and even wider RVV registers will then automatically speed things up, without even a recompile.

The problem is that for some things in RVV it's unclear how they will perform on high-performance OoO cores:

* General choice of LMUL: on in-order cores it's clearly best to maximize LMUL without causing spills; for OoO cores this isn't clear.
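For context, at LMUL=4 each vector instruction operates on a group of four architectural registers, which amortizes frontend work on in-order cores but leaves only 8 register groups for the allocator. A minimal stripmine body looks like this (RVV 1.0 assembly sketch, untested; register assignments are illustrative):

```asm
# a0 = element count, a1 = src, a2 = dst
# e32/m4: 32-bit elements, LMUL=4, so "v8" names the group v8..v11
vsetvli t0, a0, e32, m4, ta, ma
vle32.v v8, (a1)        # load up to vl elements into v8..v11
vadd.vv v8, v8, v8      # one instruction, four registers' worth of work
vse32.v v8, (a2)
```

The open question is whether an OoO core cracks the m4 group into uops whose latency it can hide anyway, which would make m1/m2 with more independent register groups just as fast.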

* How will LMUL>1 vrgather and vcompress perform?
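The reason this is a concern: at LMUL>1, vrgather.vv is an all-to-all permutation across the whole register group (any destination element may read from any source register in the group), so naive hardware does O(LMUL^2) work. The pattern in question (RVV 1.0 assembly sketch, untested):

```asm
vsetvli t0, a0, e8, m4, ta, ma
vle8.v  v8,  (a1)          # table:   v8..v11
vle8.v  v12, (a2)          # indices: v12..v15
vrgather.vv v16, v8, v12   # any byte of v16..v19 may come from any of v8..v11
```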

* How high is the impact of vsetvli instructions? Is it worth trying to move them out of loops whenever possible, or is the impact minimal, as on the current in-order implementations?
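Concretely: a stripmined loop needs one vsetvli per iteration to recompute vl for the tail, but any extra vsetvlis that re-assert an unchanged vl/vtype are candidates for hoisting or removal. A sketch of the minimal form (RVV 1.0 assembly, untested; pointer arithmetic uses base ISA shifts rather than Zba):

```asm
loop:
    vsetvli t0, a0, e32, m2, ta, ma  # needed each iteration for the tail
    vle32.v v8, (a1)
    # a redundant "vsetvli t0, a0, e32, m2, ta, ma" here is nearly free
    # on current in-order cores -- the question is the cost on OoO cores
    vadd.vv v8, v8, v8
    vse32.v v8, (a2)
    sub  a0, a0, t0
    slli t1, t0, 2                   # advance pointers by vl * 4 bytes
    add  a1, a1, t1
    add  a2, a2, t1
    bnez a0, loop
```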

* What is the overhead of using .vx instruction variants? Is there additional cost involved in moving between GPRs and vector registers?
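The alternative to a .vx instruction is to splat the scalar into a vector register once and use the .vv form, spending a register group to avoid any repeated GPR-to-vector transfer (untested sketch):

```asm
# Form 1: the scalar operand is read from a GPR by the instruction itself
vadd.vx v8, v8, a3

# Form 2: broadcast once, then stay entirely in the vector domain
vmv.v.x v4, a3        # splat a3 into v4
vadd.vv v8, v8, v4
```

If the .vx form internally re-broadcasts the scalar on every issue, form 2 wins in hot loops; if the core forwards the scalar cheaply, the two should be equivalent.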

* Is there additional overhead when reinterpreting vector masks?

* What performance can we expect from the more complex loads/stores, especially the segmented ones?
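Segmented loads/stores de-interleave array-of-structures data in a single instruction, e.g. splitting packed RGB bytes into three registers; whether that runs at full load bandwidth or gets microcoded into many element accesses is exactly the open question (RVV 1.0 assembly sketch, untested):

```asm
# a1 points at packed RGB triples: R0 G0 B0 R1 G1 B1 ...
vsetvli t0, a0, e8, m1, ta, ma
vlseg3e8.v v8, (a1)   # v8 = all R bytes, v9 = all G, v10 = all B
```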

The LLVM scheduling models give some insight:

* SiFive P670: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...

* Tenstorrent Ascalon: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ... (the vector part is still missing, but a PR is supposed to land in the near future)

I'm trying to collect as much info on hardware as I can: https://camel-cdr.github.io/rvv-bench-results/index.html



