Hacker News: camel-cdr's comments

K&R syntax is -1 char, if you are in C:

    double solve(double a,double b,double c,double d){return a+b+c+d;}
    double solve(double a...){return a+1[&a]+2[&a]+3[&a];}
    double solve(a,b,c,d)double a,c,b,d;{return a+b+c+d;}
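
For contrast, the first two variants above rely on the arguments happening to be laid out contiguously (indexing past `&a` is undefined behavior that only works on some ABIs). A sketch of the portable, non-golfed equivalent using stdarg (`solve_va` is a hypothetical name, not from the original):

```c
#include <stdarg.h>

/* Portable variadic sum: unlike the 1[&a] golf trick, this is
   well-defined C and works regardless of calling convention. */
double solve_va(int n, ...) {
    va_list ap;
    va_start(ap, n);
    double s = 0;
    for (int i = 0; i < n; i++)
        s += va_arg(ap, double); /* floats are promoted to double anyway */
    va_end(ap);
    return s;
}
```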

> For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?

Use Zvzip; in the meantime:

zip: vwmaccu.vx(vwaddu.vv(a, b), -1, b), or segmented loads/stores when you are touching memory anyway

unzip: vnsrl

trn1/trn2: masked vslide1up/vslide1down with even/odd mask

The only thing base RVV does badly among those is register-to-register zip, which takes twice as many instructions as on other ISAs. Zvzip gives you dedicated instructions for all of the above.
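
To see why the zip trick works, here is a scalar model of one 8-bit element (a sketch of the arithmetic identity, not RVV code): vwaddu.vv produces the widened sum a+b, and vwmaccu.vx with scalar -1 (truncated to SEW bits, i.e. 255) adds 255*b on top, leaving a + 256*b, which is exactly byte a in the low half and byte b in the high half of the widened element, i.e. the interleaved pair in little-endian memory order.

```c
#include <stdint.h>

/* Scalar model of vwmaccu.vx(vwaddu.vv(a, b), -1, b) for SEW=8:
   (a + b) + 255*b == a + 256*b == a | (b << 8). */
uint16_t zip_pair(uint8_t a, uint8_t b) {
    uint16_t acc = (uint16_t)a + (uint16_t)b;      /* vwaddu.vv: widened a+b   */
    acc += (uint16_t)((uint8_t)-1) * (uint16_t)b;  /* vwmaccu.vx: + 255*b      */
    return acc;                                    /* low byte a, high byte b  */
}
```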


Looks like the ratification plan for Zvzip is November. So maybe 3y until HW is actually usable? That's a neat trick with wmacc, congrats. But still, half the speed for quite a fundamental operation that has been heavily used in other ISAs for 20+ years :(

Great that you did a gap analysis [1]. I'm curious if one of the inputs for that was the list of Highway ops [2]?

[1]: https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3... [2]: https://github.com/google/highway/blob/master/g3doc/quick_re...


OK, look.

Since my previous attempt to measure the impact of trapping on signed overflow didn't seem to have moved your position one bit, I thought I'd give it a go in the most representative way I could think of:

I built the same version of clang on an x86, an aarch64, and a RISC-V system, using clang. Then I built another version with the `-ftrapv` flag enabled and compared the compile times of programs compiled with these clang builds running on real hardware:

    runtime:         x86         | aarch64                    | RISC-V (RVA23)
                     Zen1        |  A78          A55*         |  X100         A100
    clang A:         3.609±0.078 |  4.209±0.050   9.390±0.029 |  5.465±0.070  11.559±0.020
    clang-ftrapv A:  3.613±0.118 |  4.290±0.050   9.418±0.056 |  5.448±0.060  11.579±0.030
    clang B:         8.948±0.100 | 10.983±0.188  22.827±0.016 | 13.556±0.016  28.682±0.023
    clang-ftrapv B:  8.960±0.125 | 11.099±0.294  22.802±0.039 | 13.511±0.018  28.741±0.050

    (All cores clocked to about 2.2GHz; Zen1 can reach almost 4GHz.)


As you can see, once again the overhead of -ftrapv is quite low.

Surprisingly, the -ftrapv overhead seems highest on the Cortex-A78. My guess is that this is because clang generates a separate brk with a unique immediate for every overflow check, while on RISC-V it always branches to one unimp per function.
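
For reference, -ftrapv conceptually lowers every signed add to an overflow-checked add that branches to a trap instruction (brk on aarch64, unimp on RISC-V). A rough sketch of that lowering, not clang's exact codegen:

```c
/* Roughly what -ftrapv turns a signed "a + b" into: the hot path is a
   normal add plus one well-predicted branch, which is why the measured
   overhead is small. */
int trapping_add(int a, int b) {
    int r;
    if (__builtin_add_overflow(a, b, &r))
        __builtin_trap(); /* becomes brk / unimp */
    return r;
}
```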

Please tell me if you have a better suggestion for measuring the real world impact.

Or heck, give me some artificial worst case code. That would also be an interesting data point.

Notes:

* The format is mean±variance

* Spacemit X100 is a Cortex-A76 like OoO RISC-V core and A100 an in-order RISC-V core.

* I tried to clock all of the cores to the same frequency of about 2.2GHz, except for the A55 (*), which ran at 1.8GHz; I linearly scaled its results accordingly.

* Program A was the chibicc (8K loc) compiler and program B microjs (30K loc).

    binary size:
                  x86        aarch64    RISC-V
    clang:        212807768  216633784  195231816
    clang-ftrapv: 212859280  216737608  195419512
    increase:     0.024%     0.047%     0.09%

I suspect that LLVM is optimized for compiling with `-ftrapv`, perhaps for cheap sanitizing or maybe just due to design decisions like using unsigned integers everywhere (please correct me if I'm wrong). I'm personally interested in how RISC-V behaves on computational tasks where computing carry is a known bottleneck, like long addition. Maybe looking at libgmp could be interesting, though I suspect absolute numbers will not be meaningful, and there's no baseline to compare them to.

LLVM mostly uses size_t like most C/C++ programs, which either use size_t or int for everything, both of which are handled well by RISC-V.

> Maybe looking at libgmp could be interesting, though I suspect absolute numbers will not be meaningful, and there's no baseline to compare them to.

Realistically, nobody cares about BigInt addition performance, considering there is no GMP implementation using SIMD, or even one using dependency breaking to get beyond 64 bits per cycle.

I whipped up a quick AVX-512 implementation that was 2x faster than libgmp on Zen4 (which has 256-bit SIMD ALUs). On RISC-V you'd just use RVV to do BigInt stuff.
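
The serial baseline being compared against is the plain limb-by-limb add, where the loop-carried carry chain limits throughput to one 64-bit limb per iteration. A minimal sketch (not GMP's implementation; uses the GCC/clang `unsigned __int128` extension):

```c
#include <stdint.h>
#include <stddef.h>

/* Naive 64-bit-limb bignum add r = a + b; returns the final carry-out.
   Each iteration depends on the previous carry, which is the dependency
   chain that SIMD / carry-breaking implementations try to get around. */
uint64_t bigint_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n) {
    unsigned __int128 s = 0; /* holds carry between iterations */
    for (size_t i = 0; i < n; i++) {
        s += (unsigned __int128)a[i] + b[i];
        r[i] = (uint64_t)s;
        s >>= 64; /* keep only the carry */
    }
    return (uint64_t)s;
}
```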


"nobody cares about BigInt addition performance" is an odd claim to make when half of the world's cryptography is based on ECC.

Exactly, I 100% agree, and IMO toolchains should default to assuming fast misaligned load/store for RISC-V.

However, the spec has the explicit note:

> Even though mandated, misaligned loads and stores might execute extremely slowly. Standard software distributions should assume their existence only for correctness, not for performance.

Which was a mistake. As you said, any instruction could be arbitrarily slow, and in other aspects where performance recommendations could actually be useful, RVI usually says "we can't mandate implementation".


The cursed thing is that RVA23 basically does guarantee that `vle8.v` + `vmv.x.s` on misaligned addresses is fast.

Yeah, that is quite funky; and indeed gcc does that. Relatedly, it's super annoying that `vle64.v` & co could then also make use of that same hardware, but that's not guaranteed. (I suppose there could be awful hardware that does vle8.v via single-byte loads, which wouldn't translate to vle64.v?)
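
On the toolchain side, the portable way to express a possibly-misaligned load is a memcpy into a local; whether the target is assumed to have fast misaligned access decides if this compiles to one plain load or a byte-by-byte sequence, which is exactly the codegen choice discussed above. A sketch:

```c
#include <stdint.h>
#include <string.h>

/* Well-defined misaligned load: the memcpy is folded away by the compiler.
   With fast misaligned access assumed, this is a single ld/mov;
   otherwise the compiler may emit shift-and-or byte loads. */
uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```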

x86 is a lot easier to JIT to Arm or RISC-V though, because it has fewer registers.


Signed 64-bit is the worst case. When I tried to enable overflow checking, the overhead of RISC-V and Arm was comparable: https://news.ycombinator.com/item?id=46588159#46668916


Should've probably been the link to the actual Ubuntu blog: https://ubuntu.com//blog/canonical-and-ubuntu-risc-v-a-2025-...


Agreed; however, I'm quite sure 25,000 lines translated over "multiple months" is very slow for a naive translation between languages as similar as C++ and Rust.


