Great article, and it’s pretty interesting to see portable simd in use. I tried reproducing the benchmarks on a Zen 3 system, and I’m getting identical speedups. On my M1 MBP, the perf gains start around 1.4x, gradually increasing to 2x @ 110 bytes of input length. A bit smaller gains than on x86_64, but I’d say it fulfils its goal.
Although looking at the code, this article confirms my experience that Rust has rather poor ergonomics for simd and pointer-related work (and performance engineering in general).
Haven't read through the article yet, but as a Rust eng... I sort of agree, but ... I mean, pointer & raw memory related work is deliberately kneecapped in the name of safety, the language wants you to really think about what you're doing.
But yep portable SIMD in Rust is still not a good story, compared to C++. And getting down to raw byte regions, pointer & buffer manipulation, etc requires becoming comfortable with Pin, MaybeUninit, etc. Both portable simd and allocator_api have been sitting unstable for years. And the barrier to entry is a bit higher, and it's more awkward ... mostly on purpose
But there's nothing stopping one from building one's own abstractions (or using 3rd party crates etc) to make these things more ergonomic within one's own program?
Being more ergonomic than C++ is a low bar anyways, we should hold ourselves to a higher standard. For starters, field offset syntax is painful (see Gankra's post on this topic) and pointer manipulation lacks syntactic sugar. I also believe unsigned integer promotion and UFCS would improve ergonomics with high-perf code (although I'm aware this is not going to be included in Rust).
At that point it is better to have some kind of DSL that should not be in the main language, because it would target a much lower level than a typical program. The best effort I've seen in this scene was Google's Highway [1] (not to be confused with HighwayHash) and I even once attempted to recreate it in Rust, but it is still a bit far from my ideal.
I agree about having a DSL. It may not need to be low level, you could have an array language like APL/J. But glsl would work for me too, the important thing is to lure developers to expose parallelism.
The kind of things you can use direct SIMD for is a reasonably different space from array languages; while some things can translate reasonably, APL-likes get much less pretty on operations where elements don't have homogeneous meaning/mapping, e.g. the base64 conversion of OP. (for reference, I work on https://github.com/dzaima/CBQN, an implementation of BQN, an APL derivative, and we have a DSL for implementing SIMD primitives for it (and sometimes scalar ones too): https://github.com/mlochbaum/Singeli)
Check out the Clang extended vector extension for C, IMHO that's the perfect way to add portable SIMD to a language. Doing it through a library will always be clumsy.
It quickly runs into __builtin functions though, which are not very different from intrinsics with better, portable names. Rust's `Simd` type is much more consistent with the corresponding scalar types compared to that.
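To illustrate what I mean by consistency, here's a minimal sketch with nightly std::simd (the function and values are just made up for the example):

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    fn demo(a: Simd<u32, 8>, b: Simd<u32, 8>) -> Simd<u32, 8> {
        // Same operators as scalar u32: +, &, shifts, comparisons, ...
        let sum = a + b;
        let masked = sum & Simd::splat(0xff);
        // Lane-wise compare yields a Mask, which can select between vectors.
        let lt = a.simd_lt(b);
        lt.select(masked, sum)
    }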
True, I think the builtins were necessary where the existing C operators don't map well to a specific SIMD feature (from what I've seen it's mostly "reduce" actions to reduce SIMD lanes into a single value). Still, a lot can be done without having to pepper the code with builtin functions, and the result is much more readable than purely intrinsic-based code.
But a "new" language like Rust would have the freedom to add special language syntax and operators for such things instead of going the C++ way of "just add it to the stdlib" (which IMHO is almost always the wrong place... it should either go into a regular cargo dependency, or into the language itself).
It is crazy sometimes how you try to program something the best you ever could with classical C++, and then someone comes along and makes a version using SIMD that is over 10x faster (but less portable code).
I really wish compilers were better at auto vectorization. And I wish there were some language support for annotations that locally allow reordering some operations.
Good SIMD code requires careful consideration of how your data is laid out in memory. This makes auto vectorization really difficult since the compiler can’t fix your data for you outside of very local contexts
There are unfortunately a lot of serial guarantees that are unavoidable even if compilers could optimize perfectly. For example: 'for(double v: vec) sum+=v'
Floating point addition is not associative, so summing each value in order is not the same as summing every 8th element into partial sums and then combining them, which is how SIMD handles it. So even though this is an obvious optimization for compilers, they will prioritize the serial guarantee over the optimization unless you tell them to relax that particular guarantee.
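To make the reassociation concrete, here's a rough sketch of the order a SIMD sum actually uses (nightly std::simd, my own toy example); the compiler won't pick this on its own precisely because it can change the result:

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    fn simd_sum(v: &[f64]) -> f64 {
        let mut acc = Simd::<f64, 8>::splat(0.0);
        let mut chunks = v.chunks_exact(8);
        for chunk in &mut chunks {
            // Eight running partial sums: a different association order
            // than the strictly left-to-right serial loop.
            acc += Simd::from_slice(chunk);
        }
        acc.reduce_sum() + chunks.remainder().iter().sum::<f64>()
    }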
It's a mess and I agree with janwas: use a library (and in particular: use Google Highway) or something like Intel's ISPC when your hot path needs this.
Hence my suggestion for language support. Add some language syntax where you can say: "add these doubles, order doesn't matter", which allows the compiler to use SIMD
-ffast-math is global though and can break other libraries. I mean a local, vectorization-compatible but portable syntax, without actually writing the SIMD yourself (let the compiler do the work).
I'm guilty of having used it, but from the docs: "The optimize attribute should be used for debugging purposes only. It is not suitable in production code."
One of the things I struggled with in Rust’s portable_simd is that I think fadd_fast and simd intrinsics don’t mix well. This makes writing dot products only ok, rather than really easy. Here’s (one) reference: https://internals.rust-lang.org/t/suggestion-ffastmath-intri...
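For reference, the workaround is to spell out the reassociation by hand instead of relying on fadd_fast; a rough sketch with nightly std::simd (the names and lane count are just my choices):

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    fn dot(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len());
        let mut acc = Simd::<f32, 8>::splat(0.0);
        let mut ca = a.chunks_exact(8);
        let mut cb = b.chunks_exact(8);
        for (x, y) in (&mut ca).zip(&mut cb) {
            // Lane-wise multiply, accumulate into 8 partial sums; the
            // reduction order is explicit, so no fast-math flag is needed.
            acc += Simd::from_slice(x) * Simd::from_slice(y);
        }
        let tail: f32 = ca
            .remainder()
            .iter()
            .zip(cb.remainder())
            .map(|(x, y)| x * y)
            .sum();
        acc.reduce_sum() + tail
    }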
> It is crazy sometimes how you try to program something the best you ever could with classical C++, and then someone comes along and makes a version using SIMD that is over 10x faster (but less portable code).
But that’s one of the points of a systems programming language (of which C++ is one) — it tries to be as efficient as possible in a portable way, but makes it easy to do target-specific programming when required.
> I really wish compilers were better at auto vectorization
FORTRAN compilers sure are, since aliasing is not allowed. C++ is kneecapped by following C’s memory model.
But why is C++ refusing to add "restrict" to its spec? C has restrict in its spec, C++ has not. Of course compilers inherit restrict from C, but it's not super portable due to C++ not officially supporting it. But why?!
In or out of the spec doesn't really matter. You can use __restrict__ in C++ and any compiler will diligently accept it, apart from the occasional warning.
The larger issue is that C and C++ have convoluted aliasing rules which make reasoning about and flagging non-aliased regions hard. The most innocent of casts can break strict aliasing rules and send you into horrible debugging scenarios. The sad side effect being people adding -fno-strict-aliasing to -O2/-O3.
I was wondering the same! I just recently came across the restrict keyword (thanks to someone mentioning it on HN) but I have never seen it being mentioned before, even though I was looking into all sorts of optimizations (also ChatGPT would never bring it up on its own).
Since with templating everything seems to be going header only anyways, I feel like we should give the compiler more power to reason about the memory layout, even through multiple levels of function invocation/inlining.
I wrote some library where basically all shapes can be inferred at compile time (through templates), but the compiler only rarely seems to take advantage of this.
It also sucks that you have no way of knowing at build time whether autovectorisation succeeded. Even if you have it working initially it can silently break and you don't know where. So using clang, you're careful and you check the generated code and everything is great, then later it isn't and you have to go hunting to find out why.
Fortran compilers can assume that (many) dummy arguments to a particular procedure reference do not alias each other, true. But there are many other ways in which aliasing can happen in a conforming Fortran program.
DPC++/SYCL might be more interesting in this area (GPU code looking like evolved C++).
ISPC is an interesting take on 'building a vector DSL which can easily be integrated in a C++ build-chain'.
Though calling GPUs SIMD nowadays is kind of reductive, since you also have to contend with a very constrained memory hierarchy, and you don't write or think of your code as SIMD so much as independent scalar threads (SIMT). Also they don't give access to most bit-tricks, and the rigmarole of shuffles one would expect from a SIMD processor.
> Though calling GPUs SIMD nowadays is kind of reductive, since you also have to contend with a very constrained memory hierarchy, and you don't write or think of your code as SIMD so much as independent scalar threads (SIMT). Also they don't give access to most bit-tricks,
Oh rly?
* Popcnt is defined in ROCm and CUDA.
* Shifts, XORs, AND, OR, NOT are all defined in GPUs.
* You can actually do AES and a lot of other bit-level twiddling in GPU space very efficiently (e.g. see any mining software for highly efficient bit-twiddling).
Yeah, not as many tricks (ex: pext / pdep) as a CPU. But popcnt is the "big" one. GPUs also have bit-reverse instructions (which reduce the need for LS1B instructions like CPUs have, because you can do LS1B tricks on "both ends" by just bit-reversing).
But honestly, GPUs are so parallel and Register-space is so plentiful that you can probably just build whatever you want out of AND/OR/NOT/XOR instructions alone.
And since GPUs have single-cycle "popcnt", any symmetric bit-twiddling function can be built off of popcnt within a few cycles.
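To show what I mean by doing the LS1B trick on "both ends" via bit-reversal, here's a plain scalar Rust sketch of the idea (nothing GPU-specific, just the bit logic):

    /// Isolate the lowest set bit (the classic LS1B idiom: x & -x).
    fn ls1b(x: u32) -> u32 {
        x & x.wrapping_neg()
    }

    /// Isolate the highest set bit by bit-reversing, taking the LS1B,
    /// and reversing back: the "do it on both ends" trick.
    fn ms1b(x: u32) -> u32 {
        ls1b(x.reverse_bits()).reverse_bits()
    }

    fn main() {
        assert_eq!(ls1b(0b1011_0100), 0b0000_0100);
        assert_eq!(ms1b(0b1011_0100), 0b1000_0000);
    }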
> and the rigmarole of shuffles one would expect from a SIMD processor.
Abuse of __shared__ memory and the full crossbar of a GPU-core means that you can arbitrarily shuffle data through __shared__ memory in like 5 clock ticks (well... in certain ways... if you do it wrong it'd take many clock ticks).
GPUs are actually far superior to AVX512 on this front. NVidia's shfl.bfly instruction can implement FFTs, even-odd sorting, bitonic sorts, and more as high-speed communications.
Trust me on this case: GPU data-movement is far, far, far superior to CPU data-movement.
Intel does NOT have the bpermute or permute instructions like a GPU does. Meanwhile, both AMD and NVidia GPUs have bpermute / permute for gather- and scatter-like operations / distribution of data across lanes.
AMD/NVidia also have guarantees upon the speed of broadcasts across __shared__ memory. And butterfly-shuffles can theoretically implement any arbitrary shuffle you want in log2(SIMD-Width) steps, so we have incredible amounts of tools in GPU space that CPU-programmers have no idea about.
Thanks for the enlightenment here, some of those I didn't know. Being used to explicit vectors, coming from avx512 proficiency and pdep/pext and gfni rabbit holes, I kind of forgot all of that when going to cuda. Trying to avoid the trap of 'doing C in Rust (or Ada, hopefully the meaning is clear)', I jumped to new idioms, and I must say most of those you cited never appeared in most high performance code I've read; some I only saw perusing ptx and lower-level compiled code, which... I was never sure nvidia would maintain over time. Seems for us it's cub, barriers, atomics and ballots.
It seems I have lots of reading to do and lots of ways to improve my sorting networks / counting sort implementations.
> I was never sure nvidia would maintain over time.
PTX is maintained over time. It's a high-level assembly, so to speak; the full details of the machine remain abstracted so that code can be more portable.
SASS is not. SASS changes from architecture to architecture. SASS is the actual machine code of NVidia cards. There's an overall understanding of SASS in the GPU world, but it's not really documented and you "shouldn't" want to learn about it.
--------
I should note that Intel's "pshufb" instruction is very similar to the permute instruction in NVidia/AMD. So yeah, there's a high-speed generic shuffle that's key to Intel/AMD AVX512 code.
But having the backwards-direction (bpermute) available too, as well as __shared__ memory for all other cases is great.
I saw the proof in some Binary-decision-diagram book that I've forgotten the name of.
The gist is that if you have a symmetric function (ie: order doesn't matter) of say, 8-bits, then that means f(10101010) == f(11110000) == f(00001111), and all such combinations. Because this is the very definition of "order doesn't matter".
This necessarily implies that f(bits) can be rewritten into the form of f(bits) == g(popcnt(bits)).
Something to do with counting up all the possibilities of "order doesn't matter" and then pigeon-holing them into popcnt combinations or something really mathematical like that. I'm sorry I don't remember the proof, but hopefully this is enough to give you the gist of the idea.
------------------
For example: XOR is the simplest symmetric function. We can see that XOR can be rewritten as XOR(bits) == g(popcnt(bits)), where g is "take the bottom bit of popcnt".
A lot of the "power" of BDDs is their ability to uncannily decompose functions into symmetric and non-symmetric parts. Not consistently mind you, but if a BDD is well-behaved (low-memory space, high-speeds, etc. etc.), its likely because a large part of the calculations happened to be symmetric.
This can be beneficial with, say, addFourDWORDs(a, b, c, d), where a+b+c+d has a "partial" level of symmetry (a_0 XOR b_0 XOR c_0 XOR d_0 determines the output_0 bit, the 1st bit is related to XOR(a1, b1, c1, d1), etc.). So there are all kinds of "hidden popcounts" that could pop up in practice for random functions (i.e. "partially symmetric"), though it's a puzzle how to exactly decompose arbitrary bit-functions into this form.
And even if you do, there's no guarantee that the popcnt form is actually faster either. But it does give you some "trick" to try when optimizing an arbitrary bitwise function into a high-speed routine.
-----------------
Other symmetric functions (assuming 8-bit numbers) can be written the same way; a few are sketched below.
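These are my own picks, written in the f(bits) == g(popcnt(bits)) form; two bytes with the same popcount are bit-permutations of each other, so a symmetric f can't tell them apart:

    /// Parity (XOR of all bits): g = "take the bottom bit of popcnt".
    fn parity(x: u8) -> u8 {
        x.count_ones() as u8 & 1
    }

    /// AND of all bits: g = "is popcnt equal to 8?".
    fn all_ones(x: u8) -> bool {
        x.count_ones() == 8
    }

    /// OR of all bits: g = "is popcnt nonzero?".
    fn any_one(x: u8) -> bool {
        x.count_ones() != 0
    }

    /// Strict majority: g = "is popcnt at least 5?".
    fn majority(x: u8) -> bool {
        x.count_ones() >= 5
    }

    fn main() {
        // Same popcount => same output, e.g. f(0b10101010) == f(0b11110000).
        assert_eq!(parity(0b1010_1010), parity(0b1111_0000));
        assert_eq!(majority(0b1010_1010), majority(0b1111_0000));
        assert!(!all_ones(0b1010_1010) && any_one(0b1010_1010));
    }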
Small note - while the compiler wasn't able to optimize that specific popcount implementation to a single instruction, it can for some other implementations, though it's of course very picky: https://godbolt.org/z/T69KxWWW8
Found a canvas-based library for this: https://larsjung.de/pagemap/. Definitely not what OP uses, where the minimap is a shrunk copy of the content markup, with all the drawbacks, such as page search finding the same item twice.
How does this compare to fastbase64[0]? Great article, I'm happy to see this sort of thing online. I wish I could share the author's optimism about portable SIMD libraries.
hm, any tool that enables people to write more SIMD is probably a good thing, but personally I find it preferable to have the SIMD integrated into the same toolchain. This allows us to (inlined!) call back into C++, use templates/classes in the SIMD code, and also inline together multiple SIMD code regions.
I agree implementing dynamic dispatch is difficult, but Highway has taken care of that :)
Highway is a step to the right direction, but more difficult to use compared to ISPC - e.g. handling remainders is mostly manual and somewhat error-prone.
Just saw your comment. I agree that remainders are tricky.
We do have some ready-made algorithms in hwy/contrib/algo and hwy/contrib/unroller which avoid remainder handling. Any thoughts on how this can be improved?
Yes, extremely easy. ISPC emits objects and header files, which can be directly linked/included from C++. Dynamic dispatching happens automatically and the widest version of the called function gets selected according to the runtime architecture.
Ah, it's just not your field of work. Just like the average person is not a software engineer or a physicist or... you get the point. A few months of dedicated study and you would be able to do similar work for sure.
If you ever had occasion to have an employer and/or project where this was called for, you could probably "be this smart"; it just comes down to interest and necessity.
I dip in and out of these waters (performance optimization, more systems bare metal engineering), but basically on personal projects. I wish I had jobs where it was more called for, but it's not what most industry work needs.
Try doing AoC '23 in APL/J/K, BQN, or Python/numpy (meaning, not idiomatic Python, but everything with numpy) or CUDA etc. It's fun and it will teach you these types of smarts; a lot of the post is very natural in how you think about solving things in those languages. After a while you start seeing problems in that form.
Interesting article! Right at the start, the first example of a non-vectorized popcnt implementation is said to produce “comically bad code, frankly”, but in release mode with the native target CPU, it does appear to be vectorising the function fairly ok?
LLVM has a LoopIdiomRecognize pass which will, amongst other patterns, try to detect a loop implementing popcount [1] and convert it to the llvm.ctpop.* intrinsic [2]. I've not looked at why it's not matching this loop, but the sibling comment suggesting to just use `.count_ones` seems like the right answer for Rust. In C or C++ I'd strongly recommend using `__builtin_popcount` (provided by both Clang and GCC).
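For completeness, a sketch of the two Rust spellings being compared here (the loop is an approximation of the kind of implementation that can trip the pass up, not the article's exact code):

    // Hand-rolled loop: may or may not be recognized by LoopIdiomRecognize.
    fn popcount_loop(mut x: u32) -> u32 {
        let mut count = 0;
        while x != 0 {
            count += x & 1;
            x >>= 1;
        }
        count
    }

    // Idiomatic Rust: lowers straight to llvm.ctpop, so you get a `popcnt`
    // instruction whenever the target supports it.
    fn popcount_intrinsic(x: u32) -> u32 {
        x.count_ones()
    }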
Instruction counts are not everything. The rust assembly is completely branchless, and on my M2 MacBook I'm benching it at ~2ns per call vs ~10ns per call with the equivalent C++ code (running this: https://godbolt.org/z/nTanvfs76)
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_cpp 10.1 ns 10.1 ns 69503053
BM_rs 1.74 ns 1.72 ns 393022172
I could be doing something horribly wrong here - this was a nice intro into a few different things I haven't used before. But I ended up taking the rust function and outputting the LLVM IR for it, then compiling and linking that with the C++ benchmark. And it worked, after I realised how to get the optimizer to stop removing everything.
Instruction count is not everything but when something is 10x larger, unless the 1x is extremely complicated, I think it is safe to expect that the 10x version will run slower. This 1x version is extremely trivial despite that one branch.
.L4:
mov edx, edi
and edx, 1
cmp edx, 1
sbb eax, -1
shr edi
jne .L4
ret
In the first run, jne might be mispredicted (equating to a pipeline flush, ~15 cycles) but in all the other consecutive runs it will be predicted correctly. This means that the larger the amount of iterations is, cost of the first misprediction becomes more and more negligible.
Vectorized execution OTOH is not cheap and does not necessarily result in faster code so purely seeing it in assembly would not mean much without actually measuring it. Scalar versions of the same code may be faster than the compiler-generated auto-vectorization code.
That said, I am surprised by the results - in my mental model of how things work this does not add up. Rust assembly is ~50 instructions, roughly ~30 of them being vector (SIMD) instructions and the rest of ~20 instructions being the superset of (only) 6 instructions used in C++ assembly. I cannot see how those 6 instructions, even with the branch mispredict cost of ~15 cycles, can run ~2.5x slower than the pretty much convoluted version of the SIMD popcnt.
> I cannot see how those 6 instructions, even with the branch mispredict cost of ~15 cycles, can run ~2.5x slower than the pretty much convoluted version of the SIMD popcnt.
The quick-bench tool gives some nice assembly-level timing information[1] but I couldn't get it to work with my janky inline-copy-paste-assembly, and I blew past my curiosity time budget for this. You seem pretty experienced and I'd love to know more about this, so maybe if you know how to get the "popcnt_rs" extern assembly function to be inlined in the quick-bench link below we could take a look in more detail?
There's a data dependency between those 6 instructions, so the out-of-order execution engine won't be able to kick in as much, but still... each of those 6 instructions takes only 1-2 cycles to execute. Perhaps a better question to ask is: how many loop iterations are needed before 6 x {1, 2} cycles outweighs the SIMD version?
(I wrote this initially as an EDIT in my response above but since you replied at about the same time, I am copying it here)
Yes, just because you see SIMD instructions doesn't mean the code is fast. You need the correct instructions.
Not relevant to this example, which is `popcount`ing only a single number, but AVX512 did also introduce SIMD popcount instructions for getting many inputs at a time. Also true for other useful bit-funs like leading or trailing zeros.
So if you're using zmm registers, you can do way better than that godbolt example.
autovectorization is hit and miss. I recently wrote this which has to count the bits in a result mask of vector operations and it turns into popcnt just fine.
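The pattern I mean is roughly this (nightly std::simd; the comparison and types are just an example):

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    /// Count how many lanes of `haystack` equal `needle`.
    fn count_matches(haystack: Simd<u8, 16>, needle: u8) -> u32 {
        let mask = haystack.simd_eq(Simd::splat(needle));
        // One bit per lane; count_ones() on the bitmask typically
        // compiles down to a single popcnt on x86_64.
        mask.to_bitmask().count_ones()
    }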
> This seems like a trick question… just an add right?
This sort of thing is part of why you'd often like to target an intermediate vector representation and let the compiler figure out the details. E.g., on Haswell chips you had multiple floating point execution units per core, and the CPU could definitely execute more than one pipelined FP operation simultaneously, but only one of those could be an `add` instruction. If you had a bunch of additions to do which didn't rely on the previously computed results (avoiding stalls), you could double your addition throughput by also dispatching a fused-multiply-add instruction where the multiplicand was just one. That _could_ execute at the same time as a normal vector FP addition.
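A scalar sketch of the trick, assuming a Haswell-like machine where plain FP adds contend for one port while FMAs can issue to another (the accumulator split is mine, and it changes the summation order, so the usual fast-math caveat applies):

    /// Sum a slice with two independent accumulators: one fed by plain adds,
    /// the other by fused multiply-adds with a multiplicand of 1.0, so the
    /// two dependency chains can issue to different execution ports.
    fn sum_two_ports(v: &[f64]) -> f64 {
        let mut acc_add = 0.0f64;
        let mut acc_fma = 0.0f64;
        let mut pairs = v.chunks_exact(2);
        for pair in &mut pairs {
            acc_add += pair[0];
            acc_fma = pair[1].mul_add(1.0, acc_fma); // x * 1.0 + acc
        }
        acc_add += pairs.remainder().iter().sum::<f64>();
        acc_add + acc_fma
    }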
Not OP, but one thing that surprised me was that if you are using Rust Simd in a library, and part of the code is marked #[inline] but other parts are not, you might see catastrophic performance regressions. We saw an issue where the SIMD version was over 10x slower because we missed marking one function as inline. Essentially rustc converted it from an intrinsic to a regular function call.
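A tiny sketch of the failure mode (names invented): the cross-crate call only stays cheap if the helper is marked #[inline], since non-generic functions are otherwise codegenned in the defining crate and called through the regular ABI.

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    // Without #[inline], a downstream crate (without LTO) calls this through
    // the regular ABI, and the Simd value can end up passed through memory
    // instead of staying in a vector register.
    #[inline]
    pub fn horizontal_sum(v: Simd<f32, 8>) -> f32 {
        v.reduce_sum()
    }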
> Improvements to NumPy’s performance are important to many users. We have focused this effort on Universal SIMD (see NEP 38 — Using SIMD optimization instructions for performance) intrinsics which provide nice improvements across various hardware platforms via an abstraction layer. The infrastructure is in place, and we welcome follow-on PRs to add SIMD support across all relevant NumPy functions
I don't think so. SIMD is more like a weird mix of vector processing and bizarre mathematical transformations originating from low-level facts, and the latter becomes more significant when SIMD targets shorter inputs.
I recently helped my friend who needed a fast CCSDS Reed-Solomon decoder in Python. The existing Python library was really slow, like 40 blocks per second, while the same algorithm written in C# claimed 10,000+ blocks per second. It turned out that the Python version made use of Numpy but so badly that it was much faster to eschew Numpy. After some additional optimizations to avoid modulos, my version easily handled ~2,000 blocks per second without Numpy.