Great article, and it’s pretty interesting to see portable simd in use. I tried reproducing the benchmarks on a Zen 3 system, and I’m getting identical speedups. On my M1 MBP, the perf gains start around 1.4x, gradually increasing to 2x @ 110 bytes of input length. A bit smaller gains than on x86_64, but I’d say it fulfils its goal.
Although looking at the code, this article confirms my experience that Rust has rather poor ergonomics for simd and pointer-related work (and performance engineering in general).
Haven't read through the article yet, but as a Rust eng... I sort of agree, but ... I mean, pointer & raw memory related work is deliberately kneecapped in the name of safety, the language wants you to really think about what you're doing.
But yep portable SIMD in Rust is still not a good story, compared to C++. And getting down to raw byte regions, pointer & buffer manipulation, etc requires becoming comfortable with Pin, MaybeUninit, etc. Both portable simd and allocator_api have been sitting unstable for years. And the barrier to entry is a bit higher, and it's more awkward ... mostly on purpose
But there's nothing stopping one from building one's own abstractions (or using 3rd party crates etc) to make these things more ergonomic within one's own program?
Being more ergonomic than C++ is a low bar anyways, we should hold ourselves to a higher standard. For starters, field offset syntax is painful (see Gankra's post on this topic) and pointer manipulation lacks syntactic sugar. I also believe unsigned integer promotion and UFCS would improve ergonomics with high-perf code (although I'm aware this is not going to be included in Rust).
At that point it is better to have some kind of DSL that should not be in the main language, because it would target a much lower level than a typical program. The best effort I've seen in this scene was Google's Highway [1] (not to be confused with HighwayHash) and I even once attempted to recreate it in Rust, but it is still a bit far from my ideal.
I agree about having a DSL. It may not need to be low level, you could have an array language like APL/J. But glsl would work for me too, the important thing is to lure developers to expose parallelism.
The kind of things you can use direct SIMD for is a reasonably different space from array languages; while some things can translate reasonably, APL-likes get much less pretty on operations where elements don't have homogeneous meaning/mapping, e.g. the base64 conversion of OP. (for reference, I work on https://github.com/dzaima/CBQN, an implementation of BQN, an APL derivative, and we have a DSL for implementing SIMD primitives for it (and sometimes scalar ones too): https://github.com/mlochbaum/Singeli)
Check out the Clang extended vector extension for C, IMHO that's the perfect way to add portable SIMD to a language. Doing it through a library will always be clumsy.
It quickly runs into __builtin functions though, which are not very different from intrinsics with better, portable names. Rust's `Simd` type is much more consistent with the corresponding scalar types compared to that.
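To illustrate what I mean by consistency, here's a minimal sketch with nightly std::simd (the function and values are just made up for the example):

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    fn demo(a: Simd<u32, 8>, b: Simd<u32, 8>) -> Simd<u32, 8> {
        // Same operators as scalar u32: +, &, shifts, comparisons, ...
        let sum = a + b;
        let masked = sum & Simd::splat(0xff);
        // Lane-wise compare yields a Mask, which can select between vectors.
        let lt = a.simd_lt(b);
        lt.select(masked, sum)
    }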
True, I think the builtins were necessary where the existing C operators don't map well to a specific SIMD feature (from what I've seen it's mostly "reduce" actions to reduce SIMD lanes into a single value). Still, a lot can be done without having to pepper the code with builtin functions, and the result is much more readable than purely intrinsic-based code.
But a "new" language like Rust would have the freedom to add special language syntax and operators for such things instead of going the C++ way of "just add it to the stdlib" (which IMHO is almost always the wrong place... it should either go into a regular cargo dependency, or into the language itself).
It is crazy sometimes how you try to program something the best you ever could with classical C++, and then someone comes along and makes a version using SIMD that is over 10x faster (but less portable code).
I really wish compilers were better at auto vectorization. And I wish there were some language support for annotations that locally allow reordering some operations.
Good SIMD code requires careful consideration of how your data is laid out in memory. This makes auto vectorization really difficult since the compiler can’t fix your data for you outside of very local contexts
There are unfortunately a lot of serial guarantees that are unavoidable even if compilers could optimize perfectly. For example: 'for(double v: vec) sum+=v'
Floating point addition is not associative, so summing each value in order is not the same as summing every 8th element into partial sums and then combining them, which is how SIMD handles it. So even though this is an obvious optimization for compilers, they will prioritize the serial guarantee over the optimization unless you tell them to relax that particular guarantee.
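To make the reassociation concrete, here's a rough sketch of the order a SIMD sum actually uses (nightly std::simd, my own toy example); the compiler won't pick this on its own precisely because it can change the result:

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    fn simd_sum(v: &[f64]) -> f64 {
        let mut acc = Simd::<f64, 8>::splat(0.0);
        let mut chunks = v.chunks_exact(8);
        for chunk in &mut chunks {
            // Eight running partial sums: a different association order
            // than the strictly left-to-right serial loop.
            acc += Simd::from_slice(chunk);
        }
        acc.reduce_sum() + chunks.remainder().iter().sum::<f64>()
    }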
It's a mess and I agree with janwas: use a library (and in particular: use Google Highway) or something like Intel's ISPC when your hot path needs this.
Hence my suggestion for language support. Add some language syntax where you can say: "add these doubles, order doesn't matter", which allows the compiler to use SIMD
-ffast-math is global though and can break other libraries. I mean a local, vectorization-compatible but portable syntax, without actually writing the SIMD yourself (let the compiler do the work).
I'm guilty of having used it, but from the docs: "The optimize attribute should be used for debugging purposes only. It is not suitable in production code."
One of the things I struggled with in Rust’s portable_simd is that I think fadd_fast and simd intrinsics don’t mix well. This makes writing dot products only ok, rather than really easy. Here’s (one) reference: https://internals.rust-lang.org/t/suggestion-ffastmath-intri...
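For reference, the workaround is to spell out the reassociation by hand instead of relying on fadd_fast; a rough sketch with nightly std::simd (the names and lane count are just my choices):

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    fn dot(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len());
        let mut acc = Simd::<f32, 8>::splat(0.0);
        let mut ca = a.chunks_exact(8);
        let mut cb = b.chunks_exact(8);
        for (x, y) in (&mut ca).zip(&mut cb) {
            // Lane-wise multiply, accumulate into 8 partial sums; the
            // reduction order is explicit, so no fast-math flag is needed.
            acc += Simd::from_slice(x) * Simd::from_slice(y);
        }
        let tail: f32 = ca
            .remainder()
            .iter()
            .zip(cb.remainder())
            .map(|(x, y)| x * y)
            .sum();
        acc.reduce_sum() + tail
    }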
> It is crazy sometimes how you try to program something the best you ever could with classical C++, and then someone comes along and makes a version using SIMD that is over 10x faster (but less portable code).
But that’s one of the points of a systems programming language (of which C++ is one) — it tries to be as efficient as possible in a portable way, but makes it easy to do target-specific programming when required.
> I really wish compilers were better at auto vectorization
FORTRAN compilers sure are, since aliasing is not allowed. C++ is kneecapped by following C’s memory model.
But why is C++ refusing to add "restrict" to its spec? C has restrict in its spec, C++ has not. Of course compilers inherit restrict from C, but it's not super portable due to C++ not officially supporting it. But why?!
In or out of the spec doesn't really matter. You can use __restrict__ in C++ and any compiler will diligently accept it, apart from the occasional warning.
The larger issue is that C and C++ have convoluted aliasing rules which make reasoning about and flagging non-aliased regions hard. The most innocent of casts can break strict aliasing rules and send you into horrible debugging scenarios. The sad side effect being people adding -fno-strict-aliasing to -O2/-O3.
I was wondering the same! I just recently came across the restrict keyword (thanks to someone mentioning it on HN) but I have never seen it being mentioned before, even though I was looking into all sorts of optimizations (also ChatGPT would never bring it up on its own).
Since with templating everything seems to be going header only anyways, I feel like we should give the compiler more power to reason about the memory layout, even through multiple levels of function invocation/inlining.
I wrote some library where basically all shapes can be inferred at compile time (through templates), but the compiler only rarely seems to take advantage of this.
It also sucks that you have no way of knowing at build time whether autovectorisation succeeded. Even if you have it working initially it can silently break and you don't know where. So using clang, you're careful and you check the generated code and everything is great, then later it isn't and you have to go hunting to find out why.
Fortran compilers can assume that (many) dummy arguments to a particular procedure reference do not alias each other, true. But there are many other ways in which aliasing can happen in a conforming Fortran program.
DPC++/SYCL might be more interesting in this area (GPU code looking like evolved C++).
ISPC is an interesting take on 'building a vector DSL which can easily be integrated in a C++ build-chain'.
Though calling GPUs SIMD nowadays is kind of reductive, since you also have to contend with a very constrained memory hierarchy, and you don't write or think of your code as SIMD so much as independent scalar threads (SIMT). Also they don't give access to most bit-tricks, and the rigmarole of shuffles one would expect from a SIMD processor.
> Though calling GPUs SIMD nowadays is kind of reductive, since you also have to contend with a very constrained memory hierarchy, and you don't write or think of your code as SIMD so much as independent scalar threads (SIMT). Also they don't give access to most bit-tricks,
Oh rly?
* Popcnt is defined in ROCm and CUDA.
* Shifts, XORs, AND, OR, NOT are all defined in GPUs.
* You can actually do AES and a lot of other bit-level twiddling in GPU space very efficiently (e.g. see any mining software for highly efficient bit-twiddling).
Yeah, not as many tricks (ex: pext / pdep) as a CPU. But popcnt is the "big" one. GPUs also have bit-reverse instructions (which reduce the need for LS1B instructions like CPUs have, because you can do LS1B tricks on "both ends" by just bit-reversing).
But honestly, GPUs are so parallel and Register-space is so plentiful that you can probably just build whatever you want out of AND/OR/NOT/XOR instructions alone.
And since GPUs have single-cycle "popcnt", any symmetric bit-twiddling function can be built off of popcnt within a few cycles.
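To show what I mean by doing the LS1B trick on "both ends" via bit-reversal, here's a plain scalar Rust sketch of the idea (nothing GPU-specific, just the bit logic):

    /// Isolate the lowest set bit (the classic LS1B idiom: x & -x).
    fn ls1b(x: u32) -> u32 {
        x & x.wrapping_neg()
    }

    /// Isolate the highest set bit by bit-reversing, taking the LS1B,
    /// and reversing back: the "do it on both ends" trick.
    fn ms1b(x: u32) -> u32 {
        ls1b(x.reverse_bits()).reverse_bits()
    }

    fn main() {
        assert_eq!(ls1b(0b1011_0100), 0b0000_0100);
        assert_eq!(ms1b(0b1011_0100), 0b1000_0000);
    }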
> and the rigmarole of shuffles one would expect from a SIMD processor.
Abuse of __shared__ memory and the full crossbar of a GPU-core means that you can arbitrarily shuffle data through __shared__ memory in like 5 clock ticks (well... in certain ways... if you do it wrong it'd take many clock ticks).
GPUs are actually far superior to AVX512 on this front. NVidia's shfl.bfly instruction can implement FFTs, even-odd sorting, bitonic sorts, and more as high-speed communications.
Trust me on this case: GPU data-movement is far, far, far superior to CPU data-movement.
Intel does NOT have the bpermute or permute instructions like a GPU does. Meanwhile, both AMD and NVidia GPUs have bpermute / permute for gather- and scatter-like operations / distribution of data across lanes.
AMD/NVidia also have guarantees upon the speed of broadcasts across __shared__ memory. And butterfly-shuffles can theoretically implement any arbitrary shuffle you want in log2(SIMD-Width) steps, so we have incredible amounts of tools in GPU space that CPU-programmers have no idea about.
Thanks for the enlightenment here, some of those I didn't know. Being used to explicit vectors, coming from avx512 proficiency and pdep/pext and gfni rabbit holes, I kind of forgot all of that when going to cuda. Trying to avoid the trap of 'doing C in Rust (or Ada, hopefully the meaning is clear)', I jumped to new idioms, and I must say most of those you cited never appeared in most high performance code I've read; some I only saw perusing ptx and lower-level compiled code, which... I was never sure nvidia would maintain over time. Seems for us it's cub, barriers, atomics and ballots.
It seems I have lots of reading to do and lots of ways to improve my sorting networks / counting sort implementations.
> I was never sure nvidia would maintain over time.
PTX is maintained over time. It's a high-level assembly, so to speak; the full details of the machine remain abstracted so that code can be more portable.
SASS is not. SASS changes from architecture to architecture. SASS is the actual machine code of NVidia cards. There's an overall understanding of SASS in the GPU world, but it's not really documented and you "shouldn't" want to learn about it.
--------
I should note that Intel's "pshufb" instruction is very similar to the permute instruction in NVidia/AMD. So yeah, there's a high-speed generic shuffle that's key to Intel/AMD AVX512 code.
But having the backwards-direction (bpermute) available too, as well as __shared__ memory for all other cases is great.
I saw the proof in some Binary-decision-diagram book that I've forgotten the name of.
The gist is that if you have a symmetric function (ie: order doesn't matter) of say, 8-bits, then that means f(10101010) == f(11110000) == f(00001111), and all such combinations. Because this is the very definition of "order doesn't matter".
This necessarily implies that f(bits) can be rewritten into the form of f(bits) == g(popcnt(bits)).
Something to do with counting up all the possibilities of "order doesn't matter" and then pigeon-holing them into popcnt combinations or something really mathematical like that. I'm sorry I don't remember the proof, but hopefully this is enough to give you the gist of the idea.
------------------
For example: XOR is the simplest symmetric function. We can see that XOR can be rewritten as XOR(bits) == g(popcnt(bits)), where g is "take the bottom bit of popcnt".
A lot of the "power" of BDDs is their ability to uncannily decompose functions into symmetric and non-symmetric parts. Not consistently mind you, but if a BDD is well-behaved (low-memory space, high-speeds, etc. etc.), its likely because a large part of the calculations happened to be symmetric.
This can be beneficial with, say, addFourDWORDs(a, b, c, d), where a+b+c+d has a "partial" level of symmetry (a_0 XOR b_0 XOR c_0 XOR d_0 determines the output_0 bit, the 1st bit is related to XOR(a1, b1, c1, d1), etc.). So there are all kinds of "hidden popcounts" that could pop up in practice for random functions (i.e. "partially symmetric"), though it's a puzzle how to exactly decompose arbitrary bit-functions into this form.
And even if you do, there's no guarantee that the popcnt form is actually faster either. But it does give you some "trick" to try when optimizing an arbitrary bitwise function into a high-speed routine.
-----------------
Other symmetric functions (assuming 8-bit numbers) can be written the same way; a few are sketched below.
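These are my own picks, written in the f(bits) == g(popcnt(bits)) form; two bytes with the same popcount are bit-permutations of each other, so a symmetric f can't tell them apart:

    /// Parity (XOR of all bits): g = "take the bottom bit of popcnt".
    fn parity(x: u8) -> u8 {
        x.count_ones() as u8 & 1
    }

    /// AND of all bits: g = "is popcnt equal to 8?".
    fn all_ones(x: u8) -> bool {
        x.count_ones() == 8
    }

    /// OR of all bits: g = "is popcnt nonzero?".
    fn any_one(x: u8) -> bool {
        x.count_ones() != 0
    }

    /// Strict majority: g = "is popcnt at least 5?".
    fn majority(x: u8) -> bool {
        x.count_ones() >= 5
    }

    fn main() {
        // Same popcount => same output, e.g. f(0b10101010) == f(0b11110000).
        assert_eq!(parity(0b1010_1010), parity(0b1111_0000));
        assert_eq!(majority(0b1010_1010), majority(0b1111_0000));
        assert!(!all_ones(0b1010_1010) && any_one(0b1010_1010));
    }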
Small note - while the compiler wasn't able to optimize that specific popcount implementation to a single instruction, it can for some other implementations, though it's of course very picky: https://godbolt.org/z/T69KxWWW8
Found a canvas-based library for this: https://larsjung.de/pagemap/. Definitely not what OP uses, where the minimap is a shrunk copy of the content markup, with all the drawbacks, such as page search finding the same item twice.
How does this compare to fastbase64[0]? Great article, I'm happy to see this sort of thing online. I wish I could share the author's optimism about portable SIMD libraries.
hm, any tool that enables people to write more SIMD is probably a good thing, but personally I find it preferable to have the SIMD integrated into the same toolchain. This allows us to (inlined!) call back into C++, use templates/classes in the SIMD code, and also inline together multiple SIMD code regions.
I agree implementing dynamic dispatch is difficult, but Highway has taken care of that :)
Highway is a step to the right direction, but more difficult to use compared to ISPC - e.g. handling remainders is mostly manual and somewhat error-prone.
Just saw your comment. I agree that remainders are tricky.
We do have some ready-made algorithms in hwy/contrib/algo and hwy/contrib/unroller which avoid remainder handling. Any thoughts on how this can be improved?
Yes, extremely easy. ISPC emits objects and header files, which can be directly linked/included from C++. Dynamic dispatching happens automatically and the widest version of the called function gets selected according to the runtime architecture.
Ah, it's just not your field of work. Just like the average person is not a software engineer or a physicist or... you get the point. A few months of dedicated study and you would be able to do similar work for sure.
If you ever had occasion to have an employer and/or project where this was called for, you could probably "be this smart"; it just comes down to interest and necessity.
I dip in and out of these waters (performance optimization, more systems bare metal engineering), but basically on personal projects. I wish I had jobs where it was more called for, but it's not what most industry work needs.
Try doing AoC '23 in APL/J/K, BQN, or Python/numpy (meaning, not idiomatic Python, but everything with numpy) or CUDA etc. It's fun and it will teach you these types of smarts; a lot of the post is very natural in how you think about solving things in those languages. After a while you start seeing problems in that form.
Interesting article! Right at the start, the first example of a non-vectorized popcnt implementation is said to produce “comically bad code, frankly”, but in release mode with the native target CPU, it does appear to be vectorising the function fairly ok?
LLVM has a LoopIdiomRecognize pass which will, amongst other patterns, try to detect a loop implementing popcount [1] and convert it to the llvm.ctpop.* intrinsic [2]. I've not looked at why it's not matching this loop, but the sibling comment suggesting to just use `.count_ones` seems like the right answer for Rust. In C or C++ I'd strongly recommend using `__builtin_popcount` (provided by both Clang and GCC).
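For completeness, a sketch of the two Rust spellings being compared here (the loop is an approximation of the kind of implementation that can trip the pass up, not the article's exact code):

    // Hand-rolled loop: may or may not be recognized by LoopIdiomRecognize.
    fn popcount_loop(mut x: u32) -> u32 {
        let mut count = 0;
        while x != 0 {
            count += x & 1;
            x >>= 1;
        }
        count
    }

    // Idiomatic Rust: lowers straight to llvm.ctpop, so you get a `popcnt`
    // instruction whenever the target supports it.
    fn popcount_intrinsic(x: u32) -> u32 {
        x.count_ones()
    }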
Instruction counts are not everything. The rust assembly is completely branchless, and on my M2 MacBook I'm benching it at ~2ns per call vs ~10ns per call with the equivalent C++ code (running this: https://godbolt.org/z/nTanvfs76)
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_cpp 10.1 ns 10.1 ns 69503053
BM_rs 1.74 ns 1.72 ns 393022172
I could be doing something horribly wrong here - this was a nice intro into a few different things I haven't used before. But I ended up taking the rust function and outputting the LLVM IR for it, then compiling and linking that with the C++ benchmark. And it worked, after I realised how to get the optimizer to stop removing everything.
Instruction count is not everything but when something is 10x larger, unless the 1x is extremely complicated, I think it is safe to expect that the 10x version will run slower. This 1x version is extremely trivial despite that one branch.
.L4:
mov edx, edi
and edx, 1
cmp edx, 1
sbb eax, -1
shr edi
jne .L4
ret
In the first run, jne might be mispredicted (equating to a pipeline flush, ~15 cycles) but in all the other consecutive runs it will be predicted correctly. This means that the larger the amount of iterations is, cost of the first misprediction becomes more and more negligible.
Vectorized execution OTOH is not cheap and does not necessarily result in faster code so purely seeing it in assembly would not mean much without actually measuring it. Scalar versions of the same code may be faster than the compiler-generated auto-vectorization code.
That said, I am surprised by the results - in my mental model of how things work this does not add up. Rust assembly is ~50 instructions, roughly ~30 of them being vector (SIMD) instructions and the rest of ~20 instructions being the superset of (only) 6 instructions used in C++ assembly. I cannot see how those 6 instructions, even with the branch mispredict cost of ~15 cycles, can run ~2.5x slower than the pretty much convoluted version of the SIMD popcnt.
> I cannot see how those 6 instructions, even with the branch mispredict cost of ~15 cycles, can run ~2.5x slower than the pretty much convoluted version of the SIMD popcnt.
The quick-bench tool gives some nice assembly-level timing information[1] but I couldn't get it to work with my janky inline-copy-paste-assembly, and I blew past my curiosity time budget for this. You seem pretty experienced and I'd love to know more about this, so maybe if you know how to get the "popcnt_rs" extern assembly function to be inlined in the quick-bench link below we could take a look in more detail?
There's a data dependency between those 6 instructions, so the out-of-order execution engine won't be able to kick in as much, but still... each of those 6 instructions takes only 1-2 cycles to execute. Perhaps a better question to ask is: how many loop iterations are needed before 6 x {1, 2} cycles outweighs the SIMD version?
(I wrote this initially as an EDIT in my response above but since you replied at about the same time, I am copying it here)
Yes, just because you see SIMD instructions doesn't mean the code is fast. You need the correct instructions.
Not relevant to this example, which is `popcount`ing only a single number, but AVX512 did also introduce SIMD popcount instructions for getting many inputs at a time. Also true for other useful bit-funs like leading or trailing zeros.
So if you're using zmm registers, you can do way better than that godbolt example.
autovectorization is hit and miss. I recently wrote this which has to count the bits in a result mask of vector operations and it turns into popcnt just fine.
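The pattern I mean is roughly this (nightly std::simd; the comparison and types are just an example):

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    /// Count how many lanes of `haystack` equal `needle`.
    fn count_matches(haystack: Simd<u8, 16>, needle: u8) -> u32 {
        let mask = haystack.simd_eq(Simd::splat(needle));
        // One bit per lane; count_ones() on the bitmask typically
        // compiles down to a single popcnt on x86_64.
        mask.to_bitmask().count_ones()
    }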
> This seems like a trick question… just an add right?
This sort of thing is part of why you'd often like to target an intermediate vector representation and let the compiler figure out the details. E.g., on Haswell chips you had multiple floating point execution units per core, and the CPU could definitely execute more than one pipelined FP operation simultaneously, but only one of those could be an `add` instruction. If you had a bunch of additions to do which didn't rely on the previously computed results (avoiding stalls), you could double your addition throughput by also dispatching a fused-multiply-add instruction where the multiplicand was just one. That _could_ execute at the same time as a normal vector FP addition.
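A scalar sketch of the trick, assuming a Haswell-like machine where plain FP adds contend for one port while FMAs can issue to another (the accumulator split is mine, and it changes the summation order, so the usual fast-math caveat applies):

    /// Sum a slice with two independent accumulators: one fed by plain adds,
    /// the other by fused multiply-adds with a multiplicand of 1.0, so the
    /// two dependency chains can issue to different execution ports.
    fn sum_two_ports(v: &[f64]) -> f64 {
        let mut acc_add = 0.0f64;
        let mut acc_fma = 0.0f64;
        let mut pairs = v.chunks_exact(2);
        for pair in &mut pairs {
            acc_add += pair[0];
            acc_fma = pair[1].mul_add(1.0, acc_fma); // x * 1.0 + acc
        }
        acc_add += pairs.remainder().iter().sum::<f64>();
        acc_add + acc_fma
    }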
Not OP, but one thing that surprised me was that if you are using Rust Simd in a library, and part of the code is marked #[inline] but other parts are not, you might see catastrophic performance regressions. We saw an issue where the SIMD version was over 10x slower because we missed marking one function as inline. Essentially rustc converted it from an intrinsic to a regular function call.
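A tiny sketch of the failure mode (names invented): the cross-crate call only stays cheap if the helper is marked #[inline], since non-generic functions are otherwise codegenned in the defining crate and called through the regular ABI.

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    // Without #[inline], a downstream crate (without LTO) calls this through
    // the regular ABI, and the Simd value can end up passed through memory
    // instead of staying in a vector register.
    #[inline]
    pub fn horizontal_sum(v: Simd<f32, 8>) -> f32 {
        v.reduce_sum()
    }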
> Improvements to NumPy’s performance are important to many users. We have focused this effort on Universal SIMD (see NEP 38 — Using SIMD optimization instructions for performance) intrinsics which provide nice improvements across various hardware platforms via an abstraction layer. The infrastructure is in place, and we welcome follow-on PRs to add SIMD support across all relevant NumPy functions
I don't think so. SIMD is more like a weird mix of vector processing and bizarre mathematical transformations originating from low-level facts, and the latter becomes more significant when SIMD targets shorter inputs.
I recently helped my friend who needed a fast CCSDS Reed-Solomon decoder in Python. The existing Python library was really slow, like 40 blocks per second, while the same algorithm written in C# claimed 10,000+ blocks per second. It turned out that the Python version made use of Numpy but so badly that it was much faster to eschew Numpy. After some additional optimizations to avoid modulos, my version easily handled ~2,000 blocks per second without Numpy.