I vaguely remember that it could allow high-end CPUs to use something similar to register renaming, i.e. stack locations like [rsp + 96] could stay in physical registers during code execution (high-end CPUs often have more physical registers than logical ones), but I couldn't find a good reference confirming that such an optimization is actually used in practice.
> Intrinsic vmulq_f32(x, c0, 0) multiplies each lane of register x with lane 0 from register c0.
I don't remember seeing anything like that NEON instruction in AltiVec or SSE so it didn't even occur to me to look for it when I ported some SIMD code to NEON; now I'm going to go back and look for opportunities to use it!
That should be vmulq_lane_f32(), but yeah, lane broadcast is free on a number of NEON operations. Many operations also have built-in narrowing, widening, saturation, and rounding. One of the more ridiculous ones is vqrdmlah_lane_s16(), which translates to: signed saturating rounding doubling multiply accumulate returning high half (with a lane broadcast).
The downside is that the latencies can be a bit high sometimes compared to other CPUs. 128-bit vector integer adds, for instance, have 2c latency even on an Apple M1.
Another thing to watch out for is that some NEON guides are outdated and only tell you about ARMv7 features, missing some goodies added in ARMv8 like horizontal operations (vaddv) and rounding on conversions other than truncate.
On Intel/AMD you need AVX-512 support to get the instructions with broadcast (and many other goodies that are missing from SSE/AVX/AVX2).
Intel had such instructions many years before ARM (i.e. since Larrabee), but it has chosen to provide them only in its high-end CPUs, annoying programmers who would like high performance but not the burden of developing for a fragmented instruction set.
Whether the technique described here will actually be faster is pretty application-dependent. The problem is that, on x86, shuffle instructions are the bottleneck for many algorithms (at least the type that I often work with). Storing constants this way requires adding an extra shuffle each time that you need to broadcast one of the constants back to a vector register, which exacerbates the bottleneck. In these cases, I’ve found that light spilling to the stack actually performs better.
Yeah, I haven't checked recent Intel/AMD processors within the last few years, but it used to be that on Intel CPUs only port 5 could execute shuffles, so code with fairly heavy shuffle usage could bottleneck on that one port.
This is probably the right place to grumble about the fact that it took Intel until ~2020 to stop selling chips that still don't have AVX/AVX2 in them, a solid decade and change after the tech was introduced.
According to the Steam hardware survey as of Feb 2024, almost 8% of the market still doesn't have access to AVX2. Which for many practical purposes means falling back to SSE for everything, unless you want the complexity of dynamically dispatching functions according to instruction set (https://news.ycombinator.com/item?id=24577069). Even for significant performance gains, I usually don't.
The reason that whole rant is relevant is that the trick in the article (which is simple and short) doesn't even have a cost in SSE code! In SSE you have to do a load and a shuffle anyway, since there is no lane-broadcast instruction. (The set1_ps intrinsic hides this from you; it compiles to a broadcast instead if the compiler is targeting AVX.) Since you're already shuffling, there's no downside to this trick, only reduced register pressure.
In the footnotes, there is an HTML injection after: "Consider the following code that calculates cosine using numerical methods:"
PS. Recently, we tried turning on `-fno-omit-frame-pointer` in ClickHouse and tested its impact. While there is no noticeable change in most of the tests, some tests show up to 20% degradation due to the reduced number of available registers.
Unfortunately, I think more often than not it causes performance regressions, and in some cases it may even cause unnecessary stack spilling of sensitive data: https://github.com/rust-lang/rust/issues/88930#issuecomment-...