Are there any good real-world applications that take heavy advantage of SIMD? I'd imagine it would be widespread given the benefits offered by SIMD, but I honestly have no idea.
This is something I've been torn on with all of the M1 benchmarks. All of the benchmarks saying "the M1 is so much better than my Intel machine at video work" are taking advantage of the hardware video encode/decode blocks in the M1 (and the unified memory between the GPU and video codecs).
Discounting their existence is entirely unfair, as one of the whole points of Apple Silicon is to give Apple the opportunity to put whatever hardware accelerates their envisioned use cases into their computers. Dedicated hardware is way more power efficient than software implementations.
However, what happens if you work with a video codec that Apple didn't build hardware support for? Software video codecs depend heavily on SIMD instructions to be performant.
In the first place, using the hardware encoder is only feasible if the output is up to your quality/size standards and is compatible with the decoders that are going to consume your content. If your goal is to quickly render near-lossless mp4/mkv files for uploading to YouTube, any regular old hardware encoder is probably fine. If your goal is to render out 6000kbps footage to store on your own CDN, the quality per bit becomes EXTREMELY IMPORTANT, and suddenly it may not be feasible to use a particular hardware encoder.
FWIW, NVIDIA has made significant improvements to quality for their hardware encoders in each of their last 3 generations, and you definitely saw reviewers and creatives talking about that in particular when it came to purchasing decisions.
Apple's encoder is probably quite good at least, but I don't think it's meaningful to consider it for most benchmarks. The scenarios where you both are willing to use the hardware encoder and care about how fast it is are relatively few and far between - if you're just doing a zoom call all that matters is whether it can pump out 60fps and how good it looks, not whether it uses 3% cpu instead of 5%. I'd rather see quality/bitrate comparisons of their encoder with x264, not benchmarks.
x264 and x265 on my M1 Mac mini perform at least as well as on my i9 16" MBP with the same settings. Neither is using any of the hardware acceleration available on either machine. The M1 also does well with FCP, which is cool, but the software encoding with the above tools is really impressive.
In the real world, though, anything but the lowest-end hardware will have cryptographic offloads in the CPU or the storage controller (or both). The M1 actually excels at AES throughput, for instance.
For esoteric/custom crypto, SIMD could play a part, but you'd need good reasons not to use standard crypto at higher speed for that to be your use case, which is why I say it'd be uncommon.
ripgrep, and to a lesser extent GNU grep, both do. Whenever you run a query and it seems to execute very quickly, it's almost certainly because of SIMD. GNU grep will use SIMD in some way for many patterns; ripgrep uses it in even more.
Does GNU grep actually make explicit use of SIMD via intrinsics or assembly, or just through autovectorization and/or calling libc methods like memchr that are vectorized under the covers?
Yeah, I was being a bit succinct. As far as I'm aware, GNU grep has no explicit SIMD in it other than memchr through glibc. While some libc implementations rely on auto-vectorization of sorts (musl comes to mind), glibc does have assembly implementations of memchr for several platforms that make explicit use of SIMD.
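To give a flavor of what those implementations boil down to, here's a minimal Rust sketch of the classic SSE2 loop (memchr_sse2 is a made-up name; the real glibc/ripgrep versions add alignment handling, unrolling, and runtime CPU detection):

    // Minimal sketch of the classic SSE2 memchr approach.
    // x86_64 only; SSE2 is baseline there, so no runtime detection needed.
    #[cfg(target_arch = "x86_64")]
    fn memchr_sse2(needle: u8, haystack: &[u8]) -> Option<usize> {
        use std::arch::x86_64::*;
        let mut i = 0;
        unsafe {
            // Broadcast the needle byte into all 16 lanes.
            let n = _mm_set1_epi8(needle as i8);
            while i + 16 <= haystack.len() {
                // Compare 16 haystack bytes against the needle at once.
                let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
                let eq = _mm_cmpeq_epi8(chunk, n);
                // One bit per lane; nonzero means at least one match.
                let mask = _mm_movemask_epi8(eq);
                if mask != 0 {
                    return Some(i + mask.trailing_zeros() as usize);
                }
                i += 16;
            }
        }
        // Scalar fallback for the trailing <16 bytes.
        haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
    }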
ripgrep does the same, except for Intel at least, its memchr is implemented in Rust using SIMD intrinsics explicitly. And it also has a specialized SIMD algorithm (taken from Hyperscan mostly) for dealing with multiple patterns: https://github.com/BurntSushi/aho-corasick/tree/8b479a60906d...
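And if you want that from Rust without writing intrinsics yourself, the memchr crate exposes it directly; a tiny usage sketch (assumes memchr as a dependency):

    // With the memchr crate in Cargo.toml (the same crate ripgrep builds on).
    fn main() {
        let haystack = b"the quick brown fox";
        // Returns the index of the first occurrence, if any.
        assert_eq!(memchr::memchr(b'q', haystack), Some(4));
    }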
Hyperscan takes this to a different level though. It has oodles more SIMD. I should have mentioned it in my original comment.
I was curious mostly because I don't recall ever seeing SIMD intrinsics in GNU code (OK, GIMP probably has them, so maybe I should say GNU utilities), so that would be a first.
It's interesting how much stuff leans on memchr; it's a shame there aren't systematic wider versions that take more bytes to avoid false positives on longer literals (ignoring wmemchr): those could be nice and fast with SIMD.
Yeah I think the wider versions get a lot more complicated. memchr is a bit of a sweet spot, since its implementation is relatively simple. And things like glibc end up implementing specialized versions of it for most architectures _and_ instruction set extensions. (So e.g., there's one for SSE2 and for AVX on x86_64.)
And then of course there's PCMPESTRI (and its variants), but that has largely been a failure because of its high latency. :-( That's a shame, because that instruction does accept substrings up to 16 bytes.
Yeah, I had some kind of brain fart thinking that, say, a 4-character memchr() could be just as fast using the native method, but no, of course it's 4x as slow (only an "aligned" memchr() like wmemchr() works like that). So yeah, it gets quite complicated quickly if you want it to be fast.
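To make the "complicated" part concrete, here's a hedged sketch of the usual trick for longer literals: compare the first and last needle bytes across 16 positions at once and only verify the candidates (memchr's memmem uses a tuned variant of this idea; find_sse2 is purely illustrative):

    // Candidate positions are where both the first and last needle bytes
    // line up; only those need a full comparison. Real implementations
    // pick rarer bytes, handle alignment, and avoid the panicky assert.
    #[cfg(target_arch = "x86_64")]
    fn find_sse2(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        use std::arch::x86_64::*;
        assert!(needle.len() >= 2);
        let end = haystack.len().checked_sub(needle.len())?;
        unsafe {
            let vfirst = _mm_set1_epi8(needle[0] as i8);
            let vlast = _mm_set1_epi8(needle[needle.len() - 1] as i8);
            let mut i = 0;
            while i + 16 <= end + 1 {
                let a = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
                let b = _mm_loadu_si128(
                    haystack.as_ptr().add(i + needle.len() - 1) as *const __m128i);
                let cand = _mm_and_si128(_mm_cmpeq_epi8(a, vfirst),
                                         _mm_cmpeq_epi8(b, vlast));
                let mut mask = _mm_movemask_epi8(cand) as u32;
                while mask != 0 {
                    let off = i + mask.trailing_zeros() as usize;
                    if &haystack[off..off + needle.len()] == needle {
                        return Some(off);
                    }
                    mask &= mask - 1; // clear the lowest set bit
                }
                i += 16;
            }
            // Scalar check for the last few starting positions.
            (i..=end).find(|&j| &haystack[j..j + needle.len()] == needle)
        }
    }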
Raytracing. Ray/BVH traversal, bbox intersection, and ray/primitive intersection can all be SIMD-ified fairly well (very well up to 4-wide, fairly well up to 8-wide, depending on how things are packed).
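To illustrate the 4-wide case, a rough Rust sketch of a slab test for one ray against four AABBs at once (Aabb4/hit4 are made-up names; production BVH code like Embree's additionally handles degenerate directions, NaNs, and tighter packing):

    // Four boxes stored SoA so each SSE lane is one box.
    #[cfg(target_arch = "x86_64")]
    struct Aabb4 {
        min: [[f32; 4]; 3], // min x/y/z for four boxes
        max: [[f32; 4]; 3],
    }

    // Returns a 4-bit mask; bit i set => ray hits box i.
    #[cfg(target_arch = "x86_64")]
    fn hit4(origin: [f32; 3], inv_dir: [f32; 3], boxes: &Aabb4) -> u32 {
        use std::arch::x86_64::*;
        unsafe {
            let mut tmin = _mm_set1_ps(0.0);
            let mut tmax = _mm_set1_ps(f32::INFINITY);
            for axis in 0..3 {
                let o = _mm_set1_ps(origin[axis]);
                let inv = _mm_set1_ps(inv_dir[axis]);
                let lo = _mm_loadu_ps(boxes.min[axis].as_ptr());
                let hi = _mm_loadu_ps(boxes.max[axis].as_ptr());
                // Slab entry/exit distances for all four boxes at once.
                let t0 = _mm_mul_ps(_mm_sub_ps(lo, o), inv);
                let t1 = _mm_mul_ps(_mm_sub_ps(hi, o), inv);
                tmin = _mm_max_ps(tmin, _mm_min_ps(t0, t1));
                tmax = _mm_min_ps(tmax, _mm_max_ps(t0, t1));
            }
            _mm_movemask_ps(_mm_cmple_ps(tmin, tmax)) as u32
        }
    }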
SIMD is big in computer graphics: you'd expect it to be used heavily in physics simulation and CPU-based rendering engines. It would be interesting to see how Blender's Cycles or, say, Arnold Renderer performs on the M1.
It's very common in image- and audio-processing algorithms, and lots of general-purpose libraries use SIMD instruction sets. Even things like memcpy or String.IndexOf are vectorized in modern runtimes. IIRC Facebook released a very carefully tuned hashmap that uses SIMD instructions for many of its search operations.
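That hashmap sounds like folly's F14 (Google's SwissTable does something similar); the core probing trick fits in a few lines, sketched here in Rust (match_group is a made-up name):

    // Each table group keeps one metadata byte per slot. A single
    // 16-byte compare finds every slot whose byte matches the hash
    // fragment h2; the caller then checks only those candidate keys.
    #[cfg(target_arch = "x86_64")]
    fn match_group(ctrl: &[u8; 16], h2: u8) -> u32 {
        use std::arch::x86_64::*;
        unsafe {
            let group = _mm_loadu_si128(ctrl.as_ptr() as *const __m128i);
            let eq = _mm_cmpeq_epi8(group, _mm_set1_epi8(h2 as i8));
            // One bit per candidate slot.
            _mm_movemask_epi8(eq) as u32
        }
    }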
JPEGs (libjpeg-turbo), video decode/encode (libavcodec), encryption/decryption, the ZFS filesystem: any workload that needs high throughput. SIMD is the next generation of computing; without it we would have very, very slow computers. Technologies like HEVC don't work well without SIMD; they're designed around it. Take a look at ffmpeg and see the tremendous amount of hand-written SIMD assembly for various platforms.
All of them? I'd need a source on that. Also, how do you think these algorithms became popular in the first place? If implementing an algorithm around SIMD didn't give proper performance, they wouldn't exist.
You have to understand that current software exists because it solved a goal. If your algorithm first had to be implemented in hardware, nobody could make anything. Without SIMD these projects simply wouldn't be able to exist. That's the hard math of it.
Anything that does number crunching can benefit from SIMD (just look at pretty much any modern compiler output on godbolt, icache be damned).
The tradeoff with using it is basically between instruction density, power (AVX-512), and latency (GPUs are seriously powerful, but getting the data going takes time and a lot of driver bullying).
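To see the godbolt point concretely: both of the loops below compile to packed SIMD with rustc -O (or the gcc/clang -O2 equivalents). Note that a plain float sum usually won't auto-vectorize, since the compiler can't reorder FP additions without fast-math-style flags:

    // Elementwise float ops vectorize directly.
    pub fn scale(v: &mut [f32], k: f32) {
        for x in v.iter_mut() {
            *x *= k;
        }
    }

    // Integer reductions vectorize too (wrapping_mul keeps the
    // semantics identical in debug and release builds).
    pub fn sum_squares(v: &[i32]) -> i32 {
        v.iter().map(|&x| x.wrapping_mul(x)).sum()
    }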
If neural net inference is a small part of a larger computation, doing it locally (in cache) on the CPU (with AVX-512 instructions) can be a big win (for example, https://NN-512.com)