Are there any good real-world applications that take heavy advantage of SIMD? I'd imagine it would be widespread given the benefits offered by SIMD, but I honestly have no idea.
This is something I've been torn on with all of the M1 benchmarks. All of the benchmarks saying "the M1 is so much better than my Intel machine at video work" are taking advantage of the hardware video encode/decode blocks in the M1 (and the unified memory between the GPU and video codecs).
Discounting their existence is entirely unfair, as one of the whole points of Apple Silicon is to give Apple the opportunity to put whatever hardware accelerates their envisioned use cases into their computers. Dedicated hardware is way more power efficient than software implementations.
However, what happens if you work with a video codec that Apple didn't build hardware support for? Software video codecs depend heavily on SIMD instructions to be performant.
In the first place, using the hardware encoder is only feasible if the output is up to your quality/size standards and is compatible with the decoders that are going to consume your content. If your goal is to quickly render near-lossless mp4/mkv files for uploading to YouTube, any regular old hardware encoder is probably fine. If your goal is to render out 6000kbps footage to store on your own CDN, the quality per bit becomes EXTREMELY IMPORTANT, and suddenly it may not be feasible to use a particular hardware encoder.
FWIW, NVIDIA has made significant improvements to quality for their hardware encoders in each of their last 3 generations, and you definitely saw reviewers and creatives talking about that in particular when it came to purchasing decisions.
Apple's encoder is probably quite good at least, but I don't think it's meaningful to consider it for most benchmarks. The scenarios where you both are willing to use the hardware encoder and care about how fast it is are relatively few and far between - if you're just doing a zoom call all that matters is whether it can pump out 60fps and how good it looks, not whether it uses 3% cpu instead of 5%. I'd rather see quality/bitrate comparisons of their encoder with x264, not benchmarks.
x264 and x265 on my M1 Mac mini perform at least as well as on my i9 16" MBP with the same settings. Neither is using any of the hardware acceleration available on either machine. The M1 also does well with FCP, which is cool, but the software encoding with the above tools is really impressive.
In the real world, though, anything but the lowest-end hardware will have cryptographic offloads in the CPU or the storage controller (or both). The M1 actually excels at AES throughput, for instance.
For esoteric/custom crypto, SIMD could play a part, but you'd need good reasons not to use standard crypto at higher speed for that to be your use case, which is why I say it'd be uncommon.
ripgrep, and to a lesser extent GNU grep, both do. Whenever you run a query and it seems to execute very quickly, it's almost certainly because of SIMD. GNU grep will use SIMD in some way for many patterns; ripgrep uses it in even more.
Does GNU grep actually make explicit use of SIMD via intrinsics or assembly, or just through autovectorization and/or calling libc methods like memchr that are vectorized under the covers?
Yeah, I was being a bit succinct. As far as I'm aware, GNU grep has no explicit SIMD in it other than memchr through glibc. While some libc implementations rely on auto-vectorization of sorts (musl comes to mind), glibc does have assembly implementations of memchr for several platforms that make explicit use of SIMD.
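To give a flavor of what those implementations boil down to, here's a minimal Rust sketch of the classic SSE2 loop (memchr_sse2 is a made-up name; the real glibc/ripgrep versions add alignment handling, unrolling, and runtime CPU detection):

    // Minimal sketch of the classic SSE2 memchr approach.
    // x86_64 only; SSE2 is baseline there, so no runtime detection needed.
    #[cfg(target_arch = "x86_64")]
    fn memchr_sse2(needle: u8, haystack: &[u8]) -> Option<usize> {
        use std::arch::x86_64::*;
        let mut i = 0;
        unsafe {
            // Broadcast the needle byte into all 16 lanes.
            let n = _mm_set1_epi8(needle as i8);
            while i + 16 <= haystack.len() {
                // Compare 16 haystack bytes against the needle at once.
                let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
                let eq = _mm_cmpeq_epi8(chunk, n);
                // One bit per lane; nonzero means at least one match.
                let mask = _mm_movemask_epi8(eq);
                if mask != 0 {
                    return Some(i + mask.trailing_zeros() as usize);
                }
                i += 16;
            }
        }
        // Scalar fallback for the trailing <16 bytes.
        haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
    }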
ripgrep does the same, except for Intel at least, its memchr is implemented in Rust using SIMD intrinsics explicitly. And it also has a specialized SIMD algorithm (taken from Hyperscan mostly) for dealing with multiple patterns: https://github.com/BurntSushi/aho-corasick/tree/8b479a60906d...
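And if you want that from Rust without writing intrinsics yourself, the memchr crate exposes it directly; a tiny usage sketch (assumes memchr as a dependency):

    // With the memchr crate in Cargo.toml (the same crate ripgrep builds on).
    fn main() {
        let haystack = b"the quick brown fox";
        // Returns the index of the first occurrence, if any.
        assert_eq!(memchr::memchr(b'q', haystack), Some(4));
    }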
Hyperscan takes this to a different level though. It has oodles more SIMD. I should have mentioned it in my original comment.
I was curious mostly because I don't recall ever seeing SIMD intrinsics in GNU code (OK, GIMP probably has them, so maybe I should say GNU utilities), so that would be a first.
It's interesting how much stuff leans on memchr; it's a shame there aren't systematic wider versions that take more bytes to avoid false positives on longer literals (ignoring wmemchr): those could be nice and fast with SIMD.
Yeah I think the wider versions get a lot more complicated. memchr is a bit of a sweet spot, since its implementation is relatively simple. And things like glibc end up implementing specialized versions of it for most architectures _and_ instruction set extensions. (So e.g., there's one for SSE2 and for AVX on x86_64.)
And then of course there's PCMPESTRI (and its variants), but that has largely been a failure because of its high latency. :-( That's a shame, because that instruction does accept substrings up to 16 bytes.
Yeah, I had some kind of brain fart thinking that, say, a 4-character memchr() could be just as fast using the native method, but no, of course it's 4x as slow (only an "aligned" memchr() like wmemchr() works like that). So yeah, it gets quite complicated quickly if you want it to be fast.
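To make the "complicated" part concrete, here's a hedged sketch of the usual trick for longer literals: compare the first and last needle bytes across 16 positions at once and only verify the candidates (memchr's memmem uses a tuned variant of this idea; find_sse2 is purely illustrative):

    // Candidate positions are where both the first and last needle bytes
    // line up; only those need a full comparison. Real implementations
    // pick rarer bytes, handle alignment, and avoid the panicky assert.
    #[cfg(target_arch = "x86_64")]
    fn find_sse2(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        use std::arch::x86_64::*;
        assert!(needle.len() >= 2);
        let end = haystack.len().checked_sub(needle.len())?;
        unsafe {
            let vfirst = _mm_set1_epi8(needle[0] as i8);
            let vlast = _mm_set1_epi8(needle[needle.len() - 1] as i8);
            let mut i = 0;
            while i + 16 <= end + 1 {
                let a = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
                let b = _mm_loadu_si128(
                    haystack.as_ptr().add(i + needle.len() - 1) as *const __m128i);
                let cand = _mm_and_si128(_mm_cmpeq_epi8(a, vfirst),
                                         _mm_cmpeq_epi8(b, vlast));
                let mut mask = _mm_movemask_epi8(cand) as u32;
                while mask != 0 {
                    let off = i + mask.trailing_zeros() as usize;
                    if &haystack[off..off + needle.len()] == needle {
                        return Some(off);
                    }
                    mask &= mask - 1; // clear the lowest set bit
                }
                i += 16;
            }
            // Scalar check for the last few starting positions.
            (i..=end).find(|&j| &haystack[j..j + needle.len()] == needle)
        }
    }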
Raytracing. Ray/BVH traversal, bbox intersection, and ray/primitive intersection can all be SIMD-ified fairly well (very well up to 4-wide, fairly well up to 8-wide, depending on how things are packed).
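To illustrate the 4-wide case, a rough Rust sketch of a slab test for one ray against four AABBs at once (Aabb4/hit4 are made-up names; production BVH code like Embree's additionally handles degenerate directions, NaNs, and tighter packing):

    // Four boxes stored SoA so each SSE lane is one box.
    #[cfg(target_arch = "x86_64")]
    struct Aabb4 {
        min: [[f32; 4]; 3], // min x/y/z for four boxes
        max: [[f32; 4]; 3],
    }

    // Returns a 4-bit mask; bit i set => ray hits box i.
    #[cfg(target_arch = "x86_64")]
    fn hit4(origin: [f32; 3], inv_dir: [f32; 3], boxes: &Aabb4) -> u32 {
        use std::arch::x86_64::*;
        unsafe {
            let mut tmin = _mm_set1_ps(0.0);
            let mut tmax = _mm_set1_ps(f32::INFINITY);
            for axis in 0..3 {
                let o = _mm_set1_ps(origin[axis]);
                let inv = _mm_set1_ps(inv_dir[axis]);
                let lo = _mm_loadu_ps(boxes.min[axis].as_ptr());
                let hi = _mm_loadu_ps(boxes.max[axis].as_ptr());
                // Slab entry/exit distances for all four boxes at once.
                let t0 = _mm_mul_ps(_mm_sub_ps(lo, o), inv);
                let t1 = _mm_mul_ps(_mm_sub_ps(hi, o), inv);
                tmin = _mm_max_ps(tmin, _mm_min_ps(t0, t1));
                tmax = _mm_min_ps(tmax, _mm_max_ps(t0, t1));
            }
            _mm_movemask_ps(_mm_cmple_ps(tmin, tmax)) as u32
        }
    }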
SIMD is big in computer graphics: you'd expect it to be used heavily in physics simulation and CPU-based rendering engines. It would be interesting to see how Blender's Cycles or, say, Arnold Renderer performs on the M1.
It's very common in image- and audio-processing algorithms, and lots of general-purpose libraries use SIMD instruction sets. Even things like memcpy or String.IndexOf are vectorized in modern runtimes. IIRC Facebook released a very carefully tuned hashmap that uses SIMD instructions for many of its search operations.
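That hashmap sounds like folly's F14 (Google's SwissTable does something similar); the core probing trick fits in a few lines, sketched here in Rust (match_group is a made-up name):

    // Each table group keeps one metadata byte per slot. A single
    // 16-byte compare finds every slot whose byte matches the hash
    // fragment h2; the caller then checks only those candidate keys.
    #[cfg(target_arch = "x86_64")]
    fn match_group(ctrl: &[u8; 16], h2: u8) -> u32 {
        use std::arch::x86_64::*;
        unsafe {
            let group = _mm_loadu_si128(ctrl.as_ptr() as *const __m128i);
            let eq = _mm_cmpeq_epi8(group, _mm_set1_epi8(h2 as i8));
            // One bit per candidate slot.
            _mm_movemask_epi8(eq) as u32
        }
    }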
JPEGs (libjpeg-turbo), video decode/encode (libavcodec), encryption/decryption, the ZFS filesystem: any workload that needs high throughput. SIMD is the next generation of computing; without it we would have very, very slow computers. Technologies like HEVC don't work well without SIMD; they're designed around it. Take a look at ffmpeg and see the tremendous amount of hand-written SIMD assembly for various platforms.
All of them? I'd need a source on that. Also, how do you think these algorithms became popular in the first place? If implementing an algorithm around SIMD didn't give proper performance, they wouldn't exist.
You have to understand that current software exists because it solved a goal. If your algorithm first had to be implemented in hardware, nobody could make anything. Without SIMD these projects simply wouldn't be able to exist. That's the hard math of it.
Anything that does number crunching can benefit from SIMD (just look at pretty much any modern compiler output on godbolt, icache be damned).
The tradeoff with using it is basically between instruction density, power (AVX-512), and latency (GPUs are seriously powerful, but getting the data going takes time and a lot of driver bullying).
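To see the godbolt point concretely: both of the loops below compile to packed SIMD with rustc -O (or the gcc/clang -O2 equivalents). Note that a plain float sum usually won't auto-vectorize, since the compiler can't reorder FP additions without fast-math-style flags:

    // Elementwise float ops vectorize directly.
    pub fn scale(v: &mut [f32], k: f32) {
        for x in v.iter_mut() {
            *x *= k;
        }
    }

    // Integer reductions vectorize too (wrapping_mul keeps the
    // semantics identical in debug and release builds).
    pub fn sum_squares(v: &[i32]) -> i32 {
        v.iter().map(|&x| x.wrapping_mul(x)).sum()
    }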
If neural net inference is a small part of a larger computation, doing it locally (in cache) on the CPU (with AVX-512 instructions) can be a big win (for example, https://NN-512.com)