Exploring SIMD performance improvements in WebAssembly (2021)

fwsgonzo · on Feb 2, 2022

I compared against native:

  #define ITERATIONS 1000

  int main()
  {
    const size_t BUFFER_SIZE = 64ul \* 1024 \* 1024;
    __m128i\* data_buffer = (__m128i *)memalign(64, BUFFER_SIZE);

    const __m128i all_ones = _mm_set1_epi8(0xFF);

    for (size_t i = 0; i < ITERATIONS; i++)
    {
      __m128i* data = data_buffer;
      for (size_t b = 0; b < BUFFER_SIZE;) {
        _mm_stream_si128(&data[0], all_ones);
        _mm_stream_si128(&data[1], all_ones);
        _mm_stream_si128(&data[2], all_ones);
        _mm_stream_si128(&data[3], all_ones);
        data += 4;
        b += 16 * 4;
      }
    }
  }


  $ time ./fill_buffer.elf 
  real 0m1,832s


  $ time ./wasmer fill_buffer.wasm -i fillBufferWithSIMD 1000
  real 0m4,237s

I had to fixup the WAT because set_local and get_local don't exist anymore. They are called local.get and local.set now.

At higher number of iterations the C version converges on about 1.7 seconds per 1000, while the WASM version seems to remain the same at 4.2 secs per 1000. This leaves native 2.5x faster for this particular operation, on my machine.

syrusakbary · on Feb 3, 2022

Hi, I'm Syrus from Wasmer.

Have you tried with the llvm backend? I believe the results might be even better there!

  $ time ./wasmer run --llvm fill_buffer.wasm -i fillBufferWithSIMD 1000

fwsgonzo · on Feb 3, 2022

$ time ./wasmer fill_buffer.wasm --llvm -i fillBufferWithSIMD 1000

real 0m5,233s

It seems to be a 25% longer run-time with LLVM backend. I did not know about that option! Very interesting.

syrusakbary · on Feb 3, 2022

It's strange that llvm runs slower (perhaps the of the 5s some of that is spent being compiled).

Could you share the fill_buffer.wasm (or the WAT file) so I can do some tests? Thanks!

fwsgonzo · on Feb 4, 2022

Here you go: https://gist.github.com/fwsGonzo/2968cf0bc3364eb1ff0e0500569...

Hopefully I did not mess anything up!

jcelerier · on Feb 2, 2022

very depressing when for so many use cases even native performance is very very very much not fast enough

kevingadd · on Feb 3, 2022

In many of these use cases native is only "not fast enough" because the code you're running makes very poor use of the cache, pipelining, simd instruction sets, and memory bandwidth

Dylan16807 · on Feb 2, 2022

> when

If native performance is "very very very" not fast enough then that's supercomputer work and it doesn't really matter if WASM is 3x native or 0.3x native. So that context should be where you're the least depressed.

jcelerier · on Feb 3, 2022

> then that's supercomputer work

today's laptop work is late 90's supercomputer's work (and it was even more depressing back then).

Dylan16807 · on Feb 3, 2022

And today's supercomputer work was impossible in the late 90's.

That doesn't really change my argument. When you're looking at languages that are used for small tasks today, their speed doesn't have much relevance to how vastly bigger tasks are accomplished. And by the time those tasks can be run on a laptop, WASM implementations are going to be much better and we still might not be using it at all for those larger tasks.

hackthesystem · on Feb 2, 2022

Thanks for mentioning, I just updated my website and Github to use the new WAT functions.

akireu · on Feb 2, 2022

It looks promising! But fixed-width lanes don't seem too cross-platform? I don't just mean the v256 and v512 types that may become ubiquitous in a few years, but also things like optimizing for different L1 cache sizes, doing some operation macro-fusion on the SIMD unit, or directly supporting leading/trailing elements to reduce code size?

kevingadd · on Feb 2, 2022

In practice it's not possible to optimize "generally" for all possible target architectures your wasm will run on. You're going to optimize for x86-64 or ARM, and probably going to specifically optimize for modern intel, modern amd, or apple's m1. If you try to optimize for everything you're going to run into really painful tradeoffs and probably have mediocre performance on a bunch of architectures after a lot of hard work.

skywal_l · on Feb 3, 2022

Wouldn't it be possible to have a binary containing multiple versions of your program compiled optimized for various CPU configuration and have a switch at runtime which would select depending on your CPUid. I think intel have a compiler for that.

janwas · on Feb 4, 2022

Yes, our github.com/google/highway does that for SSE4/AVX2/AVX-512. It targets at the level of instruction sets, though, not specific microarchitectures.

akireu · on Feb 2, 2022

Why not? Fixed-size SIMD architectures use mostly the same operations, so if you target SSE2 initially, the code should run just fine on NEON. A runtime that ships a JIT compiler also has the unique opportunity to further optimize SIMD code by using more lanes or limiting the working set to the host platform's L1 cache size. Even the AOT compilers like GCC or clang emulate platform-specific intrinsics using generic vector ones. This should count for something, no?

TinkersW · on Feb 2, 2022

They are similar but not the same, for instance SSE has movemask, but NEON does not, so it gets emulated(slowly) when targeting that platform. The cross lane ops are different enough that you might need to rewrite for other platforms. And then you run into situations where an instruction is very fast on one architecture but horribly slow on another because its basically emulated.

akireu · on Feb 2, 2022

This isn't really relevant to wasm, though. You can't expect it to support platform-specific hacks just for SIMD, so you'll have to make do with the lowest common denominator anyway.

hackthesystem · on Feb 2, 2022

Thanks! Good points, I think in general the fixed-width "packed" SIMD ISAs have the downsides that you mentioned.

But it seems that WebAssembly doesn't have length-agnostic SIMD instructions yet. There is an open proposal to add this though: https://github.com/WebAssembly/flexible-vectors

jedisct1 · on Feb 3, 2022

To compile code to WebAssembly with SIMD instructions, just use the generic+simd128 CPU target:

    zig cc  -Ofast --target=wasm32-wasi -mcpu=generic+simd128 example.c
    zig c++ -Ofast --target=wasm32-wasi -mcpu=generic+simd128 example.cpp

Or with a build file:

    zig build -Drelease-fast -Dtarget=wasm32-wasi -Dcpu=generic+simd128

akireu · on Feb 2, 2022

Did a similar test in plain C: https://godbolt.org/z/ffYcWhxz3 It's not quite the same: I've used an increment instead of zeroing, otherwise the entire benchmark gets optimized away. Still got just about the same result (3.7x speedup for 100 iterations), so wasm did good there. Actually, now that I think of it, SIMD code performance probably depends on good register allocation more than on any optimization.

nyanpasu64 · on Feb 2, 2022

I wonder if the 4x performance boost of 128-bit SIMD over 32-bit would drop to 2x if WebAssembly added 64-bit registers/instructions.

hackthesystem · on Feb 2, 2022

I actually tried comparing 128-bit SIMD to the 64-bit performance and the difference was 2x. I only published the results for the 4x comparison, but it should be pretty easy to reproduce if you change the types in the non-SIMD code[1] from i32 -> i64.

[1] https://github.com/awelm/simd-wasm-profiling/blob/master/fil...

nynx · on Feb 2, 2022

Webassembly has 64-bit locals and instructions.