Compared to GPU programming, the gains from SIMD are limited, but it's a small-multiple boost that's available pretty much everywhere. C# makes it easy to use through its Vector classes. WASM SIMD still has a way to go, but even with the current 128-bit width you can see dramatic improvements in some buffer-processing cases (I did a little comparison demo showing a 20x improvement in bitwise complement of a large buffer: https://www.jasonthorsness.com/2).
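For flavor, here's a minimal C sketch of that kind of kernel (not the linked demo's code, which is C#/WASM): a scalar complement loop next to a 128-bit WASM SIMD version using clang/Emscripten's wasm_simd128.h intrinsics.

```c
// Sketch only: scalar vs. 128-bit WASM SIMD bitwise complement of a buffer.
// Assumes clang/Emscripten with -msimd128; not the code from the linked demo.
#include <stddef.h>
#include <stdint.h>
#include <wasm_simd128.h>

// Scalar baseline: one byte per iteration.
void complement_scalar(uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        buf[i] = (uint8_t)~buf[i];
    }
}

// SIMD version: 16 bytes per iteration via v128.not.
void complement_simd(uint8_t *buf, size_t len) {
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        v128_t v = wasm_v128_load(buf + i);
        wasm_v128_store(buf + i, wasm_v128_not(v));
    }
    for (; i < len; i++) {  // tail that doesn't fill a full vector
        buf[i] = (uint8_t)~buf[i];
    }
}
```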
Quick reminder that a 20x boost is better than going from O(n) to O(log n) for up to a million items. And O(log n) algorithms are often simply not possible for many problems.
Am I getting the math wrong here? Going from O(n) to O(log n) (with no change in constant factor) for a million items would be going from ~1,000,000c ops to ~20c ops, which would be a 50,000x improvement?
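For reference, the arithmetic behind that comparison, ignoring constant factors:

\[
\log_2 10^6 \approx 20, \qquad \frac{10^6}{\log_2 10^6} \approx 5 \times 10^4 \gg 20.
\]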
The WASM folks should just include an arbitrary-length vector compute extension. We should also explore automatically compiling WASM to GPU compute where appropriate; the hardware independence makes it a rather natural fit for that.
Aren't they loosely modeled on RISC-V? RISC-V has an arbitrary-length vector extension. So did Cray. They seem like a good idea. In practice, I usually structure intermediate data products in memory with the largest SIMD size I expect the vectorizer will ever optimize for, so compiling for something longer won't get me anything.
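To illustrate that layout habit, here's a hypothetical C sketch that pads and aligns buffers to an assumed widest target of 512 bits (64 bytes); the width and helper name are my own, not from the comment.

```c
// Hypothetical sketch: size and align intermediate buffers to the widest SIMD
// width the vectorizer might ever target (assumed 64 bytes = 512 bits here),
// so recompiling for a wider target doesn't force a new data layout.
#include <stdlib.h>
#include <stdint.h>

#define MAX_SIMD_BYTES 64  /* assumed widest target, e.g. AVX-512 */

static float *alloc_padded(size_t n_elems) {
    size_t bytes  = n_elems * sizeof(float);
    size_t padded = (bytes + MAX_SIMD_BYTES - 1) / MAX_SIMD_BYTES * MAX_SIMD_BYTES;
    /* C11 aligned_alloc; padded size is a multiple of the alignment.
       Caller releases with free() as usual. */
    return aligned_alloc(MAX_SIMD_BYTES, padded);
}
```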
You mean the gains from SIMD on the CPU, because all GPU programs use even wider SIMD than the CPUs, with widths of 1024 or 2048 bits, for even greater die-area and power-consumption savings in comparison with non-SIMD.
The width of the SIMD instructions is not visible when programming with NVIDIA CUDA or with the similar compilers for Intel CPUs (ispc, or oneAPI with SYCL targeting CPUs, or OpenMP with the appropriate pragmas), but only because the compiler takes care of that.
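As a small illustration of the CPU side of that (a sketch, not from the thread): with OpenMP's simd pragma the source stays per-element and the compiler chooses the lane width for the target.

```c
// Sketch: the loop body is written per element; `#pragma omp simd`
// (OpenMP 4.0+, e.g. gcc/clang with -fopenmp-simd) asks the compiler to map
// iterations onto SIMD lanes of whatever width the target supports.
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```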
The high arithmetic bandwidth on GPUs is of course SIMD-based as well. They just tend to have an ISPC-style compilation model that doesn't expose the SIMD lanes in the source code. (Whereas SIMD, even after decades, is still only lightly utilized by compilers on the CPU side.)
It's SIMD-based at the lowest level, but there's also the use of very high hardware multithreading (the threads are called, AIUI, "wavefronts" or "warps") on each compute unit/stream processor to hide memory access latency. Recent SPARC CPUs have 8-way hardware multithreading on the individual CPU core; GPUs can easily go even higher than that.
Yep, this also reflects the design target: GPUs aim at much larger working sets, so they have higher main-memory bandwidth and rely less on caches. CPUs would rather have a few fast threads of execution working on hot cached data than many slow ones talking to main memory (because N-way thread-level parallelism often splits your cache N ways, into N working sets).