
Compared to GPU programming, the gains from SIMD are limited, but it's a small-multiple boost that's available pretty much everywhere. C# makes it easy to use through the Vector classes. WASM SIMD still has a way to go, but even with the current 128-bit width you can see dramatic improvements in some buffer-processing cases (I did a little comparison demo here showing a 20x improvement in bitwise complement of a large buffer: https://www.jasonthorsness.com/2)
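A minimal sketch of the kind of loop involved, using System.Numerics (illustrative only, not the actual code from the demo):

    using System;
    using System.Numerics;

    static class ComplementSketch
    {
        // Bitwise complement of a buffer, Vector<byte>.Count bytes at a time,
        // with a scalar loop for the tail.
        static void Complement(Span<byte> data)
        {
            int width = Vector<byte>.Count; // 16, 32, or 64 depending on hardware
            int i = 0;
            for (; i <= data.Length - width; i += width)
            {
                var v = new Vector<byte>(data.Slice(i));
                (~v).CopyTo(data.Slice(i));
            }
            for (; i < data.Length; i++)
                data[i] = (byte)~data[i];
        }
    }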


> a small-multiple boost

Quick reminder that a 20x boost is better than going from O(n) to O(log n) for up to a million items. And log n algorithms are often simply not possible for many problems.


Am I getting the math wrong here? Going from O(n) to O(log n) (with no change in constant factor) for a million items would be going from ~1000000c ops to ~20c ops, which would be a 50000x improvement?


Big O notation omits constant factors, which tend to be significantly larger for log-n algorithms.

I think he was speaking from personal experience.


Could be. But very poorly stated if so.

Anyway, I do not think such a statement can remotely be true even "typically". It is 2 orders of magnitude away (20 to 5000).


True, but it wouldn’t be unthinkable. Especially if the O(n) algorithm accesses data sequentially and the O(log n) has indirections.

But maybe the author simply made it up.


You are all right. I was wrong. Should not post math just before going to sleep… So, let’s make the same mistake first thing in the morning:

x = 20·log2(x) at x ≈ 143, i.e. a 20x-boosted linear scan (x/20 steps) only beats a log2(x) algorithm for up to about 143 items, not a million.

https://www.wolframalpha.com/input?i=x%3D20*log2%28x%29
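
A quick numeric check of that crossover, assuming a hypothetical 20x-boosted linear scan (n/20 steps) against an otherwise equal-cost log2(n) algorithm:

    using System;

    static class CrossoverSketch
    {
        static void Main()
        {
            // Largest n where the boosted linear scan still beats log2(n).
            long last = 0;
            for (long n = 2; n <= 1_000_000; n++)
                if (n / 20.0 < Math.Log2(n))
                    last = n;
            Console.WriteLine(last); // 143
        }
    }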


The WASM folks should just include an arbitrary-length vector compute extension. We should also explore automatically compiling WASM to GPU compute where appropriate; the hardware independence makes it a rather natural fit for that.


Aren't they loosely modeled on RISC-V? RISC-V has an arbitrary-length vector extension, and so did the Cray machines. They seem like a good idea. In practice, I usually structure intermediate data products in memory with the largest SIMD size I expect the vectorizer will ever optimize for, so compiling it for something longer won't get me anything.
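
For example, something along these lines (a sketch; AllocatePadded is a made-up helper and Vector<byte>.Count stands in for "the largest SIMD size I expect"):

    using System;
    using System.Numerics;

    static class PaddedBufferSketch
    {
        // Round the allocation up to a multiple of the widest SIMD width the
        // code (or the auto-vectorizer) is expected to use, so inner loops
        // never need a scalar tail.
        static byte[] AllocatePadded(int logicalLength)
        {
            int width = Vector<byte>.Count; // e.g. 16, 32, or 64 bytes
            int padded = (logicalLength + width - 1) / width * width;
            return new byte[padded];
        }
    }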


You mean the gains from SIMD on the CPU, because all GPU programs use even wider SIMD than CPUs do, with widths of 1024 or 2048 bits, for even greater die-area and power-consumption savings in comparison with non-SIMD.

The width of the SIMD instructions is not visible when programming with NVIDIA CUDA or with the similar compilers for Intel CPUs (ispc or oneAPI with SYCL targeting CPUs or OpenMP with appropriate pragmas), but only because the compiler takes care of that.
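
A rough analogy from the C# side mentioned upthread: code written against Vector<T> never names the width either; Vector<float>.Count is resolved per machine (only an analogy, not how CUDA or ispc work internally):

    using System;
    using System.Numerics;

    static class WidthAgnosticSketch
    {
        // The same source runs as 4-, 8-, or 16-wide float SIMD depending on
        // the hardware; the width never appears in the code.
        static void Scale(Span<float> data, float factor)
        {
            int i = 0;
            for (; i <= data.Length - Vector<float>.Count; i += Vector<float>.Count)
                (new Vector<float>(data.Slice(i)) * factor).CopyTo(data.Slice(i));
            for (; i < data.Length; i++)
                data[i] *= factor;
        }
    }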


I merged a few PRs to SIMD-optimize the Wasm WASI libc, but it all got stalled on str(c)spn (which is slightly more sophisticated than the rest).

There wasn't much appetite for any of it on Emscripten.

https://github.com/WebAssembly/wasi-libc/pulls?q=is%3Apr+opt...
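
For context, a hypothetical scalar sketch of what strcspn computes (not the wasi-libc code; the PRs vectorize this kind of scan):

    using System;

    static class StrCSpnSketch
    {
        // Length of the initial segment of s containing no byte from reject.
        // (C's strcspn also stops at the terminating NUL; omitted here.)
        static int CSpn(ReadOnlySpan<byte> s, ReadOnlySpan<byte> reject)
        {
            var table = new bool[256];
            foreach (byte c in reject) table[c] = true;
            for (int i = 0; i < s.Length; i++)
                if (table[s[i]]) return i;
            return s.Length;
        }
    }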


Subscribed to the str(c)spn thread for the eventual explanation of why the non-SIMD version seemed to give the wrong answer.


Very likely the issue is with the test infrastructure and not with the code under test.

My own fuzzing doesn't report any inconsistencies. But fuzzing is always necessarily incomplete.
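
Roughly the kind of differential check meant here, with placeholder delegates standing in for the scalar and SIMD versions (a sketch, not the actual harness):

    using System;

    static class DifferentialFuzzSketch
    {
        // Run both implementations on random inputs; return the first
        // mismatching iteration, or -1 if none was found.
        static int Run(Func<byte[], byte[], int> reference,
                       Func<byte[], byte[], int> candidate,
                       int iterations)
        {
            var rng = new Random(12345);
            for (int i = 0; i < iterations; i++)
            {
                var s = new byte[rng.Next(0, 256)];
                var set = new byte[rng.Next(0, 8)];
                rng.NextBytes(s);
                rng.NextBytes(set);
                if (reference(s, set) != candidate(s, set))
                    return i;
            }
            return -1; // no inconsistencies found, but coverage is still incomplete
        }
    }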


The high arithmetic bandwidth on GPUs is of course SIMD-based as well. They just tend to have an ISPC-style compilation model that doesn't expose the SIMD lanes in the source code. (Whereas on the CPU side, even after decades, SIMD is only lightly utilized by compilers.)


It's SIMD-based at the lowest level, but there's also very high hardware multithreading (the threads are called, AIUI, "wavefronts" or "warps") on each compute unit/stream processor to hide memory-access latency. Recent SPARC CPUs have 8-way hardware multithreading on an individual core; GPUs can easily go even higher than that.


Yep, this also reflects GPUs being designed for much larger working sets, so they have higher main-memory bandwidth and rely less on caches. CPUs would rather have a few fast threads of execution working on hot cached data than many slow ones talking to main memory (because N-way thread-level parallelism often splits your cache N ways, into N working sets).





