
Compared to GPU programming, the gains from SIMD are limited, but it's a small-multiple boost that's available pretty much everywhere. C# makes it easy to use through the Vector classes. WASM SIMD still has a way to go, but even with the current 128-bit width you can see dramatic improvements in some buffer-processing cases (I did a little comparison demo here showing a 20x improvement in bitwise complement of a large buffer: https://www.jasonthorsness.com/2)
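A minimal sketch of the kind of loop involved, using System.Numerics (illustrative only, not the actual code from the demo):

    using System;
    using System.Numerics;

    static class ComplementSketch
    {
        // Bitwise complement of a buffer, Vector<byte>.Count bytes at a time,
        // with a scalar loop for the tail.
        static void Complement(Span<byte> data)
        {
            int width = Vector<byte>.Count; // 16, 32, or 64 depending on hardware
            int i = 0;
            for (; i <= data.Length - width; i += width)
            {
                var v = new Vector<byte>(data.Slice(i));
                (~v).CopyTo(data.Slice(i));
            }
            for (; i < data.Length; i++)
                data[i] = (byte)~data[i];
        }
    }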


> a small-multiple boost

Quick reminder that a 20x boost is better than going from O(n) to O(log n) for up to a million items. And log n algorithms are often simply not possible for many problems.


Am I getting the math wrong here? Going from O(n) to O(log n) (with no change in constant factor) for a million items would be going from ~1000000c ops to ~20c ops, which would be a 50000x improvement?


Big O notation omits constant factors, which tend to be significantly larger for log-n algorithms.

I think he was speaking from personal experience.


Could be. But very poorly stated if so.

Anyway, I do not think such a statement can remotely be true even "typically". It is 2 orders of magnitude away (20 to 5000).


True, but it wouldn’t be unthinkable. Especially if the O(n) algorithm accesses data sequentially and the O(log n) has indirections.

But maybe the author simply made it up.


You are all right. I was wrong. Should not post math just before going to sleep… So, let’s make the same mistake first thing in the morning:

x = 20·log2(x) at x ≈ 143, i.e. a 20x-boosted linear scan (x/20 steps) only beats a log2(x) algorithm for up to about 143 items, not a million.

https://www.wolframalpha.com/input?i=x%3D20*log2%28x%29
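
A quick numeric check of that crossover, assuming a hypothetical 20x-boosted linear scan (n/20 steps) against an otherwise equal-cost log2(n) algorithm:

    using System;

    static class CrossoverSketch
    {
        static void Main()
        {
            // Largest n where the boosted linear scan still beats log2(n).
            long last = 0;
            for (long n = 2; n <= 1_000_000; n++)
                if (n / 20.0 < Math.Log2(n))
                    last = n;
            Console.WriteLine(last); // 143
        }
    }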


The WASM folks should just include an arbitrary-length vector compute extension. We should also explore automatically compiling WASM to GPU compute where appropriate; the hardware independence makes it a rather natural fit for that.


Aren't they loosely modeled on RISC-V? RISC-V has an arbitrary-length vector extension, and so did the Cray machines. They seem like a good idea. In practice, I usually structure intermediate data products in memory with the largest SIMD size I expect the vectorizer will ever optimize for, so compiling it for something longer won't get me anything.
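
For example, something along these lines (a sketch; AllocatePadded is a made-up helper and Vector<byte>.Count stands in for "the largest SIMD size I expect"):

    using System;
    using System.Numerics;

    static class PaddedBufferSketch
    {
        // Round the allocation up to a multiple of the widest SIMD width the
        // code (or the auto-vectorizer) is expected to use, so inner loops
        // never need a scalar tail.
        static byte[] AllocatePadded(int logicalLength)
        {
            int width = Vector<byte>.Count; // e.g. 16, 32, or 64 bytes
            int padded = (logicalLength + width - 1) / width * width;
            return new byte[padded];
        }
    }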


You mean the gains from SIMD on the CPU, because all GPU programs use even wider SIMD than CPUs do, with widths of 1024 or 2048 bits, for even greater die-area and power-consumption savings in comparison with non-SIMD.

The width of the SIMD instructions is not visible when programming with NVIDIA CUDA or with the similar compilers for Intel CPUs (ispc or oneAPI with SYCL targeting CPUs or OpenMP with appropriate pragmas), but only because the compiler takes care of that.
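
A rough analogy from the C# side mentioned upthread: code written against Vector<T> never names the width either; Vector<float>.Count is resolved per machine (only an analogy, not how CUDA or ispc work internally):

    using System;
    using System.Numerics;

    static class WidthAgnosticSketch
    {
        // The same source runs as 4-, 8-, or 16-wide float SIMD depending on
        // the hardware; the width never appears in the code.
        static void Scale(Span<float> data, float factor)
        {
            int i = 0;
            for (; i <= data.Length - Vector<float>.Count; i += Vector<float>.Count)
                (new Vector<float>(data.Slice(i)) * factor).CopyTo(data.Slice(i));
            for (; i < data.Length; i++)
                data[i] *= factor;
        }
    }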


I merged a few PRs to SIMD-optimize the Wasm WASI libc, but it all got stalled on str(c)spn (which is slightly more sophisticated than the rest).

There wasn't much appetite for any of it on Emscripten.

https://github.com/WebAssembly/wasi-libc/pulls?q=is%3Apr+opt...
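
For context, a hypothetical scalar sketch of what strcspn computes (not the wasi-libc code; the PRs vectorize this kind of scan):

    using System;

    static class StrCSpnSketch
    {
        // Length of the initial segment of s containing no byte from reject.
        // (C's strcspn also stops at the terminating NUL; omitted here.)
        static int CSpn(ReadOnlySpan<byte> s, ReadOnlySpan<byte> reject)
        {
            var table = new bool[256];
            foreach (byte c in reject) table[c] = true;
            for (int i = 0; i < s.Length; i++)
                if (table[s[i]]) return i;
            return s.Length;
        }
    }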


Subscribed to the str(c)spn thread for the eventual explanation of why the non-SIMD version seemed to give the wrong answer.


Very likely the issue is with the test infrastructure and not with the code under test.

My own fuzzing doesn't report any inconsistencies. But fuzzing is always necessarily incomplete.
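
Roughly the kind of differential check meant here, with placeholder delegates standing in for the scalar and SIMD versions (a sketch, not the actual harness):

    using System;

    static class DifferentialFuzzSketch
    {
        // Run both implementations on random inputs; return the first
        // mismatching iteration, or -1 if none was found.
        static int Run(Func<byte[], byte[], int> reference,
                       Func<byte[], byte[], int> candidate,
                       int iterations)
        {
            var rng = new Random(12345);
            for (int i = 0; i < iterations; i++)
            {
                var s = new byte[rng.Next(0, 256)];
                var set = new byte[rng.Next(0, 8)];
                rng.NextBytes(s);
                rng.NextBytes(set);
                if (reference(s, set) != candidate(s, set))
                    return i;
            }
            return -1; // no inconsistencies found, but coverage is still incomplete
        }
    }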


The high arithmetic bandwidth on GPUs is of course SIMD-based as well. They just tend to have an ISPC-style compilation model that doesn't expose the SIMD lanes in the source code. (Whereas on the CPU side, even after decades, SIMD is only lightly utilized by compilers.)


It's SIMD-based at the lowest level, but there's also very high hardware multithreading (the threads are called, AIUI, "wavefronts" or "warps") on each compute unit/stream processor to hide memory-access latency. Recent SPARC CPUs have 8-way hardware multithreading on an individual core; GPUs can easily go even higher than that.


Yep, this also reflects GPUs being designed for much larger working sets, so they have higher main-memory bandwidth and rely less on caches. CPUs would rather have a few fast threads of execution working on hot cached data than many slow ones talking to main memory (because N-way thread-level parallelism often splits your cache N ways, into N working sets).





