
I mean, why isn't multithreading the answer?



Multithreading has the following disadvantages:

* Overhead: starting and managing threads carries a considerable cost in practice. Many multithreaded algorithms are in fact slower than an optimized serial implementation when run with n_threads=1 (see the sketch after this list).

* Communication: threads have to communicate with each other and synchronize access to shared resources. "Embarrassingly parallel" problems don't require synchronization, but many interesting problems are not of that kind.

* Amdahl's law: parallelizing an application hits diminishing returns, since it quite likely contains parts that are not easily parallelized. If a fraction s of the runtime is inherently serial, the speedup on N threads is at most 1/(s + (1-s)/N), which approaches 1/s: with 5% serial work you never beat 20x, no matter how many threads you add.
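
To make the overhead point concrete, here is a minimal toy benchmark of my own (an illustrative sketch, not anything from the thread): a serial sum versus a two-thread split. On a small input the thread creation/join cost typically dominates, so the "parallel" version often loses. Compile with g++ -O2 -pthread.

    // Toy benchmark (illustrative sketch): serial sum vs. two-thread sum.
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<uint64_t> data(1 << 16, 1);  // small input: overhead-dominated

        auto t0 = std::chrono::steady_clock::now();
        uint64_t serial = std::accumulate(data.begin(), data.end(), uint64_t{0});
        auto t1 = std::chrono::steady_clock::now();

        // Two threads, each summing its own half; private partial sums mean
        // no synchronization is needed beyond the final join.
        uint64_t part[2] = {0, 0};
        auto mid = data.begin() + data.size() / 2;
        std::thread a([&] { part[0] = std::accumulate(data.begin(), mid, uint64_t{0}); });
        std::thread b([&] { part[1] = std::accumulate(mid, data.end(), uint64_t{0}); });
        a.join();
        b.join();
        auto t2 = std::chrono::steady_clock::now();

        std::cout << "serial:   " << serial << " in "
                  << std::chrono::duration<double, std::micro>(t1 - t0).count() << " us\n"
                  << "threaded: " << part[0] + part[1] << " in "
                  << std::chrono::duration<double, std::micro>(t2 - t1).count() << " us\n";
    }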


You forgot Wirth's law: increasing software complexity compensates for the increase in hardware speed.


Well, yes. But assuming the question is how to reduce overall latency, what alternative is there besides algorithmic improvements or increasing clock frequency?


Recently, SIMD has increasingly been used to exploit the wider vector units in CPUs. Sure, it's not as easy and straightforward to use as multithreading (I never expected to say that about multithreading), but libraries and programming languages are making more and more use of it.
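
As a concrete illustration (my own sketch, assuming an x86-64 CPU with AVX; compile with -mavx), here is a float array addition written with intrinsics, where one instruction processes 8 floats at a time:

    // Illustrative sketch: adding two float arrays with AVX intrinsics.
    #include <immintrin.h>
    #include <cstddef>
    #include <cstdio>

    void add_avx(const float* a, const float* b, float* out, std::size_t n) {
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats (unaligned ok)
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  // 8 adds at once
        }
        for (; i < n; ++i)                        // scalar tail for leftovers
            out[i] = a[i] + b[i];
    }

    int main() {
        float a[10], b[10], out[10];
        for (int i = 0; i < 10; ++i) { a[i] = float(i); b[i] = 2.0f * i; }
        add_avx(a, b, out, 10);
        std::printf("%g %g\n", out[0], out[9]);   // prints: 0 27
    }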

Edit: latency is difficult. But accelerating CPUs and using GPUs for compute was never about latency. Most I/O bottlenecks exist because CPUs have sped up so much that they left the rest of the platform in the dust. Much of it also comes down to fundamental limits imposed by the speed of light. Increasing throughput is always easier than reducing latency.


I guess I was thinking of multithreading more as an umbrella term for parallelism, but I see your point. If you squint, SIMD and multithreading are the same problem. What I mean is: if you have a computable function you want to evaluate on a finite number of bits, the problem of how to physically lay out and schedule the computation to minimize latency is very similar to that of maximizing throughput.

From a circuit complexity standpoint, you could evaluate the function with a wide but shallow circuit in a nanosecond, or with a deep but narrow circuit that takes eons. Whether parts of that circuit are evaluated synchronously or asynchronously is immaterial, although synchronous computation does seem easier to reason about and friendlier from a programmability standpoint.
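
The width/depth tradeoff is easy to see for a reduction like summing n numbers (a toy sketch of my own): a balanced tree of adders has depth about log2(n) with up to n/2 adders active per level, while a chain of adders has width 1 and depth n-1.

    // Illustrative sketch: depth of a "wide but shallow" tree reduction vs.
    // a "deep but narrow" chain, for summing n values.
    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    int tree_depth(std::vector<int> v) {           // wide but shallow
        int depth = 0;
        while (v.size() > 1) {
            std::vector<int> next;
            for (std::size_t i = 0; i + 1 < v.size(); i += 2)
                next.push_back(v[i] + v[i + 1]);   // all pairs on a level are independent
            if (v.size() % 2) next.push_back(v.back());
            v = std::move(next);
            ++depth;                               // one circuit level per pass
        }
        return depth;                              // ~ceil(log2(n))
    }

    int main() {
        std::vector<int> v(1024, 1);
        std::cout << "tree depth:  " << tree_depth(v) << "\n"   // 10
                  << "chain depth: " << v.size() - 1 << "\n";   // 1023
    }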

I agree with you that the fundamental limitation is the speed of light if you are width-bounded (i.e., if physical space is the dominating constraint).



