
Intel’s next-generation Arrow Lake CPUs are supposed to remove hyperthreading (i.e. SMT) completely.

The performance gains were always heavily application-dependent, so maybe it’s better to simplify.

Here’s a recent discussion of when and where it makes sense: https://news.ycombinator.com/item?id=39097124




Most programs end up with some limit on the number of threads they can reasonably use. When you have far fewer cores than that, SMT makes a lot of sense to better utilise the resources of the CPU. However, once you get to the point where you have enough cores, SMT no longer makes any sense. I am not convinced we are necessarily there yet, but the P/E cores Intel is using are an alternative route to a similar goal and make a lot of sense on the desktop, given how many workloads are single- or low-threaded. I can see the value in not having to deal with both SMT and E-core distinctions in application optimisation.

AMD, on the other hand, intends to keep mostly homogeneous cores for now and continue to use SMT. I doubt it's going to be simple to work out which strategy is best in practice; it's going to vary widely by application.


It is my understanding that SMT should be beneficial regardless of core count, as SMT should enable two threads that stall waiting for memory fetches to fully utilize a single ALU. That is, SMT improves ALU utilization in memory-bound applications with multiple threads by interleaving ALU usage while each thread is waiting on memory. Maybe larger caches are reducing the benefits of SMT, but it should be beneficial as long as there are many threads that are generally bound by memory latency.
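A minimal sketch of that kind of latency-bound workload (the buffer size, step count, and the single-cycle construction are my own assumptions, nothing from the comment above): each thread chases a randomized pointer chain through a buffer far larger than the caches, so nearly every hop is a load the core has to wait on, which is exactly the slack a second hardware thread could fill.

    // Build: g++ -O2 -std=c++17 chase.cpp -o chase -pthread
    // Usage: ./chase <thread count>
    #include <chrono>
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <numeric>
    #include <random>
    #include <string>
    #include <thread>
    #include <vector>

    int main(int argc, char** argv) {
        // 64M entries * 8 bytes = 512 MiB, far bigger than any cache (size is an assumption).
        const std::size_t N = std::size_t{1} << 26;
        const unsigned threads = argc > 1 ? static_cast<unsigned>(std::stoul(argv[1]))
                                          : std::thread::hardware_concurrency();

        // Build one big random cycle (Sattolo's algorithm) so every hop is a
        // data-dependent load that almost always misses the caches.
        // Building the cycle takes a few seconds.
        std::vector<std::uint64_t> next(N);
        std::iota(next.begin(), next.end(), std::uint64_t{0});
        std::mt19937_64 rng{42};
        for (std::size_t i = N - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        auto chase = [&next](std::uint64_t start, std::uint64_t steps, std::uint64_t& sink) {
            std::uint64_t i = start;
            for (std::uint64_t s = 0; s < steps; ++s)
                i = next[i];                    // the ALU sits idle while this load is in flight
            sink = i;                           // keep the result live
        };

        const std::uint64_t steps = 10'000'000;
        std::vector<std::uint64_t> sinks(threads);
        std::vector<std::thread> pool;
        const auto t0 = std::chrono::steady_clock::now();
        for (unsigned t = 0; t < threads; ++t)
            pool.emplace_back(chase, (std::uint64_t{7919} * t) % N, steps, std::ref(sinks[t]));
        for (auto& th : pool) th.join();
        const auto t1 = std::chrono::steady_clock::now();

        std::cout << threads << " threads: "
                  << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    }

Comparing a run at the physical core count against one at twice that (so the extra threads land on SMT siblings) gives a rough idea of how much of the stall time the sibling threads can reclaim.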


In a CPU with many cores, when some cores stall waiting for memory loads, other cores can proceed using data from their own caches, and this is even more likely to happen than for SMT threads, which share the same cache.

When there are enough cores, they will keep the common memory interface busy all the time, so adding SMT is unlikely to increase performance in a memory-throughput-limited application.

Keeping all the ALUs busy in a compute-limited application can usually be done well enough by out-of-order execution, because modern CPUs have very large execution windows from which to choose instructions to execute.

So when there already are many cores, SMT may in many cases provide negligible advantages. On servers there are many more opportunities for SMT to improve efficiency, but on non-server computers I have encountered only one widespread application for which SMT is clearly beneficial: the compilation of big software projects (i.e. those with thousands of source files).

The big cores of Intel are optimized for single-thread performance. This optimization criterion results in bad multi-threaded performance, because MT performance is limited by the maximum permissible chip area and by the maximum possible power consumption, and a big core has very poor performance-per-area and performance-per-power ratios.

Adding SMT to such a big core improves multi-threaded performance, but it is not the best way of improving it: in the area and power budget of one big core you can implement 3 to 5 efficient cores, so replacing a big core with multiple efficient cores increases multi-threaded performance far more than adding SMT does. So unlike in a CPU that uses only big cores, in a hybrid CPU SMT does not make sense: better MT performance is obtained by keeping only a few big cores, to provide high single-thread performance, and replacing the other big cores with smaller, more efficient cores.
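To put rough numbers on that argument - the area ratio, the SMT uplift, and the E-core throughput below are purely illustrative assumptions, not vendor figures or anything claimed above - a compile-time sketch:

    // Illustrative, assumed ratios only - not measured or vendor figures.
    constexpr double big_core_area_in_e_cores = 4.0;   // one big core ~ the area of 4 E-cores
    constexpr double smt_uplift               = 1.25;  // big core + SMT: ~25% more MT throughput
    constexpr double e_core_vs_big_core       = 0.55;  // one E-core ~ 55% of a big core's throughput

    // MT throughput obtainable from the area of one big core:
    constexpr double with_smt     = smt_uplift;                                    // 1.25x
    constexpr double with_e_cores = big_core_area_in_e_cores * e_core_vs_big_core; // 2.2x

    static_assert(with_e_cores > with_smt,
                  "under these assumptions, spending the area on E-cores wins for MT");

With a 3x area ratio and a 50% E-core the conclusion is the same; the point is only that the E-core side scales with the area ratio, while the SMT uplift does not.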


> Maybe larger caches are reducing the benefits of SMT, but it should be beneficial as long as there are many threads who are generally bound by memory latency.

I thought the reason SMT sometimes resulted in lower performance was that it halved the available cache per thread though - shouldn't larger caches make SMT more effective?


My understanding is that a larger cache can make SMT more effective, but like usual, only in certain cases.

Let’s imagine we have 8 cores with SMT, and we’re running a task that (in theory) scales roughly linearly up to 16 threads. If each thread’s working set is around half the cache available to it, but each working set is only used briefly, then SMT is going to be hugely beneficial: while one hyperthread is committing and fetching memory, the other one already has a new working set in cache and can begin computing. Increasing the cache will increase the allowable working set size without causing cache contention between hyperthreads.

Alternatively, if the working set is sufficiently large per thread (probably >2/3 of the cache available), SMT becomes substantially less useful. When the first hyperthread finishes its work, the second hyperthread still has to wait for some (or all) of its working set to be fetched from main memory (or from higher cache levels, if lucky). This may take just as long as simply keeping hyperthread #1 fed with new working sets. Increasing the cache in this scenario will increase SMT performance almost linearly, until each hyperthread’s working set can be prefetched into the lowest cache levels while the other hyperthread is busy working.

Also consider the situation where the working set is much, much smaller than the available cache, but lots of computation must be done on it. In this case, a single hyperthread can continually be fed with new data, since the old set can be written back to main memory and the next set loaded into cache long before the current set is processed. SMT provides no benefit here no matter how large you grow the cache (unless the tasks use wildly different components of the core and can run with instruction-level parallelism - but that’s tricky to get right, and you may run into thermal or power throttling before you can actually get enough performance to make it worthwhile).

Of course the real world is way more complicated than that. Many tasks do not scale linearly with more threads. Sometimes running on 6 “real” cores vs 12 SMT threads can result in no performance gain, but running on 8 “real” cores is 1/3 faster. And sometimes SMT will give you a non-linear speedup but a few more (non-SMT) cores will give you a better (but still non-linear) speedup. So short answer: yes, sometimes more cache makes SMT more viable, if your tasks can be 2x parallelized, have working sets around the size of the cache, and work on the same set for a notable chunk of the time required to store the old set and fetch the next one.

And of course all of this requires the processor and/or compiler to be smart enough to ensure the cache is properly fed new data from main memory. This is frequently the case these days, but not always.
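A sketch of the kind of sweep that separates these cases (the working-set sizes, pass count, and per-element work are assumptions to tune for the machine at hand): each thread gets a private working set of a chosen size and makes repeated passes over it, and you compare total throughput at N threads against 2N threads as the working set grows past each thread's share of the cache.

    // Build: g++ -O2 -std=c++17 ws_sweep.cpp -o ws_sweep -pthread
    // Usage: ./ws_sweep <working set in KiB per thread> <thread count>
    #include <chrono>
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    int main(int argc, char** argv) {
        const std::size_t kib     = argc > 1 ? std::stoul(argv[1]) : 256;  // per-thread working set
        const unsigned    threads = argc > 2 ? static_cast<unsigned>(std::stoul(argv[2]))
                                             : std::thread::hardware_concurrency();
        const std::size_t n       = kib * 1024 / sizeof(std::uint64_t);
        const int         passes  = 2000;

        auto work = [n, passes](std::uint64_t seed, std::uint64_t& out) {
            std::vector<std::uint64_t> set(n, seed);          // this thread's private working set
            std::uint64_t acc = 0;
            for (int p = 0; p < passes; ++p)
                for (std::size_t i = 0; i < n; ++i) {
                    // touch every element with a little ALU work per pass
                    set[i] = set[i] * 6364136223846793005ull + 1442695040888963407ull;
                    acc ^= set[i];
                }
            out = acc;                                        // keep the result live
        };

        std::vector<std::uint64_t> sinks(threads);
        std::vector<std::thread> pool;
        const auto t0 = std::chrono::steady_clock::now();
        for (unsigned t = 0; t < threads; ++t)
            pool.emplace_back(work, t + 1, std::ref(sinks[t]));
        for (auto& th : pool) th.join();
        const auto t1 = std::chrono::steady_clock::now();

        const double secs = std::chrono::duration<double>(t1 - t0).count();
        std::cout << threads << " threads, " << kib << " KiB/thread: "
                  << static_cast<double>(threads) * passes * n / secs / 1e9
                  << " G elements/s\n";
    }

Comparing, say, 64 KiB per thread at 8 and at 16 threads against 1024 KiB per thread at the same thread counts shows roughly where the extra SMT threads stop paying off on a given cache hierarchy.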


Let's say your workload consists solely of traversing a single linked list. This list fits perfectly in L1.

As an L1 load takes 4 cycles and you can't start the next load until you have completed the previous one, the CPU will stall, doing nothing, for 3/4 of the cycles. 4-way SMT could in principle make use of all the wasted cycles.

Of course no workload is even close to purely traversing a linked list, but a lot of non-HPC real-world workloads do spend a lot of time in latency-limited sections that can benefit from SMT, so it is not just cache misses.
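A minimal version of that scenario (the node count and timing loop are mine; the 4-cycle load-to-use figure is from the comment above): a list small enough to stay resident in L1, traversed with one data-dependent load per node, so the traversal speed is set almost entirely by load latency rather than by cache misses.

    // Build: g++ -O2 -std=c++17 l1_chase.cpp -o l1_chase
    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <random>
    #include <vector>

    struct Node { Node* next; };

    int main() {
        // ~512 nodes * 8 bytes = 4 KiB: comfortably inside a typical 32-48 KiB L1D.
        const std::size_t n = 512;
        std::vector<Node> nodes(n);

        // Link the nodes into one random cycle so a stride prefetcher
        // cannot simply stream ahead of the traversal.
        std::vector<std::size_t> order(n);
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::shuffle(order.begin(), order.end(), std::mt19937{7});
        for (std::size_t i = 0; i + 1 < n; ++i)
            nodes[order[i]].next = &nodes[order[i + 1]];
        nodes[order[n - 1]].next = &nodes[order[0]];

        const std::size_t steps = 400'000'000;
        Node* p = &nodes[order[0]];
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t s = 0; s < steps; ++s)
            p = p->next;                       // one dependent L1 load per hop
        const auto t1 = std::chrono::steady_clock::now();

        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        std::cout << ns << " ns per hop (several clock cycles per hop means the core "
                     "is mostly waiting on the load)\n";
        std::cout << "final node at " << static_cast<const void*>(p) << "\n";  // keep p live
    }

On Linux, running one copy pinned to a logical CPU and a second copy pinned to its SMT sibling (listed in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list) and seeing the per-hop time barely change is a direct way to watch the second hardware thread run in the cycles the first one wastes.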


> so it is not just cache misses.

Agreed 100%. SMT is waaaay more complex than just cache. I was just trying to illustrate in simple scenarios where increasing cache would and would not be beneficial to SMT.


Depends greatly on the workload.


I’d like to see the math for why it doesn’t work out to have a model where n real cores share a set of logic units for rare instructions and a few common ones - where, say, the average number of instructions per clock is 2.66, so four cores each have 2 units apiece and share 3 between them.

When this whole idea first came up that’s how I thought it was going to work, but we’ve had two virtual processors sharing all of their logic units in common instead.


It is difficult to share an execution unit between many cores, because in that case it must be placed at a great distance from at least the majority of those cores.

The communication between distant places on a chip requires additional area and power consumption and time. It may make the shared execution unit significantly slower than a non-shared unit and it may decrease the PPA (performance per power and area) of the CPU.

Such a sharing works only for a completely self-contained execution unit, which includes its own registers and which can perform a complex sequence of operations independently of the core that has requested it. In such cases the communication between a core and the shared unit is reduced to sending the initial operands and receiving the final results, while between these messages the shared unit operates for a long time without external communication. An example of such a shared execution unit is the SME accelerator of the Apple CPUs, which executes matrix operations.


Aside from large vector ALUs, execution units are cheap in transistor count. Caches, TLBs, memory for the various predictors, register files, and reorder buffers probably cost a significantly larger number of transistors.

In any case, execution units are clustered around ports, so, AFAIK, you wouldn't really be able to share at the instruction level, but only groups of related instructions.

Still, some sharing is possible: AMD tried to share FPUs in Bulldozer, but it didn't work well. Some other CPUs share cryptographic accelerators. IIRC Apple shares the matrix unit across cores.


Are predictors and TLBs shared between SMTs?


As far as I know yes, most structures are dynamically shared between hyperthreads.


I'm creating a game + engine and, speaking from personal experience/my use case, hyperthreading was less performant than (praying to the CPU thread-allocation god) each thread utilizing its own core. I decided to max out the number of threads by using std::thread::hardware_concurrency() / 2 - 1 (i.e. number of cores - 1).

I'm working with a std::vector
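For what it's worth, a sketch of that setup (the job list and the work done per job are placeholders I made up; the thread-count formula is the one from the comment): std::thread::hardware_concurrency() reports logical processors, so on a 2-way SMT part dividing by two approximates the physical core count, and the extra -1 leaves a core free for the main thread.

    // Build: g++ -O2 -std=c++17 workers.cpp -o workers -pthread
    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        // hardware_concurrency() counts logical (SMT) threads; /2 approximates
        // physical cores on a 2-way SMT CPU, and -1 keeps one core for the main thread.
        // (It can also return 0, so clamp to at least one worker.)
        const unsigned logical = std::thread::hardware_concurrency();
        const unsigned workers = logical > 3 ? logical / 2 - 1 : 1;

        // Placeholder job list: in the game this would be a std::vector of real work items.
        std::vector<std::function<void()>> jobs;
        for (int i = 0; i < 1000; ++i)
            jobs.emplace_back([i] { volatile int x = i * i; (void)x; });

        std::atomic<std::size_t> next{0};
        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w)
            pool.emplace_back([&] {
                // Each worker claims the next unprocessed job index until the list is drained.
                for (std::size_t j = next.fetch_add(1); j < jobs.size(); j = next.fetch_add(1))
                    jobs[j]();
            });
        for (auto& t : pool) t.join();

        std::cout << "ran " << jobs.size() << " jobs on " << workers << " worker threads\n";
    }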


On common industry benchmarks, at least every second generation of Intel hyperthreading ended up being slower than turning it off. Even when it worked, it was barely a double-digit percentage improvement, and there were periods when it was worse for consecutive generations. Why do they keep trying?


On the other hand, a lot of use cases get high speedups from SMT (e.g. "AMD’s implementation of simultaneous multithreading appears to be power efficient as it achieves 41% additional requests per second while consuming only 7% more power compared to the skewed baseline" in https://blog.cloudflare.com/measuring-hyper-threading-and-tu...)

Seems Intel is maybe just not very good at it.


Because the benchmarks don't measure what people do with computers.


Most benchmarks are fine-tuned code that provides pretty much the perfect case for running with SMT off, because the real-world use cases that benefit from SMT are absent in those benchmarks.


Even on server parts?



