My understanding is that a larger cache can make SMT more effective, but as usual, only in certain cases.
Let’s imagine we have 8 cores with SMT, and we’re running a task that (in theory) scales roughly linearly up to 16 threads. If each thread’s working set is around half the cache available to that thread, but each working set is only used briefly, then SMT is going to be hugely beneficial: while one hyperthread is writing back results and fetching new data, the other’s share of the cache is already filled with a fresh working set, so it can begin computing immediately. Increasing cache increases the allowable working set size without causing cache contention between hyperthreads.
Alternatively, if the per-thread working set is sufficiently large (probably more than 2/3 of the available cache), SMT becomes substantially less useful. When the first hyperthread finishes its work, the second still has to wait for some (or all) of its working set to be fetched from main memory (or from higher cache levels, if lucky). That may take just as long as simply keeping hyperthread #1 fed with new working sets. Increasing cache in this scenario improves SMT performance almost linearly, until each hyperthread’s working set can be prefetched into the cache levels closest to the core while the other hyperthread is busy working.
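A toy sketch of the two regimes above (mine, not anything measured): each thread loops over its own private working set, and WS_BYTES is a knob you sweep against your per-thread share of cache. WS_BYTES and PASSES are made-up numbers to tune for your machine.

    /* Working-set sweep sketch. Build: cc -O2 -fopenmp ws_sweep.c */
    #include <omp.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define WS_BYTES (512 * 1024)   /* per-thread working set: tune me */
    #define PASSES   2000           /* passes over each working set    */

    int main(void) {
        int nthreads = omp_get_max_threads();  /* set via OMP_NUM_THREADS */
        size_t n = WS_BYTES / sizeof(uint64_t);
        uint64_t sum = 0;
        double t0 = omp_get_wtime();
        #pragma omp parallel reduction(+:sum)
        {
            /* Private working set per thread, as in the scenarios above. */
            uint64_t *ws = malloc(WS_BYTES);
            for (size_t i = 0; i < n; i++) ws[i] = i;
            for (int p = 0; p < PASSES; p++)
                for (size_t i = 0; i < n; i++)
                    sum += ws[i] * 3 + p;      /* cheap work per element */
            free(ws);
        }
        printf("threads=%d ws=%dKiB time=%.3fs (sum=%llu)\n", nthreads,
               WS_BYTES / 1024, omp_get_wtime() - t0, (unsigned long long)sum);
        return 0;
    }

Run it as OMP_NUM_THREADS=8 ./a.out versus OMP_NUM_THREADS=16 ./a.out while sweeping WS_BYTES from well under to well over half the per-thread cache share; the SMT speedup should shrink as the two siblings start evicting each other’s lines.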
Also consider the situation where the working set is much, much smaller than the available cache, but lots of computation must be done on it. In this case a single hyperthread can be fed with new data continuously, since the old set can be written back to main memory and the next set loaded into cache long before the current set is finished. SMT provides no benefit here no matter how large you grow the cache (unless the two threads use wildly different execution units of the core and their instructions can be overlapped - but that’s tricky to get right, and you may hit thermal or power throttling before you actually gain enough performance to make it worthwhile).
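A sketch of that last case, assuming -O2 keeps the accumulators in registers: the data is only ~8 KiB, but several independent accumulator chains keep the FP units saturated, so a second hyperthread finds little idle hardware to borrow.

    /* Compute-bound toy: tiny working set, lots of math per element.
     * Independent accumulators give high ILP, so the core's FP units
     * stay busy and SMT has little to add. Build: cc -O2 ilp.c */
    #include <stdio.h>

    #define N    1024       /* 8 KiB of doubles: far smaller than any cache */
    #define REPS 1000000

    int main(void) {
        static double ws[N];
        for (int i = 0; i < N; i++) ws[i] = 1.0 + i * 1e-9;
        double a0 = 0, a1 = 0, a2 = 0, a3 = 0;  /* independent chains */
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i += 4) {
                a0 += ws[i + 0] * 1.000001;
                a1 += ws[i + 1] * 1.000002;
                a2 += ws[i + 2] * 1.000003;
                a3 += ws[i + 3] * 1.000004;
            }
        printf("%f\n", a0 + a1 + a2 + a3);      /* keep the math observable */
        return 0;
    }

Time two copies pinned to the two hyperthreads of one core versus two separate cores (on Linux the sibling numbering is in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list); if the loop really saturates the FP ports, the SMT pair should show little combined gain.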
Of course the real world is way more complicated than that. Many tasks do not scale linearly with more threads. Sometimes going from 6 “real” cores to 12 SMT threads on those same cores yields no gain at all, while 8 “real” cores are a third faster. And sometimes SMT will give you a non-linear speedup, but a few more (non-SMT) cores will give you a better (though still non-linear) speedup. So short answer: yes, sometimes more cache makes SMT more viable - if your task can be parallelized 2x further, has working sets around the size of the cache, and works on each set for a notable chunk of the time it takes to store the old set and fetch the next one.
And of course all of this requires the processor and/or compiler to be smart enough to ensure the cache is properly fed new data from main memory. This is frequently the case these days, but not always.
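One small tool for the cases where the hardware prefetcher can’t guess the next access (indirect indexing, pointer-heavy structures) is an explicit hint. A sketch using GCC/Clang’s __builtin_prefetch; the lookahead of 16 elements is a made-up distance you’d have to tune:

    /* Gather through an index array: the hardware prefetcher can't
     * predict src[idx[i]], so hint the load a few iterations ahead. */
    #include <stddef.h>

    void gather(double *dst, const double *src, const int *idx, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)   /* prefetch for read, low temporal locality */
                __builtin_prefetch(&src[idx[i + 16]], 0, 1);
            dst[i] = src[idx[i]];
        }
    }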
Let's say your workload consists solely of traversing a single linked list, and the list fits entirely in L1.
Since an L1 load takes about 4 cycles and you can't start the next load until the previous one has completed, the CPU will stall, doing nothing, for 3/4 of its cycles. A 4-way SMT core could in principle make use of all those wasted cycles.
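A quick single-threaded sketch of that effect (the 4-cycle figure and sizes are assumptions that vary by microarchitecture): chase one dependent chain versus several independent ones through a permutation that fits in L1. Interleaving independent chains overlaps the load latencies, which is roughly what 4-way SMT would do across threads.

    /* Pointer-chase demo: next_idx[] is a single random cycle in a
     * 16 KiB array, so every access hits L1 but each load depends on
     * the previous one. More lanes = more independent chains in flight.
     * Build: cc -O2 chase.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     4096          /* 16 KiB of uint32_t: fits a typical L1d */
    #define STEPS 400000000L    /* total chase steps per measurement */

    static uint32_t next_idx[N];

    static uint32_t chase(int lanes) {
        uint32_t p[4] = {0, N / 4, N / 2, 3 * N / 4};  /* chain heads */
        for (long k = 0; k < STEPS / lanes; k++)
            for (int l = 0; l < lanes; l++)
                p[l] = next_idx[p[l]];   /* serialized within each lane */
        return p[0] ^ p[1] ^ p[2] ^ p[3];
    }

    int main(void) {
        for (uint32_t i = 0; i < N; i++) next_idx[i] = i;
        for (uint32_t i = N - 1; i > 0; i--) {  /* Sattolo's shuffle: one cycle */
            uint32_t j = rand() % i, t = next_idx[i];
            next_idx[i] = next_idx[j];
            next_idx[j] = t;
        }
        for (int lanes = 1; lanes <= 4; lanes++) {
            clock_t t0 = clock();
            uint32_t x = chase(lanes);
            printf("lanes=%d time=%.2fs (x=%u)\n", lanes,
                   (double)(clock() - t0) / CLOCKS_PER_SEC, x);
        }
        return 0;
    }

With one lane, the time per step is roughly the load-use latency; with four lanes it should approach a quarter of that, which is exactly the headroom a 4-way SMT core would be harvesting.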
Of course no real workload comes even close to purely traversing a linked list, but a lot of non-HPC real-world workloads do spend significant time in latency-limited sections that can benefit from SMT, so it's not just about cache misses.
Agreed 100%. SMT is waaaay more complex than just cache. I was just trying to illustrate, with simple scenarios, where increasing cache would and would not benefit SMT.