Let's say your workload consists solely of traversing a single linked list, and the list fits entirely in L1.
As an L1 load takes 4 cycles and you can't start the next load until the previous one has completed, the CPU will stall, doing nothing, for 3 out of every 4 cycles. A 4-way SMT could in principle make use of all those wasted cycles.
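For what it's worth, here's a minimal sketch of that kind of pointer-chasing loop (my own illustration, not anything from a real benchmark suite; the node count and iteration count are arbitrary). The point is that each load's address comes from the previous load, so the core pays the full L1 load-to-use latency on every step:

```c
/* Pointer-chasing sketch: each iteration's load depends on the previous
 * load's result, so iterations cannot overlap within one thread. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    struct node *next;
};

#define N 1024  /* ~8 KB of nodes, small enough to stay resident in L1 */

int main(void) {
    struct node *nodes = malloc(N * sizeof *nodes);
    if (!nodes) return 1;

    /* Link the nodes into a cycle. A real latency benchmark would shuffle
     * the order so the hardware prefetcher can't guess the next address. */
    for (int i = 0; i < N; i++)
        nodes[i].next = &nodes[(i + 1) % N];

    /* The dependent chain: with a 4-cycle L1 load, this thread keeps the
     * load unit busy only 1 cycle in 4; the rest is idle unless another
     * SMT thread (or independent work) fills the gap. */
    struct node *p = &nodes[0];
    for (long i = 0; i < 100000000L; i++)
        p = p->next;

    printf("%p\n", (void *)p);  /* keep the chain from being optimized away */
    free(nodes);
    return 0;
}
```

Run four copies of that loop on four hardware threads sharing a core and, at least in principle, the other threads' independent load chains can use the cycles each thread spends waiting on its own load.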
Of course, no real workload comes even close to purely traversing a linked list, but a lot of non-HPC real-world workloads do spend a lot of time in latency-limited sections that can benefit from SMT, so it's not just about cache misses.
Agreed 100%. SMT is waaaay more complex than just cache. I was just trying to illustrate, with simple scenarios, where increasing cache would and would not be beneficial to SMT.