> L3 cache is shared by all cores of a CPU.
I recently learned this is no longer true! AMD EPYC processors (and maybe other recent AMDs?) divide cores into groups called "core complexes" (CCX), each of which has a separate L3 cache. My particular processor is a 32-core part where each set of 4 cores is a CCX. I discovered this when trying to figure out why a benchmark was performing wildly differently from one run to the next, with a bimodal distribution -- it turned out to depend on whether Linux had scheduled the two processes in my benchmark to run on the same CCX vs. different CCXs.
> I recently learned this is no longer true! AMD EPYC processors (and maybe other recent AMDs?) divide cores into groups called "core complexes" (CCX), each of which has a separate L3 cache.
It's still kinda true, just that "a CPU" in that context is a CCX. Cross-CCX communication has shown up a fair bit in reviews and benchmarks, and really in all chips at that scale (e.g. Ampere's Altras and Intel's Xeons): https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 -- and one of the "improvements" in Zen 3 is that the CCXs are much larger (there's one CCX per CCD, rather than 2), so there's less crosstalk.
And it could already be untrue previously, e.g. the Pentium D was rushed out by sticking two P4 dies on the same PCB; I think they even had to go through the northbridge to communicate, so they were dual-socket in all but physical conformation (hence being absolute turds). I don't think they had L3 at all though, so that wasn't really a factor, but still...
Yup, and that's why overclocking on AMD involves testing each of your CCXs/cores, marking the golden cores that can reach 5 GHz+ and the slower cores that don't overclock. It's also why AMD tests each CCX, moving worse ones to lower-end products and the best ones to the 5950X.
Then you can tweak the overclock per core and try to milk out performance; PBO tends to be pretty good at detecting which cores can take a slight performance boost. With a few BIOS options, all that adds up without much effort.
And then finally, get a better (more advanced) scheduler in Windows/Linux that can move workloads around per core depending on the workload. Windows has received multiple scheduler fixes for AMD, starting with Win10 1903.
I find scheduler mods and modest BIOS tweaks increase performance without much effort -- very noticeable gains.
On Linux I use xanmod (and pf-kernel before that).
On Windows I use Process Lasso and AMD PBO.
Slow/fast cores plus scheduler tweaks are also how Win11, ARM Macs, and Android make things appear faster.
Amusingly, scheduler/core tweaks have been around on Linux for a decade+, making the desktop super smooth, but they're only now mainstream in Win11 and ARM macOS.
There are core complexes in Apple’s M1 variations as well.
And it's not just the L3 that can be shared by various chips.
Each complex also has its own memory bandwidth, so running on two cores in two complexes will get around twice the memory bandwidth of two cores in the same complex.
I have that Pentium D 3.4 GHz CPU somewhere. It taught me important lessons: CPU clocks mean nothing if the architecture is botched, be wary of purchasing first-generation compute hardware, and, more importantly, don't trust big-name brands blindly.
Not only did I put a hole in my father's monthly budget by paying a premium for it, I continued to do so for years through the power bills for this inefficient crap.
I remember reading that some politics between Intel's India and Israel teams led to a rushed design blunder; I couldn't find that article now.
> And it could already be untrue previously, e.g. the Pentium D was rushed out by sticking two P4 dies on the same PCB; I think they even had to go through the northbridge to communicate
Part of me hopes you're wrong here. That is absolutely absurd.
But Intel got caught completely unaware by the switch to multi-core, just as it had been by the 64-bit switch.
The eventual Core 2 was not ready yet (Intel even had to bridge to it with the intermediate Core, which really had more to do with the Pentium M than with the Core 2, though it did feature a proper dual-core design... so much so that the Core Solo was actually a binned Duo with one core disabled).
So anyway, Intel was caught with its pants around its ankles for the second time, and they couldn't let that happen. And they actually beat AMD to market, having turned out a working dual-core design in the time between AMD's announcement of the dual-core Opteron (and strong hints of the X2) and the actual release, about 8 months.
To manage that, Intel could not rearchitect their chip (and probably didn't want to, as it'd become clear Netburst was a dead end), so they stapled two Prescott cores together, FSB included, and connected both to the northbridge.
It probably took more time to validate that solution for the server market, which is why, where AMD released the dual-core Opterons in April and the Athlons in May, it took until October for the first dual-core Xeon to become available.
"The Pentium D brand refers to two series of desktop dual-core 64-bit x86-64 microprocessors with the NetBurst microarchitecture, which is the dual-core variant of Pentium 4 "Prescott" manufactured by Intel. Each CPU comprised two dies, each containing a single core, residing next to each other on a multi-chip module package."
They didn't have to go through the northbridge itself, but they had to go over the frontside bus that connects the CPU to the northbridge (and would normally be point to point).
Just as information for some later readers: this is applicable not just to Epyc but also to the consumer version, Threadripper. My understanding is that this is an example of Non-Uniform Memory Access (NUMA), which has been used to link multiple CPUs in different sockets together for a long time now, but here it's been integrated into a CPU that fits in one socket.
This actually matters if you are running a VM on such a system, since the actual RAM (not the L3 cache) is often directly linked to a particular NUMA node. For example, accessing memory in the first RAM stick vs. the second will give different latencies, as the access goes ccx1 => ccx2 => stick2 versus ccx1 => stick1. This applies to, I think, the 2XXX versions and earlier for Threadripper. My understanding is that they solved this in later versions using the Infinity Fabric (IO die), so now all CCXs go through the IO die.
I ran into all of this trying to run an ubuntu machine that ran windows using KVM while passing through my nvidia graphics card.
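For anyone who wants to experiment with node-local allocation directly, here is a minimal sketch using libnuma; it assumes libnuma is installed and that the machine actually exposes more than one NUMA node (which, on single-socket Zen parts, may require the NPS/UEFI settings discussed further down). The node index and buffer size are arbitrary placeholders.

```c
/* numa_local.c -- sketch: keep a thread and its buffer on the same NUMA
 * node so accesses stay "local". Requires libnuma (link with -lnuma) and
 * only does anything useful if the system exposes more than one node. */
#include <stdio.h>
#include <string.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    printf("nodes: 0..%d\n", numa_max_node());

    int node = 0;                 /* arbitrary choice for the sketch */
    size_t len = 64UL << 20;      /* 64 MiB, also arbitrary */

    /* Allocate memory backed by pages on `node`... */
    char *buf = numa_alloc_onnode(len, node);
    if (!buf) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }

    /* ...and restrict this thread to CPUs of the same node, so the
     * accesses below stay node-local rather than crossing the fabric. */
    if (numa_run_on_node(node) != 0) { perror("numa_run_on_node"); return 1; }

    memset(buf, 0xab, len);       /* stand-in for real work */

    numa_free(buf, len);
    return 0;
}
```

Build with `gcc numa_local.c -lnuma`; `numactl --hardware` shows which nodes the kernel actually sees.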
> Just as information for some later readers: this is applicable not just to Epyc but also to the consumer version, Threadripper.
It's applicable to any Zen design with more than one CCX, which is... any Zen 3 CPU of more than 8 cores (in Zen 2 it was 4).
The wiki has the explanation under the "Core config" entry of the Zen CPUs, but for the 5000s it's all the 59xx (12 and 16 cores).
Zen 3 APUs are all single-CCX, though there are Zen 2 parts in the 5000 range which are multi-CCX (because why not confuse people): the 6-core 5500U is a 2x3 and the 8-core 5700U is a 2x4.
The rest is either low-core-count enough to be single-CCX Zen 2 (5300U) or tops out at a single 8-core CCX (everything else).
Pro tip: if you care about performance, and especially if you care about reliable performance measurements, you should pin your threads to specific cores. Using isolcpus (isolated CPUs) is the next step.
In other words: if your application is at the mercy of Linux making bad decisions about what threads run where, that is a performance bug in your app.
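For concreteness, a minimal Linux pinning sketch; core numbers 2 and 3 are arbitrary placeholders, so pick cores that do (or don't) share an L3 depending on what you want to measure.

```c
/* pin.c -- minimal Linux pinning sketch. Core numbers 2 and 3 are
 * arbitrary; choose cores that share (or don't share) an L3 for your test. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    /* Pin this worker thread to core 3. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    printf("worker running on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    /* Pin the main thread (pid 0 == calling thread) to core 2. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);

    printf("main running on CPU %d\n", sched_getcpu());
    return 0;
}
```

Build with `gcc pin.c -pthread`. Booting with `isolcpus=` (or using cpusets) to keep everything else off those cores is what makes measurements on them repeatable.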
How would you do this when running in Kubernetes? AFAIK it just guarantees that you get scheduled on some CPU for the "right" amount of time, not which CPU that is.
I only know of one mechanism (the kubelet's static CPU Manager policy) that gives you dedicated cores, if you set the CPU request and limit both to the same whole number of CPUs (i.e. equal multiples of 1000m).
In post-Docker, pre-Kubernetes times I used `--cpuset-cpus` in the docker args to dedicate specific CPUs to Redis instances, using CoreOS and fleet for cluster orchestration.
We use Kubernetes to host apps running Node.js, and K8s is great at keeping the apps running, scaling the pods, and rolling out upgrades. Use the right tool for the job. If you are trying to saturate a 24-core server with CPU-bound computation, don't think Kubernetes.
> Do we have requests that take longer, or did Linux do something dumb with thread affinity?
Yes.
Cross-socket communication does take longer, but a properly configured NUMA-aware OS should probably have segregated threads of the same process to the same socket, so performance should have increased linearly from 1 to 12 threads, then fallen off a cliff as the cross-socket effect started blowing up performance.
Out of curiosity — was the “bad” mode a case of two separate workloads competing for limited cache within one CCX, or was it really bad cache behaviour across CCXs causing problems because the two processes shared memory?
The two processes were a client and a server; the client just sends HTTP requests to the server over a unix socket, but otherwise they don't share resources. TBH I'm surprised there was so much impact just from that. There might be something more to the story -- I haven't investigated much yet. But basically, when I used `taskset` to limit the benchmark to two cores in different CCXs, it ran at about half the speed compared to when it was limited to two cores in the same CCX. I guess another possibility is that the kernel is allowing both processes to swap between cores regularly, which would much better explain the disastrous performance penalty. But in that case I don't understand the bimodal distribution I see when running the benchmark with no core affinity...
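One cheap way to test the migration theory is to have each process periodically log which CPU it is currently on; a minimal sketch (the sample interval and count are arbitrary):

```c
/* where_am_i.c -- sketch: log which CPU this process is running on over
 * time, to see whether the scheduler keeps migrating it between cores/CCXs. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int last = -1;
    for (int i = 0; i < 100; i++) {
        int cpu = sched_getcpu();
        if (cpu != last) {              /* only print when we move */
            printf("iteration %3d: now on CPU %d\n", i, cpu);
            last = cpu;
        }
        usleep(100 * 1000);             /* 100 ms between samples */
    }
    return 0;
}
```

Mapping the printed CPU numbers back to CCXs can be done from the sysfs `shared_cpu_list` files (see the sketch at the end of the thread).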
I've always wondered how much this affects shared cloud instances (for example, a single c5.small instance). Can your neighbor, who shares your L3 cache, be using memory so much more aggressively than you that it causes you to evict your L2/L1 cache? Or is cache coherency maintained if the core that evicted your L3 cache is known to be living in a different memory space?
Coherency will be maintained (since the protocols support that case). But yes, a separate process would evict that memory. From the processor's point of view, they're just addresses tagged with data. Caching behavior doesn't depend on virtual memory addresses because I believe that info is stripped away at that point.
So if someone is thrashing the cache on the same CPU you're on, you will notice it if the cache isn't being shared effectively.
The contents of the cache aren't stored as part of a paused process or context switch. But I'd appreciate a correction here if I'm wrong.
For an example, consider two processes A and B running on a set of cores. If A makes many more memory accesses than B, A can effectively starve B of "cached" memory accesses because A accesses memory more frequently.
If B were run alone, its working set would fit in cache, effectively letting the algorithm operate from cache instead of RAM.
BUT. You really have to be hitting the caches hard. It doesn't happen too often in casual applications. I only encountered this on GPUs (where each core has a separate L1 but a shared L2). Even then it's only a problem if every core is hitting different cache lines.
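As a toy illustration of the A/B scenario above (all sizes and pass counts are arbitrary assumptions): time the loop once alone with a working set that fits in L3, then again while a second copy with a much larger working set runs on a core sharing that L3.

```c
/* cache_contend.c -- toy for the A/B example above. Working-set size is
 * given in MiB on the command line: run a small one (fits in L3) alone,
 * then again while a large copy ("the thrasher") runs on a core that
 * shares the same L3. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(int argc, char **argv)
{
    size_t mib = argc > 1 ? strtoul(argv[1], NULL, 10) : 8;
    size_t n = mib * (1UL << 20) / sizeof(uint64_t);

    uint64_t *buf = malloc(n * sizeof(uint64_t));
    if (!buf) { perror("malloc"); return 1; }
    for (size_t i = 0; i < n; i++) buf[i] = i;     /* fault in real pages */

    volatile uint64_t sink = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int pass = 0; pass < 50; pass++)
        for (size_t i = 0; i < n; i += 8)          /* one load per 64-byte line */
            sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%zu MiB working set: %.3f s for 50 passes (sink=%llu)\n",
           mib, dt, (unsigned long long)sink);
    free(buf);
    return 0;
}
```

Timing `./cache_contend 8` alone versus alongside `./cache_contend 256` pinned to a core under the same L3 should show the slowdown described above, with the exact numbers depending entirely on the machine.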
> Can your neighbor, who shares your L3 cache, be using memory so much more aggressively than you that it causes you to evict your L2/L1 cache?
I'm curious. How could the "neighbor" possibly evict your L1/L2 when it is local to you? Worst it can do is thrash L3 like crazy but if your own data is on L1/L2, how would that get affected?
I believe most caches are designed so that a line in a lower-level cache (closer to the core) learns whether it is still valid, or has been changed, from the cache level above it, so typically everything in L1 is also in L2 and L3. If the data is no longer in L3, that may force the L1 to reload the memory (into L3, then L2, then L1) to know that it is still valid.
That's true for Zen 1, Zen+, and Zen 2, but they changed it with Zen 3. Latency went up from 40 cycles to 47 cycles, but the L3 capacity accessible to a single core doubled, and it greatly improved single-threaded performance, since a single thread can now utilize the full CCD's L3 cache.
You can enable node-per-CCX mode in UEFI to do that. Otherwise you have to use tools like hwloc to discover the cache topology and pin workloads accordingly.
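A minimal hwloc (2.x) sketch of that discover-and-pin approach, binding the current process to whichever cores share one L3; picking L3 number 0 is an arbitrary choice.

```c
/* l3_pin.c -- sketch using hwloc to find the L3 ("CCX") domains and bind
 * the current process to the cores under one of them. Link with -lhwloc. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int n_l3 = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
    printf("found %d L3 caches\n", n_l3);

    if (n_l3 > 0) {
        /* Bind this process to all PUs sharing the first L3. */
        hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, 0);
        if (hwloc_set_cpubind(topo, l3->cpuset, HWLOC_CPUBIND_PROCESS) != 0)
            perror("hwloc_set_cpubind");

        char buf[256];
        hwloc_bitmap_snprintf(buf, sizeof(buf), l3->cpuset);
        printf("bound to PUs %s\n", buf);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```

Build with `gcc l3_pin.c -lhwloc`; `lstopo` from the same package draws the whole package/CCX/L3 layout, which is handy for sanity-checking.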
Press delete or F1 or whatever at boot and look through the menus for the nodes per socket (NPS) setting. This setting may not exist on desktop systems.
Oh, right. I knew that as the BIOS setup, but I guess my terminology is just out of date. I see the "NUMA nodes per socket" option now. Not quite sure yet if that actually allows having two nodes for the two CCXs on my one socket, but I'll play with it...
Reporting back in case anyone looks at this: I think this option does nothing on my system (ASUS TUF Gaming X570-PRO motherboard, latest BIOS version 4021). The "ACPI SRAT L3 Cache as NUMA Domain" also does nothing. Somewhere I read these options are for 2-socket machines only (even though in concept they'd make NUMA meaningful on 1-socket machines). I'd be interested if anyone finds otherwise.
I also tried passing "numa=fake=32G" or "numa=fake=2" to Linux. That syntax seems to match the documentation [1] but produced an error "Malformed early option 'numa'". Haven't dug into why. I'm not sure it'd correctly partition the cores anyway.
> I also tried passing "numa=fake=32G" or "numa=fake=2" to Linux. That syntax seems to match the documentation [1] but produced an error "Malformed early option 'numa'". Haven't dug into why. I'm not sure it'd correctly partition the cores anyway.
Oh, because the stock Ubuntu kernel doesn't enable CONFIG_NUMA_EMU.
https://en.wikipedia.org/wiki/Epyc shows the "core config" of each model, which is (number of CCX) x (cores per CCX).
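To see the same grouping on a live system without extra tooling, each CPU's `/sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list` lists the CPUs it shares an L3 with. A small sketch that just prints those, assuming index3 is the L3 on your machine (worth double-checking against the corresponding `level` file):

```c
/* l3_groups.c -- print which CPUs share an L3 with each CPU, straight
 * from sysfs. Assumes cache "index3" is the L3 on this system; check the
 * corresponding "level" file if unsure. */
#include <stdio.h>

int main(void)
{
    for (int cpu = 0; ; cpu++) {
        char path[128], list[256];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                       /* ran out of CPUs (or no L3 info) */
        if (fgets(list, sizeof(list), f))
            printf("cpu%-3d shares L3 with CPUs %s", cpu, list);
        fclose(f);
    }
    return 0;
}
```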