
This kind of [stuff] drives me nuts every time I’m trying to read the performance telemetry tea leaves at work.

Do we have requests that take longer, or did Linux do something dumb with thread affinity?




Pro tip: if you care about performance, and especially if you care about reliable performance measurements, you should pin your threads to specific CPUs. Using isolcpus (cores removed from the kernel's general scheduling via the isolcpus boot parameter, so only explicitly pinned threads run on them) is the next step.

In other words: if your application is at the mercy of Linux making bad decisions about what threads run where, that is a performance bug in your app.
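A minimal sketch of the programmatic route on Linux, using sched_setaffinity(2) to pin the calling thread to one core (the core number is illustrative, error handling kept short):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);  /* pin the calling thread to logical CPU 2 (illustrative) */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = the calling thread */
            perror("sched_setaffinity");
            return 1;
        }
        printf("running on CPU %d\n", sched_getcpu());
        return 0;
    }

The same effect without touching code: `taskset -c 2 ./app`. Combine that with isolcpus=2 on the kernel command line and the core is yours alone.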


How would you do this when running in Kubernetes? AFAIK it just guarantees that you get scheduled on some CPU for the “right” amount of time, not which CPU that is.

I only know that there is one scheduler policy (the kubelet's static CPU Manager) that gives you dedicated cores if you set the CPU request and limit both to the same integer number of CPUs, i.e. equal multiples of 1000m.


Back in the post-Docker, pre-Kubernetes days I used `--cpuset-cpus` in the docker args to dedicate specific CPUs to Redis instances, using CoreOS and fleet for cluster orchestration.
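For reference, that looks like this (CPU numbers and image tag are illustrative):

    docker run --cpuset-cpus="2,3" redis:7

which restricts the container's processes to logical CPUs 2 and 3.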


> AFAIK it just guarantees that you get scheduled on some CPU for the “right” amount of time, not which CPU that is.

If you set the CPU request equal to the CPU limit (as an integer number of CPUs, with the static CPU Manager policy enabled on the kubelet), the container will be pinned to dedicated cores via cgroup cpusets [0].
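A minimal pod sketch that qualifies, assuming the kubelet runs with `--cpu-manager-policy=static` (name and image are placeholders). Both CPU values must be equal integers, and memory request/limit must match too so the pod lands in the Guaranteed QoS class:

    apiVersion: v1
    kind: Pod
    metadata:
      name: pinned-example        # placeholder
    spec:
      containers:
      - name: app
        image: example.com/app:1  # placeholder
        resources:
          requests:
            cpu: "2"              # integer CPUs, equal to the limit
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "1Gi"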

There is also a way to influence NUMA node allocation using the Memory Manager [1].

[0] https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...

[1] https://kubernetes.io/docs/tasks/administer-cluster/memory-m...


Another reason to put off moving to kubernetes.


We use Kubernetes to host apps running Node.js, and K8s is great at keeping the apps running, scaling the pods, and rolling out upgrades. Use the right tool for the job: if you are trying to saturate a 24-core server with CPU-bound computation, don't think Kubernetes.


> Do we have requests that take longer, or did Linux do something dumb with thread affinity?

Yes.

Cross-socket communication does take longer, but a properly configured NUMA-aware OS should have kept all threads of a given process on the same socket. In that case performance would scale roughly linearly from 1 to 12 threads, then fall off a cliff once threads spill onto the second socket and cross-socket effects start destroying performance.
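One quick way to test that hypothesis, assuming numactl is installed and ./app is the workload in question:

    numactl --cpunodebind=0 --membind=0 ./app

This confines both threads and memory allocations to NUMA node 0; if the cliff disappears (at the cost of being capped at one socket's worth of cores), cross-socket traffic was the culprit.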



