
This kind of [stuff] drives me nuts every time I’m trying to read the performance telemetry tea leaves at work.

Do we have requests that take longer, or did Linux do something dumb with thread affinity?




Pro tip: if you care about performance, and especially if you care about reliable performance measurements, you should pin your threads to specific CPUs. Using isolcpus (cores removed from the kernel's general scheduling via the isolcpus boot parameter, so only explicitly pinned threads run on them) is the next step.

In other words: if your application is at the mercy of Linux making bad decisions about what threads run where, that is a performance bug in your app.
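A minimal sketch of the programmatic route on Linux, using sched_setaffinity(2) to pin the calling thread to one core (the core number is illustrative, error handling kept short):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);  /* pin the calling thread to logical CPU 2 (illustrative) */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = the calling thread */
            perror("sched_setaffinity");
            return 1;
        }
        printf("running on CPU %d\n", sched_getcpu());
        return 0;
    }

The same effect without touching code: `taskset -c 2 ./app`. Combine that with isolcpus=2 on the kernel command line and the core is yours alone.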


How would you do this when running in Kubernetes? AFAIK it just guarantees that you get scheduled on some CPU for the “right” amount of time, not which CPU that is.

I only know that there is one scheduler policy (the kubelet's static CPU Manager) that gives you dedicated cores if you set the CPU request and limit both to the same integer number of CPUs, i.e. equal multiples of 1000m.


Back in the post-Docker, pre-Kubernetes days I used `--cpuset-cpus` in the docker args to dedicate specific CPUs to Redis instances, using CoreOS and fleet for cluster orchestration.
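For reference, that looks like this (CPU numbers and image tag are illustrative):

    docker run --cpuset-cpus="2,3" redis:7

which restricts the container's processes to logical CPUs 2 and 3.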


> AFAIK it just guarantees that you get scheduled on some CPU for the “right” amount of time, not which CPU that is.

If you set the CPU request equal to the CPU limit (as an integer number of CPUs, with the static CPU Manager policy enabled on the kubelet), the container will be pinned to dedicated cores via cgroup cpusets [0].
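A minimal pod sketch that qualifies, assuming the kubelet runs with `--cpu-manager-policy=static` (name and image are placeholders). Both CPU values must be equal integers, and memory request/limit must match too so the pod lands in the Guaranteed QoS class:

    apiVersion: v1
    kind: Pod
    metadata:
      name: pinned-example        # placeholder
    spec:
      containers:
      - name: app
        image: example.com/app:1  # placeholder
        resources:
          requests:
            cpu: "2"              # integer CPUs, equal to the limit
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "1Gi"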

There is also a way to influence NUMA node allocation using the Memory Manager [1].

[0] https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...

[1] https://kubernetes.io/docs/tasks/administer-cluster/memory-m...


Another reason to put off moving to kubernetes.


We use Kubernetes to host apps running Node.js, and K8s is great at keeping the apps running, scaling the pods, and rolling out upgrades. Use the right tool for the job: if you are trying to saturate a 24-core server with CPU-bound computation, don't think Kubernetes.


> Do we have requests that take longer, or did Linux do something dumb with thread affinity?

Yes.

Cross-socket communication does take longer, but a properly configured NUMA-aware OS should have kept all threads of a given process on the same socket. In that case performance would scale roughly linearly from 1 to 12 threads, then fall off a cliff once threads spill onto the second socket and cross-socket effects start destroying performance.
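One quick way to test that hypothesis, assuming numactl is installed and ./app is the workload in question:

    numactl --cpunodebind=0 --membind=0 ./app

This confines both threads and memory allocations to NUMA node 0; if the cliff disappears (at the cost of being capped at one socket's worth of cores), cross-socket traffic was the culprit.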



