Hacker News

Thanks for the "hwloc" tip. I hadn't thought about that.

I was thinking of doing something like that. Weirdly, I got sustained throughput differences when I killed & restarted fio. So, if I got 11M IOPS, it stayed at that level until I killed fio & restarted. If I got 10.8M next, it stayed like that until I killed & restarted it.

This makes me think that I'm hitting some PCIe/memory bottleneck, dependent on process placement (which process happens to need to move data across infinity fabric due to accessing data through a "remote" PCIe root complex or something like that). But then I realized that Zen 2 has a central IO hub again, so there shouldn't be a "far edge of I/O" like on current gen Intel CPUs (?)

But there's definitely some workload placement and I/O-memory-interrupt affinity that I've wanted to look into. I could even enable the NUMA-like-mode from BIOS, but again with Zen 2, the memory access goes through the central infinity-fabric chip too, I understand, so not sure if there's any value in trying to achieve memory locality for individual chiplets on this platform (?)




So there are two parts to CPU affinity: a) the CPU assigned to the SSD for handling interrupts, and b) the CPU assigned to fio. numactl is your friend for experimenting with changing fio affinity.

https://access.redhat.com/documentation/en-us/red_hat_enterp... tells you how to tweak irq handlers.

You usually want to change both. Pinning each fio process + each interrupt handler to specific CPUs will reach the highest perf.

You can even use the isolcpus kernel parameter to reduce jitter from things you don't care about and minimize latency (won't do much for bandwidth).
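Putting those two knobs together, a rough sketch of the workflow (the IRQ number, device name, and CPU IDs below are made up for illustration; check /proc/interrupts and `numactl --hardware` for the real ones on your box):

```shell
# Find the IRQs belonging to an NVMe drive; its queue IRQs
# show up with names like nvme0q0, nvme0q1, ...
grep nvme0 /proc/interrupts

# Pin a hypothetical IRQ 142 to CPU 4 (needs root).
# smp_affinity takes a hex bitmask, smp_affinity_list a CPU list:
echo 10 > /proc/irq/142/smp_affinity        # bitmask 0x10 = CPU 4
echo 4  > /proc/irq/142/smp_affinity_list   # same thing, as a list

# Run fio pinned to CPUs 4-7, allocating memory only from NUMA node 0:
numactl --physcpubind=4-7 --membind=0 fio job.fio
```

For isolcpus, the same CPU range would go on the kernel command line (e.g. `isolcpus=4-7`), which keeps the scheduler from placing other tasks there.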


The PCIe is all on a single IO die, but internally it is organized into quadrants that can produce some NUMA effects. So it is probably worth trying out the motherboard firmware settings to expose your CPU as multiple NUMA nodes, and using the FIO options to allocate memory only on the local node, and restricting execution to the right cores.
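Once the firmware exposes multiple NUMA nodes, fio can do the placement itself if it was built with libnuma support. A sketch (node number, device, and job parameters are examples, not a tested recipe):

```shell
# Bind fio's threads and its buffer allocations to NUMA node 1,
# assuming that's the node the NVMe drive hangs off:
fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k \
    --ioengine=io_uring --iodepth=128 --direct=1 \
    --numa_cpu_nodes=1 --numa_mem_policy=bind:1
```

The same effect can be had externally with `numactl --cpunodebind=1 --membind=1 fio ...` if your fio build lacks the NUMA options.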


Yep, I enabled the "numa-like-awareness" in BIOS and ran a few quick tests to see whether the NUMA-aware scheduler/NUMA balancing would do the right thing and migrate processes closer to their memory over time, but didn't notice any benefit. But yep, I haven't manually locked down the execution and memory placement yet. This placement may well explain why I saw ~5% throughput fluctuations only when killing & restarting fio, and not while the same test was running.


I have done some tests on AMD servers, and the Linux scheduler does a pretty good job. I do, however, get noticeably better performance (a couple percent) by forcing the process to run on the correct NUMA node.

Make sure you get as many numa domains as possible in your BIOS settings.

I recommend using numactl with CPU and memory binding (--cpunodebind and --membind). I have noticed a slight performance drop when the RAM cache fills beyond the sticks local to the CPUs doing the work.

One last comment: you mentioned interrupts being "striped" among CPUs. I would recommend pinning the interrupts from one disk to one NUMA-local CPU and using numactl to run fio for that disk on the same CPU. An additional experiment, if you have enough cores, is to pin interrupts to CPUs local to the disk, but use other cores on the same NUMA node for fio. That has been my most successful setup so far.
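That last setup might look roughly like this (the layout is hypothetical: it assumes nvme0 hangs off NUMA node 1, whose CPUs are 8-15; verify with `numactl --hardware` and the device's sysfs numa_node before copying):

```shell
# 1) Steer all of nvme0's queue IRQs to a single node-local CPU (8):
for irq in $(grep nvme0q /proc/interrupts | cut -d: -f1); do
    echo 8 > /proc/irq/$irq/smp_affinity_list
done

# 2) Run fio for that disk on the *other* cores of the same node,
#    so interrupt handling and submission don't fight for one CPU:
numactl --physcpubind=9-15 --membind=1 \
    fio --name=nvme0-test --filename=/dev/nvme0n1 --rw=randread \
        --bs=4k --iodepth=128 --ioengine=io_uring --direct=1
```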


I have the same box, but with the 32 core CPU and fewer NVMe drives. I've not poked at all the PCIe slots yet, but all that I've looked at are in NUMA node 1. This includes the on board M.2 slots. It is in NPS=4 mode.
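A quick way to check which NUMA node each drive is attached to is via the PCI device's sysfs entry (paths here follow the standard Linux NVMe sysfs layout):

```shell
# Report the NUMA node of each NVMe controller's PCI device:
for d in /sys/class/nvme/nvme*; do
    echo "$d -> node $(cat $d/device/numa_node)"
done
# A value of -1 means the firmware didn't report locality
# (common when the BIOS is in NPS1 mode).
```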


Mine goes only up to 2 NUMA nodes (as shown in numactl --hardware), despite setting NPS4 in BIOS. I guess it's because I have only 2 x 8-core chiplets enabled (?)


Yes, that is what I would expect.



