Hacker News

Thanks for the "hwloc" tip. I hadn't thought about that.

I was thinking of doing something like that. Weirdly, I got sustained throughput differences when I killed & restarted fio. So, if I got 11M IOPS, it stayed at that level until I killed fio & restarted. If I got 10.8M next, it stayed like that until I killed & restarted it.

This makes me think that I'm hitting some PCIe/memory bottleneck, dependent on process placement (which process happens to need to move data across infinity fabric due to accessing data through a "remote" PCIe root complex or something like that). But then I realized that Zen 2 has a central IO hub again, so there shouldn't be a "far edge of I/O" like on current gen Intel CPUs (?)

But there's definitely some workload placement and I/O-memory-interrupt affinity that I've wanted to look into. I could even enable the NUMA-like-mode from BIOS, but again with Zen 2, the memory access goes through the central infinity-fabric chip too, I understand, so not sure if there's any value in trying to achieve memory locality for individual chiplets on this platform (?)




So there are two parts to CPU affinity: a) the CPU assigned to the SSD for handling interrupts, and b) the CPU assigned to fio. numactl is your friend for experimenting with changing fio affinity.

https://access.redhat.com/documentation/en-us/red_hat_enterp... tells you how to tweak irq handlers.

You usually want to change both. Pinning each fio process + each interrupt handler to specific CPUs will reach the highest perf.

You can even use the isolcpus kernel parameter to reduce jitter from things you don't care about and minimize latency (won't do much for bandwidth).
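Putting those two knobs together, a rough sketch of the workflow (the IRQ number, device name, and CPU IDs below are made up for illustration; check /proc/interrupts and `numactl --hardware` for the real ones on your box):

```shell
# Find the IRQs belonging to an NVMe drive; its queue IRQs
# show up with names like nvme0q0, nvme0q1, ...
grep nvme0 /proc/interrupts

# Pin a hypothetical IRQ 142 to CPU 4 (needs root).
# smp_affinity takes a hex bitmask, smp_affinity_list a CPU list:
echo 10 > /proc/irq/142/smp_affinity        # bitmask 0x10 = CPU 4
echo 4  > /proc/irq/142/smp_affinity_list   # same thing, as a list

# Run fio pinned to CPUs 4-7, allocating memory only from NUMA node 0:
numactl --physcpubind=4-7 --membind=0 fio job.fio
```

For isolcpus, the same CPU range would go on the kernel command line (e.g. `isolcpus=4-7`), which keeps the scheduler from placing other tasks there.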


The PCIe is all on a single IO die, but internally it is organized into quadrants that can produce some NUMA effects. So it is probably worth trying out the motherboard firmware settings to expose your CPU as multiple NUMA nodes, and using the FIO options to allocate memory only on the local node, and restricting execution to the right cores.
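Once the firmware exposes multiple NUMA nodes, fio can do the placement itself if it was built with libnuma support. A sketch (node number, device, and job parameters are examples, not a tested recipe):

```shell
# Bind fio's threads and its buffer allocations to NUMA node 1,
# assuming that's the node the NVMe drive hangs off:
fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k \
    --ioengine=io_uring --iodepth=128 --direct=1 \
    --numa_cpu_nodes=1 --numa_mem_policy=bind:1
```

The same effect can be had externally with `numactl --cpunodebind=1 --membind=1 fio ...` if your fio build lacks the NUMA options.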


Yep, I enabled the "numa-like-awareness" in BIOS and ran a few quick tests to see whether the NUMA-aware scheduler/NUMA balancing would do the right thing and migrate processes closer to their memory over time, but didn't notice any benefit. But yep, I haven't manually locked down the execution and memory placement yet. This placement may well explain why I saw ~5% throughput fluctuations only when killing & restarting fio, and not while the same test was running.


I have done some tests on AMD servers, and the Linux scheduler does a pretty good job. I do, however, get noticeably better performance (a couple percent) by forcing the process to run on the correct NUMA node.

Make sure you get as many numa domains as possible in your BIOS settings.

I recommend using numactl with CPU and memory binding (--cpunodebind and --membind). I have noticed a slight performance drop when the RAM cache fills beyond the sticks local to the CPUs doing the work.

One last comment: you mentioned interrupts being "striped" among CPUs. I would recommend pinning the interrupts from one disk to one NUMA-local CPU and using numactl to run fio for that disk on the same CPU. An additional experiment, if you have enough cores, is to pin interrupts to CPUs local to the disk, but use other cores on the same NUMA node for fio. That has been my most successful setup so far.
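That last setup might look roughly like this (the layout is hypothetical: it assumes nvme0 hangs off NUMA node 1, whose CPUs are 8-15; verify with `numactl --hardware` and the device's sysfs numa_node before copying):

```shell
# 1) Steer all of nvme0's queue IRQs to a single node-local CPU (8):
for irq in $(grep nvme0q /proc/interrupts | cut -d: -f1); do
    echo 8 > /proc/irq/$irq/smp_affinity_list
done

# 2) Run fio for that disk on the *other* cores of the same node,
#    so interrupt handling and submission don't fight for one CPU:
numactl --physcpubind=9-15 --membind=1 \
    fio --name=nvme0-test --filename=/dev/nvme0n1 --rw=randread \
        --bs=4k --iodepth=128 --ioengine=io_uring --direct=1
```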


I have the same box, but with the 32 core CPU and fewer NVMe drives. I've not poked at all the PCIe slots yet, but all that I've looked at are in NUMA node 1. This includes the on board M.2 slots. It is in NPS=4 mode.
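A quick way to check which NUMA node each drive is attached to is via the PCI device's sysfs entry (paths here follow the standard Linux NVMe sysfs layout):

```shell
# Report the NUMA node of each NVMe controller's PCI device:
for d in /sys/class/nvme/nvme*; do
    echo "$d -> node $(cat $d/device/numa_node)"
done
# A value of -1 means the firmware didn't report locality
# (common when the BIOS is in NPS1 mode).
```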


Mine goes only up to 2 NUMA nodes (as shown in numactl --hardware), despite setting NPS4 in BIOS. I guess it's because I have only 2 x 8-core chiplets enabled (?)


Yes, that is what I would expect.



