Yep, I enabled the "numa-like-awareness" in BIOS and ran a few quick tests to see whether the NUMA-aware scheduler/NUMA balancing would do the right thing and migrate processes closer to their memory over time, but didn't notice any benefit. But yep, I haven't manually locked down the execution and memory placement yet. This placement may well explain why I saw some ~5% throughput fluctuations only when killing & restarting fio, not while the same test was running.
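To check where the memory actually sits during a run, something like this should work (a sketch, assuming the numactl package, which ships numastat, is installed):

    # Per-node memory footprint of the newest fio process, in MiB:
    numastat -p "$(pgrep -n fio)"

    # Raw per-mapping placement (Nx=y means y pages on node x):
    grep -m5 -E 'N[0-9]+=' /proc/"$(pgrep -n fio)"/numa_maps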
I have done some tests on AMD servers and the Linux scheduler does a pretty good job.
I do however get noticeably better performance (a couple of percent) by forcing the process to run on the correct NUMA node.
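As a rough sketch of what I mean (node 0 and the device/job-file names are just placeholders):

    # Check which node the disk hangs off (path assumes an NVMe drive nvme0):
    cat /sys/class/nvme/nvme0/device/numa_node

    # Run fio with both execution and allocation pinned to that node:
    numactl --cpunodebind=0 --membind=0 fio job.fio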
Make sure you expose as many NUMA domains as possible in your BIOS settings (on AMD EPYC that's the NPS option, e.g. NPS4).
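You can verify from Linux what the BIOS actually exposed:

    numactl --hardware    # nodes with their CPU lists and memory sizes
    lscpu | grep -i numa  # quick summary of node count and CPU lists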
I recommend using numactl to bind both CPU and memory placement (--cpunodebind and --membind). I have noticed a slight performance drop once the page cache fills beyond the DIMMs local to the CPUs doing the work.
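A rough way to watch that happening is the per-node meminfo breakdown (FilePages is the page cache resident on each node); dropping caches between runs keeps tests comparable:

    numastat -m | grep -E 'MemFree|FilePages'

    # Optional, as root, to start each run cold:
    echo 3 > /proc/sys/vm/drop_caches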
One last comment is that you mentioned interrupts being "striped" among CPUs. I would recommend pinning the interrupts from one disk to one NUMA-local CPU and using numactl to run fio for that disk on the same CPU.
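Roughly like this (nvme0 and the IRQ number are placeholders; note that on recent kernels NVMe uses managed IRQ affinity, so the write can fail with EIO -- this only works for drivers that still allow manual steering):

    # List the disk's interrupt lines:
    grep nvme0 /proc/interrupts

    # Stop irqbalance so it doesn't undo the pinning, then steer
    # hypothetical IRQ 123 to CPU 4:
    systemctl stop irqbalance
    echo 4 > /proc/irq/123/smp_affinity_list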
An additional experiment, if you have enough cores, is to pin the interrupts to CPUs local to the disk but run fio on other cores in the same NUMA node. That has been my most successful setup so far.
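A sketch of that layout, assuming node 0 spans CPUs 0-15 and the disk is nvme0 (both placeholders): IRQs pinned to CPUs 0-3, fio on the remaining cores of the same node:

    # Steer all of the disk's IRQs to CPUs 0-3:
    for irq in $(awk '/nvme0/ {gsub(":","",$1); print $1}' /proc/interrupts); do
        echo 0-3 > "/proc/irq/$irq/smp_affinity_list"
    done

    # Run fio on the other cores of node 0, with node-local memory:
    numactl --physcpubind=4-15 --membind=0 fio job.fio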