
The system that was tested there was PCIe bandwidth constrained because this was a few years ago. With your system, it'll get a bigger number - probably 14 or 15 million 4KiB IO per second per core.

But while SPDK does have an fio plug-in, unfortunately you won't see numbers like that with fio. There's way too much overhead in the tool itself. We can't get beyond 3 to 4 million with that. We rolled our own benchmarking tool in SPDK so we can actually measure the software we produce.

Since the core is CPU bound, 512B IOs are going to net the same IOs per second as 4KiB. The software overhead in SPDK is fixed per IO, regardless of size. You can also run more threads with SPDK than just one - it has no locks or cross-thread communication, so it scales linearly with additional threads. You can push systems to 80-100M IOs per second if you have disks and bandwidth that can handle it.
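
For example, the perf benchmark that ships with SPDK (more on it below) takes a core mask, so running the same random read workload on four cores instead of one is something like this (flag names from memory - check -h on your version):

perf -c 0xF -q 32 -o 4096 -w randread -t 60

Each core drives its own queue pairs, so there's no shared state between them.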




Yeah, that’s what I wondered - I’m OK with using multiple cores; would I get even more IOPS when doing smaller I/Os? Is the benchmark suite you used part of the SPDK toolkit (and easy enough to run)?


Whether you get more IOPS with smaller I/Os depends on a number of things. Most drives these days are natively 4KiB blocks and are emulating 512B sectors for backward compatibility. This emulation means that 512B writes are often quite slow - probably slower than writing 4KiB (with 4KiB alignment). But 512B reads are typically very fast. On Optane drives this may not be true because the media works entirely differently - those may be able to do native 512B writes. Talk to the device vendor to get the real answer.
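
If you want to see what the drive itself reports, the namespace's supported LBA formats are visible with nvme-cli (assuming it's installed) - something like:

nvme id-ns -H /dev/nvme0n1

The output lists each LBA format with its data size (512 or 4096 bytes) and a relative performance hint, and marks which one is currently in use.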

For at least reads, if you don't hit a CPU limit you'll get 8x more IOPS with 512B than you will with 4KiB with SPDK. It's more or less perfect scaling. There's some additional hardware overhead in the MMU and PCIe subsystems with 512B because you're sending more messages for the same bandwidth, but my experience has been that it's mostly negligible.

The benchmark builds to build/examples/perf and you can just run it with -h to get the help output. Random 4KiB reads at 32 QD to all available NVMe devices (all devices unbound from the kernel and rebound to vfio-pci) for 60 seconds would be something like:

perf -q 32 -o 4096 -w randread -t 60
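
The unbinding/rebinding is handled by the setup script in the SPDK repo (it also reserves hugepages) - roughly:

sudo scripts/setup.sh

and sudo scripts/setup.sh reset gives the devices back to the kernel drivers when you're done.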

You can restrict the test to specific devices with the -r parameter (essentially by BUS:DEVICE:FUNCTION - example below). The tool can also benchmark kernel devices. Using -R will turn on io_uring (otherwise it uses libaio), and you simply list the block devices on the command line after the base options, like this:

perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1
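
Going back to -r: for local PCIe devices it takes a transport ID string, which if I remember the syntax right looks something like:

perf -q 32 -o 4096 -w randread -t 60 -r 'trtype:PCIe traddr:0000:04:00.0'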

You can get help from the SPDK community at https://spdk.io/community - there will be lots of people willing to lend a hand.

Excellent post by the way. I really enjoyed it.


Thanks! Will add this to TODO list too.


Yeah, this has been going on for a while. Before SPDK it was done with custom kernel bypasses and fast InfiniBand/FC arrays. I was involved with a similar project in the early 2000s, where at the time the bottleneck was the shared Xeon bus; it then moved to the PCIe bus with Opterons/Nehalem+. In our case we ended up spending a lot of time tuning the application to avoid cross-socket communication as well, since that could become a big deal (of course after careful card placement).

But SPDK has a problem you don't have with bypasses and io_uring, in that it needs the IOMMU enabled, and that can itself become a bottleneck. There are also issues for some applications that want to use interrupts rather than poll everything.

What's really nice about io_uring is that it sort of standardizes a large part of what people were doing with bypasses.


FYI, SPDK doesn't strictly require the IOMMU to be enabled - see https://spdk.io/doc/system_configuration.html. There's also a new experimental interrupt mode (not for everything) that's finding some valuable use cases in SPDK - see https://github.com/spdk/spdk/blob/master/CHANGELOG.md. Feel free to jump on the SPDK Slack channel or email list for more info on either of these: https://spdk.io/community/
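
In particular, without an IOMMU SPDK can run on uio_pci_generic (or vfio-pci in no-IOMMU mode), at the cost of using physical addresses and needing a privileged user. If I recall correctly the setup script lets you force the driver, something like:

sudo DRIVER_OVERRIDE=uio_pci_generic scripts/setup.sh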



