Yeah, that's what I wondered - I'm OK with using multiple cores; would I get even more IOPS when doing smaller I/Os? Is the benchmark suite you used part of the SPDK toolkit (and is it easy enough to run)?
Whether you get more IOPS with smaller I/Os depends on a number of things. Most drives these days natively use 4KiB blocks and emulate 512B sectors for backward compatibility. That emulation means 512B writes are often quite slow - probably slower than writing 4KiB (with 4KiB alignment) - but 512B reads are typically very fast. On Optane drives this may not hold because the media works entirely differently; those may be able to do native 512B writes. Talk to the device vendor to get the real answer.
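One quick way to check, short of asking the vendor: nvme-cli (not part of SPDK, but widely packaged) can dump a namespace's supported LBA formats in human-readable form. The device path here is just an example:

sudo nvme id-ns /dev/nvme0n1 -H

The output lists each LBA Format with its data size (512 or 4096 bytes), marks which one is currently in use, and includes the drive's own relative-performance hint for each format.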
For reads at least, if you don't hit a CPU limit you'll get 8x more IOPS with 512B than with 4KiB in SPDK - it's more or less perfect scaling. There's some additional hardware overhead in the MMU and PCIe subsystems with 512B because you're sending more messages for the same bandwidth, but in my experience it's mostly negligible.
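The 8x falls straight out of the bandwidth math - the drive moves the same number of bytes either way, just in more, smaller messages:

1,000,000 IOPS × 4096 B ≈ 4.1 GB/s
8,000,000 IOPS × 512 B ≈ 4.1 GB/s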
The benchmark builds to build/examples/perf and you can just run it with -h to get the help output. Random 4KiB reads at 32 QD to all available NVMe devices (all devices unbound from the kernel and rebound to vfio-pci) for 60 seconds would be something like:
perf -q 32 -o 4096 -w randread -t 60
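(That assumes the devices are already detached from the kernel driver - the setup script in the SPDK repo does the vfio-pci rebinding for you.) And since you asked about cores: perf takes a hex core mask to spread submission/completion across cores - I believe the option is -c, so two cores would be 0x3. Dropping -o to 512 gives you the small-I/O comparison run:

sudo scripts/setup.sh
perf -q 32 -o 4096 -w randread -t 60 -c 0x3
perf -q 32 -o 512 -w randread -t 60 -c 0x3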
You can restrict the test to specific devices with the -r parameter (essentially by BUS:DEVICE:FUNCTION). The tool can also benchmark kernel devices: -R turns on io_uring (otherwise it uses libaio), and you simply list the block devices on the command line after the base options, like this (the device paths below are just examples):
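perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1 /dev/nvme1n1

For -r, if I'm remembering the syntax right, the address goes in as a transport ID string rather than a bare PCI address, e.g. -r 'trtype:PCIe traddr:0000:04:00.0'.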