Oops, O_DIRECT does not actually make that big of a difference. I had updated my ad-hoc test to use O_DIRECT, but didn't check that write() now returned errors because of wrong alignment ;-)
As mentioned in the sibling comment, syncs are still slow. My initial 1-2ms number came from a desktop I bought in 2018, to which I added an NVME drive connected to an M.1 slot in 2022. On my current test system I'm seeing avg latencies of around 250us, sometimes a lot more (there a fluctuations).
# put the following in a file "fio.job" and run "fio fio.job"
# enable either direct=1 (O_DIRECT) or fsync=1 (fsync() after each write())
[Job1]
#direct=1
fsync=1
readwrite=randwrite
bs=64k # size of each write()
size=256m # total size written
Add sync=1 to your fio O_DIRECT write tests (not fsync, but sync=1) and you’ll see a big difference on consumer SSDs without power loss protection for their controller buffers. It adds the FUA flag (force unit access) to the write requests to ensure persistence of your writes, O_DIRECT alone won’t do that