
> he claimed that because MonetDB focuses on read-only OLAP workloads, then MMAP was good enough

ClickHouse has switchable IO engines:

- read
- pread
- mmap
- pread_threadpool

The best performance is achieved with the combined method, named 'pread_threadpool'. It uses the `preadv2` syscall to check if the file is in the page cache.

More details here: https://github.com/ClickHouse/ClickHouse/pull/26791

There was also extensive research into performance on HDD, SSD, and Optane, comparing various methods: https://clickhouse.com/blog/a-journey-to-io_uring-aio-and-mo...

And there is a pull request adding io_uring: https://github.com/ClickHouse/ClickHouse/pull/38456 - Unfortunately, it's unfinished and cannot be merged because the CI has found bugs. Nevertheless, the advantages of io_uring for analytical databases are negligible.

The `mmap` method is still useful. For example, for data import. See here: https://github.com/ClickHouse/ClickHouse/pull/43927



> The best performance is achieved with the combined method, named 'pread_threadpool'.

I found the same when testing my own database engine. `pread` in a suitably well-designed threadpool outperformed every other option for NVMe random-access 4k reads not in cache.

Variations in the number and type of locks, in work queuing and ordering, and in when to start and stop threads and how many to run also made a considerable difference. In certain system configs, `CLONE_IO` made a difference. I use tiny (smaller than 1 page) userspace stacks with `clone`-based threads, and dynamic auto-tuning of the number of blocked and executing threads.

> It uses the `preadv2` syscall to check if the file is in the page cache. More details here: https://github.com/ClickHouse/ClickHouse/pull/26791

That's `preadv2` with the `RWF_NOWAIT` flag: it reads data synchronously from the page cache if present, before handing the read to the thread pool to do asynchronously if it isn't. To my surprise, it proved slower when I tried it in my db engine.

I expected an average speedup when there are many cache hits, so I was surprised and disappointed to find the `preadv2(..,RWF_NOWAIT)` syscall to be slow enough that it was usually a performance loss overall to use it, at least on the kernel versions and hardware I tested on (a reasonably fast software-RAID NVMe).

A nicer way to look at it is that the auto-tuning thread pool was sleek enough that the asynchronous read path was nearly as fast as a synchronous one, leaving too little benefit from the synchronous path to be worth the extra syscall.

One not in your list is mmap_threadpool. I found for many workloads that it was faster than pread_threadpool, and of course it does a better job of sharing memory with the kernel. Unlike synchronous mmap, it is effectively an asynchronous read: a pool thread takes the page fault instead of issuing a read syscall, so the main thread is not blocked and the device I/O queues are kept full enough.

> And there is a pull request adding io_uring: https://github.com/ClickHouse/ClickHouse/pull/38456 [..] Nevertheless, the advantages of `io_uring` for analytical databases are negligible.

Compared with the `pread_threadpool` equivalent in my db engine, I found `io_uring` was sometimes similar, sometimes slower, and never better, so it's not the preferred default. It makes sense that it could almost reach the devices' I/O capability, though with less control over queue depths than doing it directly in threads.

But I was surprised that the "zero-syscall" queues of `io_uring` didn't provide a noticeable improvement over `pread` syscalls. I measure a considerable baseline overhead for all syscalls, including `pread`, `preadv2` and `futex`, and that overhead has a significant effect on throughput in my `pread_threadpool` equivalent, because the NVMe devices were fast enough for syscall cost to matter.

> - Unfortunately, it's [io_uring] unfinished and cannot be merged because the CI has found bugs.

I found what I think is a subtle memory barrier bug in `liburing`. If the ClickHouse implementation is using `liburing` or copying its methods, it's conceivable that may be the cause of the hangs seen in CI. There are also kernel versions where `io_uring` was buggy, evidenced by changes in later kernels to fix bugs.





