The io_uring solution avoids this whole effort of mapping. It doesn't have to map the already-in-RAM pages at all; it reuses a small set of buffers. So there is a lot of random, cache-miss-prone work that mmap() has to do that the io_uring solution avoids. If mmap() did this in the background it would catch up with io_uring. I'd then have to get a couple more drives to get io_uring to catch up. With enough drives I'd bet they'd be closer than you think. I still think I could get the io_uring to be faster than the mmap() even if the count never faulted, mostly because the io_uring has a smaller TLB footprint and can fit in L3 cache. But it'd be tough.
I agree that io_uring is a fundamentally more efficient approach, but I think the performance limits you're currently measuring with mmap() aren't the fundamental ones imposed by the mmap() API, and I think that's what you're saying too?
Not by a ton, but if you add up the DDR5 channel bandwidth and the PCIe lanes, on most systems the PCIe bandwidth is higher. Yes, HBM and L3 cache will be higher than the PCIe.
YES! gcc and clang don't like to optimize this. But they do if you hardcode size_bytes to an aligned value. It kind of makes sense: what if a user passes size_bytes as 3? With enough effort the compilers could handle this, but it's a lot to ask.
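For reference, a minimal sketch of the kind of loop being discussed (the byte value and signature are assumed from the test output below): with size_bytes as a runtime argument the compilers tend to emit a scalar loop, but hardcoding it to an aligned constant makes them far more willing to vectorize.

    #include <stddef.h>
    #include <stdint.h>

    /* Count occurrences of byte value 10 in a buffer. With a runtime
     * size_bytes this often stays scalar; replace size_bytes with a
     * hardcoded aligned constant and gcc/clang will happily vectorize. */
    size_t count_tens(const uint8_t *buf, size_t size_bytes)
    {
        size_t count = 0;
        for (size_t i = 0; i < size_bytes; i++)
            count += (buf[i] == 10);
        return count;
    }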
I just ran MAP_POPULATE; the results are interesting.
It speeds up the counting loop. Same speed as or higher than my read()-into-a-malloc'd-buffer tests.
HOWEVER... it takes longer overall to populate the buffer. The end result is that it's 2.5 seconds slower to run the full test compared to the original. I did not guess that one correctly.
time ./count_10_unrolled ./mnt/datafile.bin 53687091200
unrolled loop found 167802249 10s processed at 5.39 GB/s
./count_10_unrolled ./mnt/datafile.bin 53687091200 5.58s user 6.39s system 99% cpu 11.972 total
time ./count_10_populate ./mnt/datafile.bin 53687091200
unrolled loop found 167802249 10s processed at 8.99 GB/s
./count_10_populate ./mnt/datafile.bin 53687091200 5.56s user 8.99s system 99% cpu 14.551 total
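For anyone following along, a minimal sketch of the MAP_POPULATE variant being compared (file handling simplified; the real test program does more): the flag asks the kernel to fault the whole mapping in up front, so the counting loop itself takes no page faults.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* MAP_POPULATE pre-faults every page before mmap() returns. */
        uint8_t *buf = mmap(NULL, st.st_size, PROT_READ,
                            MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        size_t count = 0;
        for (off_t i = 0; i < st.st_size; i++)
            count += (buf[i] == 10);

        printf("found %zu 10s\n", count);
        munmap(buf, st.st_size);
        close(fd);
        return 0;
    }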
What issue are you having? Are you receiving an error? This is the kind of question that StackOverflow or perhaps an LLM might be able to help you with. I highly suggest reading the documentation for mmap to understand what issues could happen and/or what a given specific error code might indicate; see the NOTES section:
> Huge page (Huge TLB) mappings
> For mappings that employ huge pages, the requirements for the arguments of mmap() and munmap() differ somewhat from the requirements for mappings that use the native system page size.
> For mmap(), offset must be a multiple of the underlying huge page size. The system automatically aligns length to be a multiple of the underlying huge page size.
Ensure that the file is at least the page size, and preferably sized to align with a page boundary. Then ensure that the length parameter (size_bytes in your example) is aligned to that boundary as well.
There are also other important things to understand for these flags, which are described in the documentation, such as the information available from /sys/kernel/mm/hugepages.
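As a sketch of what that alignment looks like in practice (assuming the common 2 MiB huge page size; the actual sizes on a given machine are listed under /sys/kernel/mm/hugepages):

    #include <stdint.h>

    #define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* assumed 2 MiB huge pages */

    /* Round a requested mapping length up to a huge page boundary.
     * The offset passed to mmap() must already be a multiple of this size. */
    static inline uint64_t round_up_huge(uint64_t len)
    {
        return (len + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
    }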
The original blog post title is intentionally clickbaity. You know, to bait people into clicking. Also I do want to challenge people to really think here.
Seeing if the cached file data can be accessed quickly is the point of the experiment. I can't get mmap() to open a file with huge pages.
MAP_HUGETLB can't be used for mmapping files on disk; it can only be used with MAP_ANONYMOUS, with a memfd, or with a file on a hugetlbfs pseudo-filesystem (which is also in memory).
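A minimal sketch of one of the allowed combinations (anonymous memory; assumes huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages, otherwise the mmap() fails with ENOMEM):

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;   /* one 2 MiB huge page (assumed default size) */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("huge page mapping at %p\n", p);
        munmap(p, len);
        return 0;
    }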
This is quite interesting since I, too, was under the impression that mmap cannot be used on disk-backed files with huge pages. I tried and failed to find any official kernel documentation around this, but I clearly remember trying to do this at work (on a regular ECS machine with Ubuntu) and getting errors.
Based on this SO discussion [1], it is possibly a limitation with popular filesystems like ext4?
If anyone knows more about this, I'd love to know what exactly are the requirements for using hugepages this way.
Cool! Thanks for the example. The aforementioned work thing requires MAP_SHARED as well, which IIRC is the reason it would fail when used together with files and huge pages, but private mappings work as you show.
Trying to google this I found https://lwn.net/Articles/718102/ which suggests that there was discussion about it back in 2017. But I can't find anything else about it except a patchset that I guess wasn't merged (?). So maybe it was just a proposal that never made it in.
Honestly I never knew any of this; I thought huge pages just worked for all of mmap().
Read the man pages; there are restrictions on using the huge page option with mmap() that mean it won't do what you might intuit it will in many cases. Getting reliable huge page mappings is a bit fussy on Linux. It is easier to control in a direct I/O context.
It's not even about clickbait for me, but I really don't want to go parse an article to figure out what is meant by "Memory is slow, Disk is fast". You want "clickbait" to make people click and think; we want descriptive titles to know what the article is about before we read it. That used to be the original purpose of titles, and we like it that way.
It's as if you labeled your food product "you won't believe this" and forced customers to figure out what it is from the ingredients list.
I get that. But I do actually show a scenario where accessing data from memory using a very standard mechanism IS slower than a newer but equally standard way of accessing data from an NVMe drive.
"Accessing memory is slower in some circumstances than direct disk access"
Like people mention, hugetlb etc. could be an improvement, but the core issue holding it down probably has to do with mmap, 4k pages, and paging behaviours: mmap will cause a fault for each "small" 4k page not in memory, causing a jump into the kernel and then whatever machinery is needed to fill in the page cache (and bring the data up from disk with the associated latency).
This is in contrast with the io_uring worker method, where you keep the thread busy by submitting requests and letting the kernel do the work without expensive crossings.
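Roughly, a sketch of that submit-and-reap pattern using liburing (queue depth, block size, and buffer handling are assumed; the resubmission loop and the actual counting work are elided):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define QUEUE_DEPTH 8
    #define BLOCK_SIZE  (1 << 20)   /* 1 MiB per request (assumed) */

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) return 1;

        /* Keep QUEUE_DEPTH reads in flight against a small set of reused
         * buffers instead of faulting pages in one at a time. */
        char *bufs[QUEUE_DEPTH];
        off_t offset = 0;
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            bufs[i] = malloc(BLOCK_SIZE);
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BLOCK_SIZE, offset);
            io_uring_sqe_set_data(sqe, bufs[i]);
            offset += BLOCK_SIZE;
        }
        io_uring_submit(&ring);

        /* Reap one completion, process its buffer, then resubmit it at the
         * next offset; repeat until the file is consumed (loop elided). */
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) == 0) {
            char *buf = io_uring_cqe_get_data(cqe);
            (void)buf;   /* count the 10s here */
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }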
The 2GB fully in-mem run shows the CPU's real perf. The dip at 50GB is interesting; perhaps when going over 50% of memory the Linux kernel evicts pages or something similar that is hurting perf. Maybe plot a graph of perf vs. test size to see if there is an obvious cliff.
When I run the 50GB in-mem setup I still have 40GB+ of free memory. I drop the page cache before I run ("sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'"), so there wouldn't really be anything to evict from the page cache, and swap isn't changing.
I think I'm crossing the NUMA boundary, which means some percentage of the accesses are higher latency.
The in-memory solution creates a 2nd copy of the data so 50GB doesn't fit in memory anymore. The kernel is forced to drop and then reload part of the cached file.
I just saw this post so am starting with Part 1. Could you replace the charts with ones on some sort of log scale? It makes it look like nothing happened until 2010, but I'd wager it's just an optical illusion...
And, even better, put all the lines on the same chart, or at least with the same y-axis scale (perhaps make them all relative to their base on the left), so that we can see the relative rate of growth?
I tried the log scale before. The charts failed to express the exponential hockey-stick growth unless you really spend time with them and know what a log scale is. I'll work on incorporating log scale due to popular demand. They do show the progress has been nice and exponential over time.
When I put the lines on the same chart it made the y axis impossible to understand. The units are so different. Maybe I'll revisit that.
Yeah, around 2000-2010 the doubling is noticeable. Interestingly, it's also when a lot of factors started to stagnate.
The hockey stick growth is the entire problem - it's an optical illusion resulting from the fact that going from 100 to 200 is the same rate as 200 to 400. And 800, 1600. You understand exponents.
A log axis solves this, and turns meaningless hockey sticks into a generally straightish line that you can actually parse. If it still deviates from straight, then you really know there are true changes in the trendline.
Lines on the same chart can all be divided by their initial value, anchoring them all at 1. Sometimes they're still a mess, but it's always worth a try.
You're enormously knowledgeable and the posts were fascinating. But this is stats 101. Not doing this sort of thing, especially explicitly in favour of showing a hockey stick, undermines the fantastic analysis.
The PCIe bus and memory bus both originate from the processor or IO die of the "CPU". When you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to the memory, so you can direct it to cache and bypass sending it out on the DRAM bus.
In theory... the specifics of what is supported exactly? I can't vouch for that.
I’d be fascinated to see a comparison with SPDK. That bypasses the kernel’s NVMe / SSD driver and controls the whole device from user space - which is supposed to avoid a lot of copies and overhead.
You might be able to set up SPDK to send data directly into the cpu cache? It’s one of those things I’ve wanted to play with for years but honestly I don’t know enough about it.
SPDK and I go way back. I'm confident it'd be about the same, possibly ~200-300MB/s more; I was pretty close to the rated throughput of the drives. io_uring has really closed the gap that used to exist between the in-kernel and userspace solutions.
With the Intel connection they might have explicit support for DDIO. Good idea.
SPDK will be able to fully saturate the PCIe bandwidth from a single CPU core here (no secret 6 threads inside the kernel). The drives are your bottleneck so it won't go faster, but it can use a lot less CPU.
But with SPDK you'll be talking to the disk, not to files. If you changed io_uring to read from the disk directly with O_DIRECT, you wouldn't have those extra 6 threads either. SPDK would still be considerably more CPU efficient but not 6x.
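For illustration, the requirements that come with O_DIRECT, shown here with a plain pread(); the same open flag and aligned buffers apply when the reads go through io_uring instead. The 4096-byte alignment is an assumption; the real requirement is the device's logical block size.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;

        /* O_DIRECT bypasses the page cache entirely. */
        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* Buffer address, length, and file offset must all be aligned. */
        void *buf;
        size_t len = 1 << 20;                 /* 1 MiB, a multiple of 4096 */
        if (posix_memalign(&buf, 4096, len)) return 1;

        ssize_t n = pread(fd, buf, len, 0);
        printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }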
DDIO is a pure hardware feature. Software doesn't need to do anything to support it.
Oh man... I'd have to look into that. Off the top of my head I don't know how you'd make that happen. Way back when, I'd have said no. Now, with all the folio updates to the Linux kernel memory handling, I'm not sure. I think you'd have to take care to make sure the data gets into the page cache as huge pages. If not, then when you tried to madvise() (or whatever) the buffer to use huge pages, it would likely just ignore you. In theory it could aggregate the small pages into huge pages, but that would be more latency-bound work and it's not clear how that impacts the page cache.
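The madvise() hint being referred to would look roughly like this; whether the kernel honors it for a page-cache-backed mapping depends on kernel version and filesystem support, which is exactly the open question here.

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* addr/len are an existing file-backed mapping; both should be aligned
     * to the huge page size for the hint to have any effect. MADV_HUGEPAGE
     * only asks for transparent huge pages; the kernel is free to ignore it. */
    static int hint_huge(void *addr, size_t len)
    {
        return madvise(addr, len, MADV_HUGEPAGE);
    }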
But the arm64 systems with 16K or 64K native pages would have fewer faults.
It's right in the documentation for mmap() [0]! And, from my experience, using it with an 800GB file provided a significant speed-up, so I do believe the documentation is correct ;)
And, you can poke around in the linux kernel's source code to determine how it works. I had a related issue that I ended up digging around to find the answer to: what happens if you use mremap() to expand the mapping and it fails; is the old mapping still valid or not? Answer: it's still valid. I found that it was actually fairly easy to read linux kernel C code, compared to a lot (!) of other C libraries I've tried to understand.
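A small sketch of that behaviour (sizes are arbitrary, and the flags argument is left at 0 so the call fails whenever the mapping can't grow in place):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t old_len = 4096, new_len = 1UL << 30;
        char *p = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        strcpy(p, "still here");

        /* Without MREMAP_MAYMOVE this fails if the region can't expand
         * in place; on failure the original mapping is left untouched. */
        void *q = mremap(p, old_len, new_len, 0);
        if (q == MAP_FAILED) {
            printf("mremap failed, old mapping says: %s\n", p);
            munmap(p, old_len);
        } else {
            munmap(q, new_len);
        }
        return 0;
    }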