The io_uring solution avoids this whole effort of mapping. It doesn't have to map the already-in-RAM pages at all; it reuses a small set of buffers. So there is a lot of random, cache-miss-prone work that mmap() has to do that the io_uring solution avoids. If mmap() did this in the background it would catch up with io_uring. I'd then have to get a couple more drives to get io_uring to catch up. With enough drives I'd bet they'd be closer than you think. I still think I could get the io_uring to be faster than the mmap() even if the count never faulted, mostly because the io_uring has a smaller TLB footprint and can fit in L3 cache. But it'd be tough.
I agree that io_uring is a fundamentally more efficient approach, but I think the performance limits you're currently measuring with mmap() aren't the fundamental ones imposed by the mmap() API, and I think that's what you're saying too?
Not by a ton, but if you add up the DDR5 channel bandwidth and the PCIe lanes, on most systems the PCIe bandwidth is higher. Yes, HBM and L3 cache will be higher than the PCIe.
YES! gcc and clang don't like to optimize this. But they do if you hardcode size_bytes to an aligned value. It kind of makes sense: what if a user passes size_bytes as 3? With enough effort the compilers could handle this, but it's a lot to ask.
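For reference, a minimal sketch of the kind of loop being discussed (the byte value and signature are assumed from the test output below): with size_bytes as a runtime argument the compilers tend to emit a scalar loop, but hardcoding it to an aligned constant makes them far more willing to vectorize.

    #include <stddef.h>
    #include <stdint.h>

    /* Count occurrences of byte value 10 in a buffer. With a runtime
     * size_bytes this often stays scalar; replace size_bytes with a
     * hardcoded aligned constant and gcc/clang will happily vectorize. */
    size_t count_tens(const uint8_t *buf, size_t size_bytes)
    {
        size_t count = 0;
        for (size_t i = 0; i < size_bytes; i++)
            count += (buf[i] == 10);
        return count;
    }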
I just ran MAP_POPULATE; the results are interesting.
It speeds up the counting loop. Same speed as or higher than my read()-into-a-malloc'd-buffer tests.
HOWEVER... it takes longer overall to populate the buffer. The end result is that it's 2.5 seconds slower to run the full test compared to the original. I did not guess that one correctly.
time ./count_10_unrolled ./mnt/datafile.bin 53687091200
unrolled loop found 167802249 10s processed at 5.39 GB/s
./count_10_unrolled ./mnt/datafile.bin 53687091200 5.58s user 6.39s system 99% cpu 11.972 total
time ./count_10_populate ./mnt/datafile.bin 53687091200
unrolled loop found 167802249 10s processed at 8.99 GB/s
./count_10_populate ./mnt/datafile.bin 53687091200 5.56s user 8.99s system 99% cpu 14.551 total
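For anyone following along, a minimal sketch of the MAP_POPULATE variant being compared (file handling simplified; the real test program does more): the flag asks the kernel to fault the whole mapping in up front, so the counting loop itself takes no page faults.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* MAP_POPULATE pre-faults every page before mmap() returns. */
        uint8_t *buf = mmap(NULL, st.st_size, PROT_READ,
                            MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        size_t count = 0;
        for (off_t i = 0; i < st.st_size; i++)
            count += (buf[i] == 10);

        printf("found %zu 10s\n", count);
        munmap(buf, st.st_size);
        close(fd);
        return 0;
    }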
What issue are you having? Are you receiving an error? This is the kind of question that StackOverflow or perhaps an LLM might be able to help you with. I highly suggest reading the documentation for mmap to understand what issues could happen and/or what a given specific error code might indicate; see the NOTES section:
> Huge page (Huge TLB) mappings
> For mappings that employ huge pages, the requirements for the arguments of mmap() and munmap() differ somewhat from the requirements for mappings that use the native system page size.
> For mmap(), offset must be a multiple of the underlying huge page size. The system automatically aligns length to be a multiple of the underlying huge page size.
Ensure that the file is at least the page size, and preferably sized to align with a page boundary. Then ensure that the length parameter (size_bytes in your example) is aligned to that boundary as well.
There are also other important things to understand for these flags, which are described in the documentation, such as the information available from /sys/kernel/mm/hugepages.
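As a sketch of what that alignment looks like in practice (assuming the common 2 MiB huge page size; the actual sizes on a given machine are listed under /sys/kernel/mm/hugepages):

    #include <stdint.h>

    #define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* assumed 2 MiB huge pages */

    /* Round a requested mapping length up to a huge page boundary.
     * The offset passed to mmap() must already be a multiple of this size. */
    static inline uint64_t round_up_huge(uint64_t len)
    {
        return (len + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
    }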
The original blog post title is intentionally clickbaity. You know, to bait people into clicking. Also I do want to challenge people to really think here.
Seeing if the cached file data can be accessed quickly is the point of the experiment. I can't get mmap() to open a file with huge pages.
MAP_HUGETLB can't be used for mmapping files on disk; it can only be used with MAP_ANONYMOUS, with a memfd, or with a file on a hugetlbfs pseudo-filesystem (which is also in memory).
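A minimal sketch of one of the allowed combinations (anonymous memory; assumes huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages, otherwise the mmap() fails with ENOMEM):

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;   /* one 2 MiB huge page (assumed default size) */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("huge page mapping at %p\n", p);
        munmap(p, len);
        return 0;
    }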
This is quite interesting since I, too, was under the impression that mmap cannot be used on disk-backed files with huge pages. I tried and failed to find any official kernel documentation around this, but I clearly remember trying to do this at work (on a regular ECS machine with Ubuntu) and getting errors.
Based on this SO discussion [1], it is possibly a limitation with popular filesystems like ext4?
If anyone knows more about this, I'd love to know what exactly are the requirements for using hugepages this way.
Cool! Thanks for the example. The aforementioned work thing requires MAP_SHARED as well, which IIRC is the reason it would fail when used together with files and huge pages, but private mappings work as you show.
Trying to google this I found https://lwn.net/Articles/718102/ which suggests that there was discussion about it back in 2017. But I can't find anything else about it except a patchset that I guess wasn't merged (?). So maybe it was just a proposal that never made it in.
Honestly I never knew any of this; I thought huge pages just worked for all of mmap().
Read the man pages; there are restrictions on using the huge page option with mmap() that mean it won't do what you might intuit it will in many cases. Getting reliable huge page mappings is a bit fussy on Linux. It is easier to control in a direct I/O context.
It's not even about clickbait for me, but I really don't want to go parse an article to figure out what is meant by "Memory is slow, Disk is fast". You want "clickbait" to make people click and think; we want descriptive titles to know what the article is about before we read it. That used to be the original purpose of titles, and we like it that way.
It's as if you labeled your food product "you won't believe this" and forced customers to figure out what it is from the ingredients list.
I get that. But I do actually show a scenario where accessing data from memory using a very standard mechanism IS slower than a newer but equally standard way of accessing data from an NVMe drive.
"Accessing memory is slower in some circumstances than direct disk access"
Like people mention, hugetlb etc. could be an improvement, but the core issue holding it down probably has to do with mmap, 4k pages, and paging behaviours: mmap will cause a fault for each "small" 4k page not in memory, causing a jump into the kernel and then whatever machinery is needed to fill in the page cache (and bring the data up from disk with the associated latency).
This is in contrast with the io_uring worker method, where you keep the thread busy by submitting requests and letting the kernel do the work without expensive crossings.
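Roughly, a sketch of that submit-and-reap pattern using liburing (queue depth, block size, and buffer handling are assumed; the resubmission loop and the actual counting work are elided):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define QUEUE_DEPTH 8
    #define BLOCK_SIZE  (1 << 20)   /* 1 MiB per request (assumed) */

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) return 1;

        /* Keep QUEUE_DEPTH reads in flight against a small set of reused
         * buffers instead of faulting pages in one at a time. */
        char *bufs[QUEUE_DEPTH];
        off_t offset = 0;
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            bufs[i] = malloc(BLOCK_SIZE);
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BLOCK_SIZE, offset);
            io_uring_sqe_set_data(sqe, bufs[i]);
            offset += BLOCK_SIZE;
        }
        io_uring_submit(&ring);

        /* Reap one completion, process its buffer, then resubmit it at the
         * next offset; repeat until the file is consumed (loop elided). */
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) == 0) {
            char *buf = io_uring_cqe_get_data(cqe);
            (void)buf;   /* count the 10s here */
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }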
The 2GB fully in-mem run shows the CPU's real perf. The dip at 50GB is interesting; perhaps when going over 50% of memory the Linux kernel evicts pages or something similar that is hurting perf. Maybe plot a graph of perf vs. test size to see if there is an obvious cliff.
When I run the 50GB in-mem setup I still have 40GB+ of free memory. I drop the page cache before I run ("sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'"), so there wouldn't really be anything to evict from the page cache, and swap isn't changing.
I think I'm crossing the NUMA boundary, which means some percentage of the accesses are higher latency.
The in-memory solution creates a 2nd copy of the data so 50GB doesn't fit in memory anymore. The kernel is forced to drop and then reload part of the cached file.
I just saw this post so am starting with Part 1. Could you replace the charts with ones on some sort of log scale? It makes it look like nothing happened until 2010, but I'd wager it's just an optical illusion...
And, even better, put all the lines on the same chart, or at least with the same y-axis scale (perhaps make them all relative to their base on the left), so that we can see the relative rate of growth?
I tried the log scale before. The charts failed to express the exponential hockey-stick growth unless you really spend time with them and know what a log scale is. I'll work on incorporating log scale due to popular demand. They do show the progress has been nice and exponential over time.
When I put the lines on the same chart it made the y axis impossible to understand. The units are so different. Maybe I'll revisit that.
Yeah, around 2000-2010 the doubling is noticeable. Interestingly, it's also when a lot of factors started to stagnate.
The hockey stick growth is the entire problem - it's an optical illusion resulting from the fact that going from 100 to 200 is the same rate as 200 to 400. And 800, 1600. You understand exponents.
A log axis solves this, and turns meaningless hockey sticks into a generally straightish line that you can actually parse. If it still deviates from straight, then you really know there are true changes in the trendline.
Lines on the same chart can all be divided by their initial value, anchoring them all at 1. Sometimes they're still a mess, but it's always worth a try.
You're enormously knowledgeable and the posts were fascinating. But this is stats 101. Not doing this sort of thing, especially explicitly in favour of showing a hockey stick, undermines the fantastic analysis.
The PCIe bus and memory bus both originate from the processor or IO die of the "CPU". When you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to the memory, so you can direct it to cache and bypass sending it out on the DRAM bus.
In theory... the specifics of what is supported exactly? I can't vouch for that.
I’d be fascinated to see a comparison with SPDK. That bypasses the kernel’s NVMe / SSD driver and controls the whole device from user space - which is supposed to avoid a lot of copies and overhead.
You might be able to set up SPDK to send data directly into the cpu cache? It’s one of those things I’ve wanted to play with for years but honestly I don’t know enough about it.
SPDK and I go way back. I'm confident it'd be about the same, possibly ~200-300MB/s more; I was pretty close to the rated throughput of the drives. io_uring has really closed the gap that used to exist between the in-kernel and userspace solutions.
With the Intel connection they might have explicit support for DDIO. Good idea.
SPDK will be able to fully saturate the PCIe bandwidth from a single CPU core here (no secret 6 threads inside the kernel). The drives are your bottleneck so it won't go faster, but it can use a lot less CPU.
But with SPDK you'll be talking to the disk, not to files. If you changed io_uring to read from the disk directly with O_DIRECT, you wouldn't have those extra 6 threads either. SPDK would still be considerably more CPU efficient but not 6x.
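For illustration, the requirements that come with O_DIRECT, shown here with a plain pread(); the same open flag and aligned buffers apply when the reads go through io_uring instead. The 4096-byte alignment is an assumption; the real requirement is the device's logical block size.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;

        /* O_DIRECT bypasses the page cache entirely. */
        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* Buffer address, length, and file offset must all be aligned. */
        void *buf;
        size_t len = 1 << 20;                 /* 1 MiB, a multiple of 4096 */
        if (posix_memalign(&buf, 4096, len)) return 1;

        ssize_t n = pread(fd, buf, len, 0);
        printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }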
DDIO is a pure hardware feature. Software doesn't need to do anything to support it.
Oh man... I'd have to look into that. Off the top of my head I don't know how you'd make that happen. Way back when, I'd have said no. Now, with all the folio updates to the Linux kernel memory handling, I'm not sure. I think you'd have to take care to make sure the data gets into the page cache as huge pages. If not, then when you tried to madvise() (or whatever) the buffer to use huge pages, it would likely just ignore you. In theory it could aggregate the small pages into huge pages, but that would be more latency-bound work and it's not clear how that impacts the page cache.
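The madvise() hint being referred to would look roughly like this; whether the kernel honors it for a page-cache-backed mapping depends on kernel version and filesystem support, which is exactly the open question here.

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* addr/len are an existing file-backed mapping; both should be aligned
     * to the huge page size for the hint to have any effect. MADV_HUGEPAGE
     * only asks for transparent huge pages; the kernel is free to ignore it. */
    static int hint_huge(void *addr, size_t len)
    {
        return madvise(addr, len, MADV_HUGEPAGE);
    }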
But the arm64 systems with 16K or 64K native pages would have fewer faults.
It's right in the documentation for mmap() [0]! And, from my experience, using it with an 800GB file provided a significant speed-up, so I do believe the documentation is correct ;)
And, you can poke around in the linux kernel's source code to determine how it works. I had a related issue that I ended up digging around to find the answer to: what happens if you use mremap() to expand the mapping and it fails; is the old mapping still valid or not? Answer: it's still valid. I found that it was actually fairly easy to read linux kernel C code, compared to a lot (!) of other C libraries I've tried to understand.
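A small sketch of that behaviour (sizes are arbitrary, and the flags argument is left at 0 so the call fails whenever the mapping can't grow in place):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t old_len = 4096, new_len = 1UL << 30;
        char *p = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        strcpy(p, "still here");

        /* Without MREMAP_MAYMOVE this fails if the region can't expand
         * in place; on failure the original mapping is left untouched. */
        void *q = mremap(p, old_len, new_len, 0);
        if (q == MAP_FAILED) {
            printf("mremap failed, old mapping says: %s\n", p);
            munmap(p, old_len);
        } else {
            munmap(q, new_len);
        }
        return 0;
    }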