1G huge pages had (have?) performance benefits on managed runtimes for certain scenarios (both the JIT code cache and the GC space saw uplift on the SpecJ benchmarks, if I recall correctly)
If using relatively large quantities of memory, 2M pages should enable much higher TLB hit rates, assuming the CPU doesn't do something silly like only having 4 slots for pages larger than 4k ¬.¬
Because of the speed-up in virtual address translation [1]. When a program makes a memory access, the CPU must first translate the virtual address to a physical address by walking a hierarchical data structure called a page table [2]. Walking the page tables is slow, so CPUs implement a small on-CPU cache of virtual-to-physical translations called a TLB [1]. The TLB has a limited number of entries for each page size. With 4 KiB pages, the contention on this cache is very high, especially if the workload has a very large working set, causing frequent evictions and slow page-table walks. With 2 MiB or 1 GiB pages, there is less contention and more of the working set is covered by the TLB. For example, with 4 KiB pages, a TLB with 1024 entries can cover at most 4 MiB of the working set; with 2 MiB pages, it can cover up to 2 GiB. Often, the CPU has a different number of entries for each page size.
However, it is known that larger page sizes have higher internal fragmentation and thus lead to memory wastage. It's a trade-off. But generally speaking, for modern systems, the overhead of managing memory in 4 KiB pages is very high, and we are at a point where switching to 16/64 KiB is almost always a win. 2 MiB is still a bit of a stretch, though transparent 2 MiB pages for heap memory are enabled by default on most major Linux distributions, aka THP [2]
Source: my PhD is on memory management and address translation on large memory systems, having worked both on hardware architecture of address translation and TLBs as well as the Linux kernel. I'm happy to talk about this all day!
I'd like to ask how applications should change their memory allocation or usage patterns to maximise the benefit of THP. Do memory allocators (glibc mainly) need config tweaking to coalesce tiny mallocs into 2MB+ mmaps, will they just always do that automatically, do you need to use a custom pool allocator so you're doing large allocations, or are you never going to get the full benefit of huge pages without madvise/libhugetlbfs? And does this apply to Mac/Windows/*BSD at all?
[Edit: ouch, I see /sys/kernel/mm/transparent_hugepage/enabled is set to 'madvise' by default on my system (Slackware), and as a result it's doing nearly nothing. But I saw it enabled in the past. Well, that answers a lot of my questions: got to use madvise/libhugetlbfs.]
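If I understand correctly, the minimal pattern looks roughly like the sketch below (sizes are illustrative, error handling trimmed; only the 2 MiB-aligned part of the range can actually get huge pages):

    /* Hint the kernel to back an anonymous mapping with transparent huge
     * pages; this works even when the sysfs knob above is set to "madvise". */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HPAGE (2UL * 1024 * 1024)   /* assuming 2 MiB huge pages */

    int main(void)
    {
        size_t len = 64 * HPAGE;        /* 128 MiB, a multiple of 2 MiB */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ask for THP on this range. */
        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        memset(p, 0, len);              /* first touch allocates the pages */
        printf("mapped %zu bytes at %p; see AnonHugePages in /proc/self/smaps\n",
               len, p);
        munmap(p, len);
        return 0;
    }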
I read you also need to ensure ELF segments are properly aligned to get transparent huge pages for code/data.
Another question. From your link [2]:
> An application may mmap a large region but only touch 1 byte of it, in that case a 2M page might be allocated instead of a 4k page for no good.
Do the heuristics used by Linux THP (khugepaged) really allow completely ignoring whether pages have actually been page-faulted in or even initialised? Or is this a possibility that's unlikely to happen in practice?
In current Linux systems, there are two main ways to benefit from huge pages. 1) There is the explicit, user-managed approach via hugetlbfs. That's not very common. 2) Transparently managed by the kernel via THP (userspace is completely unaware, and any application using mmap() and malloc() can benefit from it).
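As a rough sketch of what route (1) looks like (illustrative only: it assumes the admin has reserved huge pages beforehand, e.g. via the vm.nr_hugepages sysctl, otherwise the mmap simply fails):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4UL * 1024 * 1024;   /* 4 MiB = two 2 MiB huge pages */

        /* Explicitly request hugetlbfs-backed memory. Unlike THP, this
         * fails outright (typically ENOMEM) if no huge pages are reserved. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

        memset(p, 0, len);
        printf("got %zu bytes backed by reserved huge pages at %p\n", len, p);
        munmap(p, len);
        return 0;
    }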
As I mentioned before, most major Linux distributions ship with THP enabled by default. THP automatically allocates huge pages for mmap memory whenever possible (that is, when the region is at least 2 MiB in size and 2 MiB aligned). There is also a separate kernel thread, khugepaged, that opportunistically tries to coalesce/promote base 4 KiB pages into 2 MiB huge pages whenever possible.
Library support is not really required for THP, but an unaware library can be detrimental to THP's performance and availability in the long run. A library that is not aware of kernel huge pages may employ suboptimal memory management strategies, resulting in inefficient utilization, for example by unintentionally breaking those huge pages (e.g. via unaligned unmapping), or by failing to release them to the OS as one full unit, undermining their availability. AFAIK, TCMalloc from Google is the only allocator with extensive huge page awareness [1].
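To make the "breaking huge pages" point concrete, here's a contrived sketch of the failure mode, i.e. the kind of thing a huge-page-unaware allocator can end up doing when it returns a small chunk to the OS (error checks omitted):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    #define HPAGE (2UL * 1024 * 1024)

    int main(void)
    {
        /* A region we hope gets backed by 2 MiB pages... */
        void *p = mmap(NULL, 8 * HPAGE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        madvise(p, 8 * HPAGE, MADV_HUGEPAGE);
        memset(p, 0, 8 * HPAGE);

        /* ...and then a 4 KiB hole is punched in the middle of it. The kernel
         * has to split the surrounding 2 MiB page back into 4 KiB pages to
         * honour this, and the TLB benefit for that part of the heap is gone. */
        munmap((char *)p + 3 * HPAGE + 64 * 4096, 4096);

        return 0;
    }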
> Do the heuristics used by Linux THP (khugepaged) really allow completely ignoring whether pages have actually been page-faulted in or even initialised? Or is this a possibility that's unlikely to happen in practice?
Linux allocates huge pages on first touch. As for khugepaged, it only coalesces the pages if all the base pages covering the 2 MiB virtual region exist in some form (not necessarily faulted in; for example, some of those base pages could be in swap, in which case Linux first faults them in and then migrates them).
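If you want to see what actually happened for your process, the accounting is exposed in /proc/self/smaps. A rough sketch that just sums the AnonHugePages counters (a real tool would match the specific address range it cares about):

    #include <stdio.h>

    /* Sum the AnonHugePages fields from /proc/self/smaps as a rough signal
     * of how much of this process's anonymous memory is THP-backed. */
    int main(void)
    {
        FILE *f = fopen("/proc/self/smaps", "r");
        if (!f) { perror("fopen"); return 1; }

        char line[256];
        unsigned long total_kb = 0, kb;
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
                total_kb += kb;
        fclose(f);

        printf("AnonHugePages total: %lu kB\n", total_kb);
        return 0;
    }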
Mimalloc has support for huge pages. It also has an option to reserve 1GB pages on program start, and I've had very good performance results using that setting and replacing Factorio's allocator, on both Windows and Linux.
> Often, the CPU has a different number of entries for each page size.
- Does it mean userspace is free to allocate up to a maximum of 1G? I took pages to have a fixed size.
- Or do you mean CPUs reserve TLB entries depending on the requested page size?
> With 2 MiB or 1 GiB pages, there is less contention and more of the working set is covered by the TLB
- Would memory allocators / GCs need to be changed to deal with blocks of 1G? Would you say, the current ones found in popular runtimes/implementations are adept at doing so?
- Does it not adversely affect databases accustomed to smaller page sizes now finding themselves paging in 1G at once?
> my PhD is on memory management and address translation on large memory systems
If the dissertation is public, please do link it, if you're comfortable doing so.
> - Does it mean userspace is free to allocate up to a maximum of 1G? I took pages to have a fixed size.
> - Or do you mean CPUs reserve TLB entries depending on the requested page size?
The TLB is a hardware cache with a limited number of entries that cannot change dynamically. Your CPU ships with a fixed number of entries dedicated to each page size. Translations of base 4 KiB pages could, for example, have 1024 entries, translations of 2 MiB pages could have 512 entries, and those of 1 GiB pages usually have a very limited number, only 8 or 16. Nowadays, most CPU vendors have increased their 2 MiB TLBs to have the same number of entries as those dedicated to 4 KiB pages.
If you're wondering why they have to be separate caches, it's because, for any page in memory, you can have mappings of both sizes at the same time, from different processes or different parts of the same process, possibly with different protections.
> - Would memory allocators / GCs need to be changed to deal with blocks of 1G? Would you say, the current ones found in popular runtimes/implementations are adept at doing so?
> - Does it not adversely affect databases accustomed to smaller page sizes now finding themselves paging in 1G at once?
Runtimes and databases have full control, and Linux allows per-region policies via the madvise() system call. If a program is not happy with huge pages, it can ask the kernel to leave a region alone, just as it can choose to be cooperative.
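For example, a database that prefers small pages for a sparsely touched arena can opt that range out; a minimal sketch (the region and size are made up):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 32UL * 1024 * 1024;   /* illustrative 32 MiB arena */
        void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) { perror("mmap"); return 1; }

        /* Opt this range out of THP, e.g. because 2 MiB faults here would
         * mostly allocate memory that is never touched. */
        if (madvise(arena, len, MADV_NOHUGEPAGE) != 0)
            perror("madvise(MADV_NOHUGEPAGE)");

        munmap(arena, len);
        return 0;
    }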
> If the dissertation is public, please do link it, if you're comfortable doing so.
I'm still in the PhD process, so no cookies atm :D
I think modern Intel/AMD CPUs have the same number of dTLB entries for all page sizes.
For example, on a modern CPU with 3k TLB entries one can access at most:
- 12MB with 4k page size
- 6GB with 2M page size
- 3TB with 1G page size
If the working set per core is bigger than the numbers above, you get 10-20% slower memory accesses due to the TLB miss penalty.
There is an unfortunate side effect of large pages. Each page (whether huge, large, or small) must be mapped with a single protection that applies to the entire page. This is because hardware memory protection is on a per-page basis. If a large page contains, for example, both read-only code and read/write data, the page must be marked as read/write, meaning that the code will be writable. As a result, device drivers or other kernel-mode code could, either maliciously or due to a bug, modify what is supposed to be read-only operating system or driver code without causing a memory access violation.
There's probably no good reason to put code and data on the same page; it costs just one extra TLB entry to use two pages instead, so the data page can be marked non-executable.
Search for huge pages in the documentation of a DBMS that implements its own caching in shared memory: Oracle [1], PostgreSQL [2], MySQL [3], etc. When you're caching hundreds of gigabytes, it makes a difference. Here's a benchmark comparing PostgreSQL performance with regular, large, and huge pages [4].
There was a really bad regression in Linux a couple of years ago that killed performance with large memory regions like this (can't find a useful link at the moment), and the short-term mitigation was to increase the huge page size from 2MB to 1GB.
It's nice for type 1 hypervisors when carving up memory for guests. When page walks for guest-virtual to host-physical end up taking sixteen levels, a 1G page short-circuits that, cutting it in half to eight.