Every time I hear someone talking about writing a memory allocator, it always reminds me of the brilliant CppCon 2015 talk by Andrei Alexandrescu about how he wrote a modular allocator library that can be built up in pieces to make whatever you like.
The only problem is that he is wrong on almost every detail. He has apparently never heard of mmap, which is awkward, because mmap is how memory is obtained from the OS nowadays.
And if you want, you know, performance, an allocator needs to know something about threads: pre-allocating chunks to assign to threads, and batching free operations per thread.
Most of what you can find about writing allocators is written by people who don't actually know how; and actually useful allocators are made by people who don't write tutorials about it.
What is he wrong about? He's upfront about limiting the scope to single-threaded arena allocators, but still mentions mmap and threads in passing. For talks that are mostly sales pitches, I think it's a pretty informative introduction to how to select and compare allocators.
mmap is not only for mapping file contents into a process's address space; it can also be used to request memory from the OS via the MAP_ANONYMOUS flag. Early on, a malloc-like allocator would use brk/sbrk to get a chunk of "raw" memory and manage it for the user, providing access via malloc/free. Nowadays the same "raw" memory can be obtained using an anonymous map. A raw pool, when unused, can be returned to the OS by unmapping it.
Using an anonymous map provides address-space randomization by default (for the raw pool's location), whereas brk/sbrk extend a limit and are therefore predictable.
Thread-based allocators are a bit overrated. In apps it’s quite common for objects to migrate across threads, especially from workers to the main thread.
Then the memory goes to the freeing thread's pool. Thread-aware mallocs know this and have clever balancing mechanisms for situations like producer-consumer, where one thread allocates a lot and another frees a lot.
Edit: wrong, I re-checked. The solution to this is that marking an object as free is lock-free. The freed objects do not immediately change thread pools.
It’s a great talk, but something I would watch after learning how to implement a malloc, since it covers other kinds of allocators and how to make them composable rather than, say, “how would I encode a freelist”.
I've been listening to Dmitry's lessons and lectures lately, I find they are really straightforward and well delivered. You can find some of his other lectures here:
Implementing a memory allocator seems to require system calls. What happens when there's no OS? For example on an MCU like Arduino, is the memory allocator packaged along with the program data? If so and in the case of an Arduino, is the code for it available anywhere? I would be curious to see how this is implemented on an embedded system.
In every allocator there has to be a way to get “the memory” that is available for you to hand out. When a normal OS is present, system calls are used to get this in the form of pages. On embedded environments you don’t need to do this because “the memory” is just something like “addresses from 0x2000000 to 0x8000000” and you don’t have to ask for it, it’s just there for you to use.
You could fairly easily implement a bitmapped memory allocator which hands out chunks of reserved memory. I think you could build a more complex allocator on top of that if needed. For a lot of embedded applications, you really want to avoid dynamic memory allocation though. Use mempools if you need dynamic memory, which will at least limit fragmentation.
Second this. It’s really easy to fragment memory, and for long-running embedded systems it’s not uncommon for free() to be a no-op. I had to implement malloc just last week to access PSRAM, and after doing a little research I decided all I needed was an offset and some byte alignment. Since I never needed to deallocate the memory, it was very simple code.
You generally don’t have dynamic memory allocation on such a system. All memory will either be statically allocated globals, or on the stack. Some systems even require stack allocation to be statically determined, which means no recursion.
For Arduino specifically, it certainly does have an allocator and heap by default, even on MCUs with less than 1 KB of RAM. However, there's no separation, so the stack and heap grow towards each other. If the heap is never used, it stays at a size of 0. On the code side, simply not using malloc (or anything that uses it) keeps it from being linked in.
I built a memory allocator for testing not too long ago that only allocated and no-op'd deallocate (using the C++ Allocator interface). It was quite amazing that it resulted in a 25-50% performance boost in many of the cases I was testing. It's not for real code, as you only get destruction but not deallocation, but it let me pull that part out of the tests to get a better view of the performance.
In case you weren't aware, this is a common enough technique for it to actually have a name: bump allocation. It's very useful in short lived programs with bounded allocations, for the exact reason you outlined.
This is why composable allocators are fantastic. You can request a dynamic memory slab for a given context, then allocate as you wish with a no-op delete; when you're done with the context, you delete all of it in one go.
Not just that: malloc is faster if you don’t have to care about thread safety, or support interposing, or hooks, or if security is not a concern. It is easy to “beat malloc” in a specific usecase if you know what it will be beforehand. Doing this in general is what is hard.
Nice! At CMU, we have 15-213, an introductory course on computer systems and one of the hardest/fun assignments is implementing malloc. We end up using seglists, mini-blocks and other optimizations, ending up with a faster implementation than the standard malloc itself on several benchmark tests.
I believe Bryant and O'Hallaron wrote a textbook just for this course. You can always "optimize" it by reverse engineering the benchmark app and providing the correctly aligned blocks to the benchmarking application. It'll beat the standard malloc calls by several orders of magnitude if you were to do that.
This is one of the reasons why people write custom allocators - to suit a specific purpose where the allocation pattern is known well in advance.
I did that lab and I recall the professor saying “this will be the hardest code some of you have ever written.” I suppose that's true if it's an intro course at CMU? Debugging was certainly mind-numbing, reading pages of hex to figure out where the bug is. I don’t think I ever did get realloc to pass all the tests.
I am curious: is it possible to implement a headerless allocator? Basically keep an index of free blocks and an index of used blocks using a B-tree or a skip list, and allocate and free by simply moving entries from one index to the other. Presumably this would make it more cache friendly.
Toying with allocators is one of my favorite programming pastimes for some reason.
I remember writing an incomplete memory allocator, and it did increase the speed of my application, but then debugging memory problems became less ideal because valgrind couldn't tell me about memory boundaries any more. Make sure you put proper guards in your debug builds to trigger a segfault :)
Valgrind comes with C macros for exactly that kind of use case: you can inform it about the state of your custom allocator, i.e. which parts of the heap are available/forbidden when this differs from its own tracking of the system calls and malloc usage.
A modern lecture on memory allocators should also cover developer ergonomics with respect to memory-safety features like leak detection and use-after-free detection, as well as allocation statistics in general. These are must-have features of any memory allocator library.
https://www.youtube.com/watch?v=LIb3L4vKZ7U