Every time I hear someone talking about writing a memory allocator, it always reminds me of the brilliant CppCon 2015 talk by Andrei Alexandrescu about how he wrote a modular allocator library that can be built up in pieces to make whatever you like.
The only problem is that he is wrong on almost every detail. He has apparently never heard of mmap, which is awkward, because mmap is how memory is obtained from the OS nowadays.
And if you want, you know, performance, an allocator needs to know something about threads: pre-allocating chunks to assign to threads, and batching free operations per thread.
Most of what you can find about writing allocators is written by people who don't actually know how; and actually useful allocators are made by people who don't write tutorials about it.
What is he wrong about? He's upfront about limiting the scope to single-threaded arena allocators, but still mentions mmap and threads in passing. For talks that are mostly sales pitches, I think it's a pretty informative introduction to how to select and compare allocators.
mmap is not only for mapping file contents into a process's address space; it can also be used to request memory from the OS via the MAP_ANONYMOUS flag. Early on, a malloc-like allocator would use brk/sbrk to get a chunk of "raw" memory and manage it for the user, providing access via malloc/free. Nowadays the same "raw" memory can be obtained using an anonymous map. A raw pool, when unused, can be returned to the OS by unmapping it.
Using an anonymous map provides address-space randomization by default (for the raw pool's location), whereas brk/sbrk extend a limit and are therefore predictable.
Thread-based allocators are a bit overrated. In apps it’s quite common for objects to migrate across threads, especially from workers to the main thread.
Then the memory goes to the freeing thread's pool. Thread-aware mallocs know this and have clever balancing mechanisms for situations like producer-consumer, where one thread allocates a lot and another frees a lot.
Edit: wrong, I re-checked. The solution to this is that marking an object as free is lock-free. The freed objects do not immediately change thread pools.
It’s a great talk, but something I would watch after learning how to implement a malloc, since it covers other kinds of allocators and how to make them composable rather than, say, “how would I encode a freelist”.
I've been listening to Dmitry's lessons and lectures lately, I find they are really straightforward and well delivered. You can find some of his other lectures here:
Implementing a memory allocator seems to require system calls. What happens when there's no OS? For example on an MCU like Arduino, is the memory allocator packaged along with the program data? If so and in the case of an Arduino, is the code for it available anywhere? I would be curious to see how this is implemented on an embedded system.
In every allocator there has to be a way to get “the memory” that is available for you to hand out. When a normal OS is present, system calls are used to get this in the form of pages. On embedded environments you don’t need to do this because “the memory” is just something like “addresses from 0x2000000 to 0x8000000” and you don’t have to ask for it, it’s just there for you to use.
You could fairly easily implement a bitmapped memory allocator which hands out chunks of reserved memory. I think you could build a more complex allocator on top of that if needed. For a lot of embedded applications, you really want to avoid dynamic memory allocation though. Use mempools if you need dynamic memory, which will at least limit fragmentation.
Second this. It’s really easy to fragment memory, and for long-running embedded systems it’s not uncommon for free() to be a no-op. I had to implement malloc just last week to access PSRAM, and after doing a little research I decided all I needed was an offset and some byte alignment. Since I never needed to deallocate the memory, it was very simple code.
You generally don’t have dynamic memory allocation on such a system. All memory will either be statically allocated globals, or on the stack. Some systems even require stack allocation to be statically determined, which means no recursion.
For Arduino specifically, it certainly does have an allocator and heap by default, even on MCUs with less than 1 KB of RAM. However, there's no separation, so the stack and heap grow towards each other. If the heap is never used, it stays at a size of 0. On the code side, simply not using malloc (or anything that uses it) keeps it from being linked in.
I built a memory allocator for testing not too long ago that only allocated and no-op'd deallocate (using the C++ Allocator interface). It was quite amazing that it resulted in a 25-50% performance boost in many of the cases I was testing. It's not for real code, as you only get destruction but not deallocation, but it let me pull that part out of the tests to get a better view of the performance.
In case you weren't aware, this is a common enough technique for it to actually have a name: bump allocation. It's very useful in short lived programs with bounded allocations, for the exact reason you outlined.
This is why composable allocators are fantastic. You can request a dynamic memory slab for a given context, then allocate as you wish with a no-op delete; when you're done with the context, you delete all of it in one go.
Not just that: malloc is faster if you don’t have to care about thread safety, or support interposing, or hooks, or if security is not a concern. It is easy to “beat malloc” in a specific usecase if you know what it will be beforehand. Doing this in general is what is hard.
Nice! At CMU, we have 15-213, an introductory course on computer systems and one of the hardest/fun assignments is implementing malloc. We end up using seglists, mini-blocks and other optimizations, ending up with a faster implementation than the standard malloc itself on several benchmark tests.
I believe Bryant and O'Hallaron wrote a textbook just for this course. You can always "optimize" it by reverse engineering the benchmark app and providing the correctly aligned blocks to the benchmarking application. It'll beat the standard malloc calls by several orders of magnitude if you were to do that.
This is one of the reasons why people write custom allocators - to suit a specific purpose where the allocation pattern is known well in advance.
I did that lab and I recall the professor saying “this will be the hardest code some of you have ever written.” I suppose that's true if it's an intro course at CMU? Debugging was certainly mind-numbing, reading pages of hex to figure out where the bug is. I don’t think I ever did get realloc to pass all the tests.
I am curious: is it possible to implement a headerless allocator? Basically keep an index of free blocks and an index of used blocks using a B-tree or a skip list, and allocate and free by simply moving entries from one index to the other. Presumably this would make it more cache friendly.
Toying with allocators is one of my favorite programming pastimes for some reason.
I remember writing an incomplete memory allocator, and it did increase the speed of my application, but then debugging memory problems became less ideal because valgrind couldn't tell me about memory boundaries any more. Make sure you put proper guards in your debug builds to trigger a segfault :)
Valgrind comes with C macros for exactly that kind of use case: you can inform it about the state of your custom allocator, i.e. which parts of the heap are available/forbidden when this differs from its own tracking of the system calls and malloc usage.
A modern lecture on memory allocators should also cover developer ergonomics with respect to memory-safety features like leak detection and use-after-free detection, as well as allocation statistics in general. These are must-have features of any memory allocator library.
https://www.youtube.com/watch?v=LIb3L4vKZ7U