> Is this ever a real issue, even on any embedded system in the last 20 years?
Ask Cisco when they cut the Linksys routers' RAM in half a few years ago. Every byte counts. Component cost savings add up when you make a few million of them.
Probably yes, because Intel knows this is the code every compiler outputs for zeroing a register.
Also, the reason it is "faster" is the encoding: "xor ebp, ebp" is 2 bytes, vs. 7 bytes for "mov rbp, 0" - a REX prefix, the opcode, a ModRM byte, and a 4-byte sign-extended immediate (the full 64-bit-immediate form is even longer, at 10 bytes).
Technically you could get by with 5 bytes for "mov ebp, 0".
Another reason it was faster is that the processor recognized the idiom and avoided partial-flags stalls after an "inc". But in 64-bit code you rarely use "inc" at all, so that matters less. On the other hand, a few years ago XOR had a false dependency on the register you're clearing; I'm not sure whether that's still the case on more recent processors.
`malloc`'s not really that bad. There are a few different approaches you can take, but none of them are terribly complicated, since the two basic memory allocation interfaces, `sbrk` and `mmap`, are fairly simple to use for generic allocations. But getting it all working and bug-free still takes time. Same with stuff like `printf` and `scanf` (Though I'd actually argue those are harder to write than `malloc` if you're looking to be feature-complete. `printf` has billions of features and I'm pretty sure `scanf` requires some extra black magic internally).
There's no doubt that this is a fun project though - if you or someone else enjoys this type of stuff, you should definitely try your hand at writing a simple Unix kernel or similar, you'd probably enjoy it.
On that note though, the writer's aversion to inline assembly is unfortunate. It's a necessary evil for this type of programming. The syntax is ugly, but it's not really that hard to get used to (Especially since the large majority of inline assembly is just a few lines long, or even just one line long). In particular, the syscall wrappers can be done in a one-line piece of inline assembly, and then you can avoid the function-call overhead for the syscall by placing the inline assembly in a `static inline` function in your headers (Or a macro if you prefer), as well as avoid the extra .S file (Which IMO is the better part - it's always easier when you don't have to mix different languages like that).
I would also add that, while I used to share the author's aversion to AT&T asm syntax, virtually all of the assembly code out there related to Linux is written in AT&T, so it's worth getting used to it and at least being able to read it. That said, you can use Intel syntax in inline assembly if you prefer, so even if you hate AT&T with a passion you can still write inline assembly ;)
You can get surprisingly far without using libc's malloc/free. E.g. TeX, the typesetting system by Knuth, implements its own dynamic memory handling. It has a large static array of bytes, and allocates from that when needed.
Arenas are really nice if you're allocating a lot of objects of the same size, whereas malloc() must be prepared to handle a lot of different memory usage patterns.
I don't think it is. Differently sized objects can be allocated and released individually. Have a look at part 9 of [1]. In an arena based allocator you typically deallocate all the objects in an arena at once.
TeX basically uses a special purpose implementation of malloc/free, with a static array as backing instead of memory requested from the OS with mmap(2) or sbrk(2). The main reason is portability (the original version was released in 1978 using WEB/Pascal).
FreeRTOS also provides a few malloc implementations backed by static arrays (not dependent on sbrk), which can be useful for running malloc-based test code on embedded platforms without native malloc: http://www.freertos.org/a00111.html
While one of the benefits of an arena allocator is to be able to deallocate everything at once, it's not that unusual to have an arena allocator that you can deallocate from "early" if needed.
For short-lived processes that do lots of allocations and where you can rely on the OS to release resources, just leaving out the deallocations is often faster.
Of course, you need to be careful: if you write code like that in a language without garbage collection, it's inherently not reusable. Retrofitting deallocation is often really painful, because when you never have to ensure things can be deallocated in the right order, it's easy to adopt patterns that make object ownership unclear.
Not necessarily. Many programs are written in a way that allocates all the needed heap space at startup and just reuses it forever. And those are overrepresented in minimal-system kinds of environments.
malloc() isn't particularly hard; K&R provides a working implementation using a freelist and sbrk in about a page of code. It's printf() that's the horrendous feature-crammed nightmare.