
A GC doesn't check whether memory is unused; it looks for the memory that's still in use and frees whatever is left over. Having live memory be only a small fraction of total memory is what makes GC cheap (in compute time). It also means that allocating short-lived memory with abandon is cheap, if that's most of your allocation.
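
To make "frees whatever is left over" concrete, here's a toy sketch (plain C++ with made-up names like Node and copy_live, not any real runtime's code) of one copying-collection step: only the nodes reachable from the roots are visited and copied into the new space, and everything left behind is reclaimed wholesale, so the work tracks the live data rather than the total heap.

    #include <memory>
    #include <unordered_map>
    #include <vector>

    struct Node {
        std::vector<Node*> children;
        // payload omitted
    };

    // Copy every node reachable from `roots` into `to_space`, fixing up child
    // pointers. Whatever was never copied is garbage and is never even visited.
    std::vector<Node*> copy_live(const std::vector<Node*>& roots,
                                 std::vector<std::unique_ptr<Node>>& to_space) {
        std::unordered_map<Node*, Node*> forwarded;   // old address -> new address
        std::vector<Node*> worklist;

        auto forward = [&](Node* old) -> Node* {
            auto it = forwarded.find(old);
            if (it != forwarded.end()) return it->second;
            to_space.push_back(std::make_unique<Node>(*old));  // copy the live node
            Node* copy = to_space.back().get();
            forwarded[old] = copy;
            worklist.push_back(copy);
            return copy;
        };

        std::vector<Node*> new_roots;
        for (Node* r : roots) new_roots.push_back(forward(r));
        while (!worklist.empty()) {
            Node* n = worklist.back();
            worklist.pop_back();
            for (Node*& c : n->children) c = forward(c);  // work is per live edge
        }
        return new_roots;
    }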



Got it, that's an interesting point, and it follows that a use case can be engineered where a GC adds essentially zero overhead over an arena allocator.

But in the general case where some objects are short-lived and others aren't, surely manually splitting your allocations to malloc for long-lived ones and some sort of arena_alloc for short-lived ones ought to be faster than allocating them all in one place, then copying the ones which are still reachable out of the area reserved for short-lived objects?

(This is not to say a GC-based system will be slower "on average", because nobody knows what the "average" is. A realistic arena-based system can have objects most of which are short-lived, but some do need to live longer, and you only find out long after they're allocated; in that case you have to manually relocate those objects just like a GC would, and doing that, say, the C++ way is definitely more bug-prone than a GC's bug-free handling of it. One way to make it less bug-prone in C++ is to make deeper copies instead of trying to minimize copying, and then you might easily be slower than a GC. I'm just saying that it's very easy to find a case where a system that gets no hints from the programmer about object lifetimes, and instead discovers them fully automatically, is slower than a system which does get those hints. And of course a GC could provide ways to supply these hints; I'm just not aware of one that does - perhaps it's avoided on the theory that the GC algorithm might change, and you don't want hints expressed in terms that aren't portable between algorithms to become part of your interface.)
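
For what it's worth, a minimal sketch of the split being asked about, under the assumption of a per-request workload (Arena, Temp, Session and handle_request are made-up names, not any real API): a bump-pointer arena for the short-lived objects, the general-purpose heap for the long-lived ones.

    #include <cassert>
    #include <cstddef>
    #include <memory>
    #include <new>
    #include <string>
    #include <vector>

    class Arena {
        char* buf_;
        std::size_t cap_, used_ = 0;
    public:
        explicit Arena(std::size_t cap)
            : buf_(static_cast<char*>(::operator new(cap))), cap_(cap) {}
        ~Arena() { ::operator delete(buf_); }
        void* alloc(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
            used_ = (used_ + align - 1) & ~(align - 1);  // bump to alignment
            assert(used_ + n <= cap_);                   // a real arena would grow
            void* p = buf_ + used_;
            used_ += n;                                  // allocation is one addition
            return p;
        }
        void reset() { used_ = 0; }  // every short-lived object disappears at once
    };

    struct Temp    { int scratch[64]; };   // short-lived: per-request working data
    struct Session { std::string user; };  // long-lived: outlives any one request

    void handle_request(Arena& arena, std::vector<std::unique_ptr<Session>>& sessions) {
        Temp* t = new (arena.alloc(sizeof(Temp), alignof(Temp))) Temp{};  // arena_alloc
        t->scratch[0] = 1;
        sessions.push_back(std::make_unique<Session>());  // malloc/new, freed individually
        arena.reset();  // all Temps gone in O(1); Sessions survive untouched
    }

Whether this actually beats a generational GC's nursery is exactly the benchmarking question raised in the reply below.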


> But in the general case where some objects are short-lived and others aren't, surely manually splitting your allocations to malloc for long-lived ones and some sort of arena_alloc for short-lived ones ought to be faster than allocating them all in one place, then copying the ones which are still reachable out of the area reserved for short-lived objects?

This is where things get difficult, actually, and you need benchmarks because:

1. On the GC side you still have a bump allocator, which is much faster than a general-purpose first-fit/best-fit allocator and generally has better memory locality than a pool allocator.

2. Offsetting that may be the additional tracing and/or copying you are doing as a result of garbage collection.

3. Manual memory management techniques often have their own performance costs: std::shared_ptr and std::unique_ptr both add overhead, and you sometimes see additional copying where lifetimes are difficult to predict (a small sketch of this follows below).

Which one has the higher cost is often something you can only determine by testing actual code (and it can go either way).
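
To illustrate point 3 with a small, hedged example (Doc and the index_* functions are invented for illustration): when a value's lifetime is hard to pin down, C++ code tends to fall back either on std::shared_ptr, which pays atomic reference-count traffic on every copy of the handle, or on deep copies; a tracing GC pays neither of those particular costs.

    #include <memory>
    #include <string>
    #include <vector>

    struct Doc { std::string text; };

    // Lifetime unclear: the caller and the index may both need the Doc later,
    // so ownership is shared. Every copy of the shared_ptr bumps an atomic count.
    void index_shared(std::vector<std::shared_ptr<Doc>>& index,
                      const std::shared_ptr<Doc>& d) {
        index.push_back(d);   // atomic increment here, atomic decrement on removal
    }

    // The copy-heavy alternative that avoids shared ownership entirely: no
    // refcount traffic, but the payload is duplicated.
    void index_by_copy(std::vector<Doc>& index, const Doc& d) {
        index.push_back(d);   // copies the string
    }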

I'll also note that this is primarily of interest for functional languages, which often have a high allocation rate. For imperative programs, the memory allocation rate (and ratio of memory that contains pointers to memory that doesn't) is often so low that even a very basic mark/sweep collector would be fine, as long as pause times don't matter in your application domain. This means that it's often not really a practical issue, one way or the other.


IMO the design of the memory system has (or should have) a strong influence on the architecture of a system when performance is a big concern. What the "average" is, is under the control of the program itself; a program written for a generational GC should be structured differently from one using a refcount-based GC, and both should differ from one written for a simple mark/sweep GC.

Similarly, designing a program with manual allocation for speed means using some kind of zone or arena allocation, probably mixed with some stack-oriented allocation that mirrors the control stack (not necessarily consuming CPU stack space). The design of the memory system needs to be integrated with the architecture of the program and the lifecycle of the values it needs to track.
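
As one hedged sketch of that stack-mirroring idea (illustrative names; a tiny bump arena is re-sketched so the example stands alone, and alignment handling is omitted): record the arena's position when a phase begins and rewind to it when the phase returns, so the phase's temporaries live and die with the call without consuming CPU stack space.

    #include <cassert>
    #include <cstddef>
    #include <new>

    class Arena {
        char* buf_;
        std::size_t cap_, used_ = 0;
    public:
        explicit Arena(std::size_t cap)
            : buf_(static_cast<char*>(::operator new(cap))), cap_(cap) {}
        ~Arena() { ::operator delete(buf_); }
        void* alloc(std::size_t n) {
            assert(used_ + n <= cap_);      // a real arena would grow instead
            void* p = buf_ + used_;
            used_ += n;
            return p;
        }
        std::size_t mark() const { return used_; }
        void rewind(std::size_t m) { used_ = m; }  // drops everything allocated after m
    };

    // RAII guard whose lifetime mirrors a frame on the control stack, while the
    // memory it governs lives in the arena rather than on the CPU stack.
    class ArenaScope {
        Arena& a_;
        std::size_t saved_;
    public:
        explicit ArenaScope(Arena& a) : a_(a), saved_(a.mark()) {}
        ~ArenaScope() { a_.rewind(saved_); }
    };

    void phase(Arena& arena) {
        ArenaScope scope(arena);
        double* workspace = static_cast<double*>(arena.alloc(4096 * sizeof(double)));
        workspace[0] = 0.0;  // ... phase-local computation ...
    }  // scope's destructor rewinds the arena: the workspace is freed wholesale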

I don't think there's any simple winner. Different problem spaces require different treatments of memory. For example, stateless servers have little need for long-lived memory; they're well suited to generational GCs, but also to arenas (though GC is easier to keep correct, and usually more fluent in practice, without rewriting too much of the standard library). Mix in in-process caching, and things start getting murkier. Put caching in a different process, keeping it simple, at the cost of some IPC; tradeoffs, etc.


In general you can't just drop the contents of the arena on the floor in a GCed language, since you might have been wrong about those objects being short-lived (or because they are short-lived but were allocated shortly before the young heap filled up and happen to still be alive). So tracing/copying is still necessary.

It is quite possible for objects that are expected to tenure to be allocated directly into the major heap. Many systems use various criteria to decide when this should be done. In particular it is common to allocate very large objects directly to avoid ever having to copy them.

There are also static analyses which attempt to determine which objects can be safely allocated in an arena-like way, that is, region inference. In some systems in addition to inference the programmer is given ways to specify that a given object should live within a certain region. MLkit is the usual example of such a system. Region systems don't seem to have been particularly popular, but there's still some work being done on them here and there.



