gsg's comments

This works until the analogue of monotremes shows up to ruin your supposedly flawless categorisation.


I didn't say the categorization was flawless; I said it reduced the complexity of the information. Categories can be useful without being flawless.


It is not a “can be”, it’s a “must be”. Models, or abstractions of complexity, are necessarily reductionist; that is what makes them useful.


YAGNI


System F doesn't have general recursion.

Extensions with a letrec-like construct are common, and are sometimes inaccurately called 'System F', but those languages do not have the properties of System F.


You're right. I must have been thinking of one of those extensions you're talking about (F# maybe?). I should have remembered that System F is part of the lambda cube, so it's at least as consistent as CoC.


There are some examples; Crowbar, for instance, is an OCaml tool that uses AFL to drive property-based tests.


chrisseaton is talking about the bump-pointer allocator in a modern GC, not an implementation of malloc/free. The performance characteristics are quite different.

In a generational copying system, an object that is bump allocated and then dies before being copied out of the young generation is indeed cheap: dead objects in the young generation don't need to be freed or even looked at, because the space for the young generation can simply be reused after everything live has been copied out of it. The slow part is elsewhere.
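
To make that concrete, here is a minimal sketch of the fast path being described. The names (nursery_top, gc_alloc and so on) are made up for illustration; this is not any particular VM's code:

    #include <stddef.h>
    #include <stdint.h>

    /* Toy young generation; a real VM would size this per thread. */
    static uint8_t  nursery[1 << 20];
    static uint8_t *nursery_top = nursery;
    static uint8_t *nursery_end = nursery + sizeof nursery;

    /* Slow path: a real generational GC copies the live objects out of the
       nursery here; the dead ones are never touched, and the whole space is
       reused just by resetting the pointer. */
    static void *gc_alloc_slow(size_t bytes) {
        nursery_top = nursery;
        void *obj = nursery_top;
        nursery_top += bytes;
        return obj;
    }

    /* Fast path: one comparison and one add - the "bump". */
    void *gc_alloc(size_t bytes) {
        if ((size_t)(nursery_end - nursery_top) < bytes)
            return gc_alloc_slow(bytes);
        void *obj = nursery_top;
        nursery_top += bytes;
        return obj;
    }

    int main(void) {
        long *p = gc_alloc(sizeof *p);   /* costs a compare and an add */
        *p = 42;
        return (int)*p;
    }
The point is that the common case is a compare and an add; all the real work (copying survivors, growing the heap) lives in the slow path, which runs comparatively rarely.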


There are no situations where it makes sense to say that lots of allocations are cheap and only cost a few instructions.

If lots of allocations are happening without freeing, more memory will have to be mapped in anyway on top of the constant pointer bumping - which means that a single allocation of an array would make sense instead.

If lots of allocations are happening with freeing, 'objects'/allocations will have to be moved around constantly to cater to the 'cheap' bump allocation.

If lots of allocations are happening and are then all freed uniformly at the same time, it makes no sense to use lots of allocations, since one array allocation would be fine.

Basically, making lots of tiny allocations marginally cheaper is a terrible strategy: it will always be a drag on performance because it is a naive and wasteful way to write software.


> There are no situations where it makes sense to say that lots of allocations are cheap and only cost a few instructions.

This is so trivially shown to be false that I suspect you are trolling. There is, for example, the situation I showed in https://news.ycombinator.com/item?id=26438596. Moreover, as gsg says, this is very generally true of pointer-bumping allocators in modern GCs; SBCL's GC isn't even all that modern.


What you are talking about is the equivalent of adding 1 to the size of a C++ vector with a large capacity on each iteration of a loop.

The memory has already been mapped in (through a blocking system call) and it is contiguous. The "allocation" here is just the same uniform increment in an unbroken span of memory already in the process, without taking into account anything that makes memory allocation problematic for performance.

There is no point to the situation you described, it is a nonsense way to do something trivial. If that's what you want, allocate a single array.

The fact remains that memory allocation is often huge low hanging fruit for optimization. If it were 'just a few instructions' this would not be true. Garbage collection does not change this. You can read endless threads about 'performance Java' people talking about allocating all their memory when their program starts and never inside their loops.

The reasons why have mostly been explained (system calls, heaps, copying in GCs, cache effects, TLB effects, etc.) though people have left out other nuances like mapped pages triggering interrupts on their first write to save time in the memory mapping call.

I think it will probably help to look into this further; thinking memory allocation is just bumping a pointer is a simplification that will confuse any optimization effort until you understand it and profile.


> There is no point to the situation you described, it is a nonsense way to do something trivial.

I mean... it's how real allocators work in real VMs, so it's not nonsense, is it?


I'm not even sure I understand what you are asking. Do you think that memory allocation is always bumping a pointer a uniform amount, with no memory being freed, no variance in allocation size and no interlaced lifetimes?

You originally said that memory allocation was 'just a few instructions' when the reality is that it is complex and avoiding allocation is the first thing to do when optimizing software.

Saying that memory allocation is simple and fast based on a scenario that should never happen because it's useless is a little silly. No, that is not how allocators work.


> do you think that memory allocation is always bumping a pointer a uniform amount with no memory being freed, no variance in allocation size and no interlaced lifetimes?

In modern language implementations with moving collectors allocation is literally what I've described in the fast path. Not theoretical. That's literally the instructions used.

> You originally said that memory allocation was 'just a few instructions'

Here it is in a real industrial compiler and garbage collector, for example.

https://github.com/oracle/graal/blob/f9530b0948c58a66845e84f...

It's precisely as I've described.


It is not at all. That's like someone looking at perfect weather, blowing up an air mattress on the beach and wondering why anyone would need a house.

You are only on the 'fast path' when a single allocation of an array would have been much faster anyway. It isn't difficult to be 'fast' when accomplishing something trivial that would be even faster if done directly.

Why do you keep saying the same thing while ignoring freeing memory, different lifetimes and wildly different sizes of allocations (which are why memory allocators actually exist)?

This idea that memory allocation is cheap is a blight on performance. Are you really allocating memory inside your hot loops and having it not show up when you profile?


> You are only on the 'fast path'

Isn't that what I've said all along? I said 'amortised' in my very first comment. Most allocations are from the fast path... that's why they bother to make it fast.

> when a single allocation of an array would have been much faster anyway

TLAB allocation is heterogeneous (which solves your size question). An array isn't. A TLAB can be evacuated and kept fragmentation-free (which solves your lifetime problem). An array can't.

> Why do you keeping saying the same thing while ignoring freeing memory

I said 'Deallocation though - yes that's slow!' in my very first comment.


Your original comment was trying to argue that allocations aren't a big performance problem. Anyone with experience optimizing knows this is not true and that minimizing allocations is the first thing to look at, because it will most likely have the largest payoff with the least effort. Incorrect preconceptions and misinformation don't help people.

Saying 'allocations are fast' because you can make unnecessary allocations cheaper while ignoring deallocation is basically a lie, it's playing some sort of language game while telling people the opposite of what is actually true.

> Most allocations are from the fast path... that's why they bother to make it fast.

This is where you have a deep misconception. Doing millions of allocations that could be one allocation is not fast. If you want to add 1000000 to a variable do you loop through it one million times and increment it by 1 every time?

> TLAB allocation is heterogenous (which solves your size question.) An array isn't. TLAB can be evacuated and fragmention-free (which solves your lifetime problem.) An array can't.

That's like building a skyscraper out of legos because they fit together so well. It's nonsensical, especially due to pages and memory mapping.

> I said 'Deallocation though - yes that's slow!' in my very first comment.

Do you have a house with a kitchen and no bathroom? The discussion is that excessive memory allocation is a huge factor in performance. Trying to play language games to ignore the entire cycle doesn't change that and misleads people. Allocated memory needs to be deallocated. Lots of tiny allocations of a few bytes each are a mistake. They cause huge performance problems and should be obviously unnecessary. You wouldn't pick one piece of cereal out of the box, you would pour it in.

Do you profile your programs? Do you work on optimizing them? I have a hard time believing you are doing something performance sensitive while trying to rationalize massive memory allocation waste.


> This is where you have a deep misconception. Doing millions of allocations that could be one allocation is not fast. If you want to add 1000000 to a variable do you loop through it one million times and increment it by 1 every time?

Great example... because guess what happens with TLAB allocation during optimisation? It will do exactly what you say and combine multiple allocations into a simple bump, like it would with any other arithmetic!

> Anyone with experience optimizing knows this is not true ... Do you profile your programs? Do you work on optimizing them? I have a hard time believing you are doing something performance sensitive

I'm not going to keep arguing on this forever. But I have a PhD in language implementation and optimisation and I've been working professionally and publishing in the field for almost a decade. I'd be surprised to find out I have deep misconceptions on the topic.


> Great example... because guess what happens with TLAB allocation during optimisation? It will do exactly what you say and combine multiple allocations into a simple bump, like it would with any other arithmetic!

Before, you were saying it isn't a problem at all; now you are saying it isn't a problem because a single Java allocator tries to work around it? That's a band-aid over something that you are trying to say isn't even a problem.

> I'm not going to keep arguing on this forever. But I have a PhD in language implementation and optimisation and I've been working professionally and publishing in the field for almost a decade. I'd be surprised to find out I have deep misconceptions on the topic.

But are you profiling and optimizing actual software? That's the real question, because as I keep saying, anyone optimizing software knows that lots of small allocations are the first thing to look for. You still haven't addressed this, even though the whole thread is about decreasing allocations for optimization and you seem to be saying it isn't a problem, which is misinformation.


> Before you were saying it isn't a problem at all

It isn't a problem - I was responding to what you thought was a problem - having to bump multiple times in a loop - that's what you asked me about? The bumps are tiny to start with, but even from that tiny starting point they still get collapsed. You said collapsing them was important, so I showed you how that also happens.

> as I keep saying, anyone optimizing software knows that lots of small allocations are the first thing to look for

In some cases I've made code faster by putting allocations back in that other people removed without enough context on how allocation works and based on cargo culting of allocation being simply bad like you're pushing. Sometimes allocation is much better than reusing existing allocated space, due to reasons of spatial and temporal locality, cache obliviousness, escape analysis, publication semantics, coherence and NUMA effects, and more.


To be clear, you specifically brought up an optimization that you are now saying was done for no reason since it gains nothing.

Now you are trying to say that you actually optimize programs by taking minimal memory allocation and instead using millions of tiny allocations of a few bytes each? Do I have this right? Feel free to link that example.

> Sometimes allocation is much better than reusing existing allocated space, due to reasons of spatial and temporal locality, cache obliviousness, escape analysis, publication semantics, coherence and NUMA effects, and more.

Are we to the part where you are just throwing out terminology?

Are you really digging in this hard? Virtually everything written about optimizing memory allocations is about doing it less.

Even this article is about decreasing memory allocations for performance. You made a claim that is very unsound in a pragmatic sense and anyone with experience would recognize that. You can just own up to that instead of going into a territory of wild claims and diversions.


> You can just own up to that instead of going into a territory of wild claims and diversions.

Mate I don’t know why you’ve gotten so worked up and aggressive about this.

But if you think I don’t know what I’m talking about despite all my work and publications on the subject then feel free to ignore me and go about your day happily.


No one is being worked up or aggressive, but I suppose this is the 'focus on how it was said' part.

To recap you started with a bold claim and instead of confronting or explaining the large holes you repeated the same thing more forcefully, gish galloped onto irrelevant issues, directly contradicted yourself with your own example, made an even greater claim about more allocations being faster, then moved on to calling me "aggressive".

Why not just confront the actual topic and the actual points I made? Why work so hard to avoid them?


Your comment is false from beginning to end, which I suppose is to be expected from a GPT-2-driven Eliza bot.


> Anyone with experience optimizing knows this is not true...This is where you have a deep misconception. …Do you profile your programs?

There's Dunning–Kruger, and then there's Dunning–Kruger of the level "telling a guy whose Ph.D. dissertation, Specialising Dynamic Techniques for Implementing The Ruby Programming Language, is about code optimization on GraalVM, that he has a deep misconception, and questioning whether he has any experience optimizing".

I repeat my complaint about "the climate of boastful intellectual vacuity this site fosters."


You realize this person that you are calling an expert claimed that they optimize software by putting tiny memory allocations into their tight loops right?

I'm not really sure how anything I'm saying is even controversial unless someone is desperate for it to be.

Anyone who has profiled and optimized software has experience weeding out excessive memory allocations since it is almost always the lowest hanging fruit.

No matter how fast allocating a few bytes is, doing what you are ultimately trying to do with those bytes is much faster.

Allocating 8 bytes in 150 cycles might seem fast, until you realize that modern CPUs can deal with multiple integers or floats on every single cycle.

A 12 year old CPU can allocate space for, and add together well over 850 million floats to 850 million other floats in a single second on a single core. You can download ISPC and verify this for yourself. By your own numbers, the allocation alone would take about a minute and a half.

Neither of you has confronted this. I'm actually fascinated by the lengths you both have gone to specifically to avoid confronting this. Saying that lots of small allocations have no impact on speed is counter to basically all optimization advice, and here I have explained why that is.


> You realize this person that you are calling an expert claimed that they optimize software by putting tiny memory allocations into their tight loops right?

You're either misunderstanding, or pretending to misunderstand for some reason.

What I said was that allocating fresh objects and using those can be faster than re-using stale objects in some failed attempt to optimise by reducing allocations.

Why would that be? For the reasons I explained: The newly allocated objects are guaranteed to already be in cache. Each new object is guaranteed to be close to the last object you used, because they're allocated next to each other. The new objects are not going to need any memory barriers, because they're guaranteed to not be published. The new objects are less likely to escape, so they're eligible for scalar replacement.

You dismissed all that as 'throwing out terminology'.

Here's a practical example:

  require 'benchmark/ips'

  def clamp_fresh(min, max, value)
    fresh_array = Array.new
    fresh_array[0] = min
    fresh_array[1] = max
    fresh_array[2] = value
    fresh_array.sort!
    fresh_array[1]
  end

  def clamp_cached(cached_array, min, max, value)
    cached_array[0] = min
    cached_array[1] = max
    cached_array[2] = value
    cached_array.sort!
    cached_array[1]
  end

  cached_array = Array.new

  Benchmark.ips do |x|
    x.report("use-fresh-objects") { clamp_fresh(10, 90, rand(0..100)) }
    x.report("use-cached-objects") { clamp_cached(cached_array, 10, 90, rand(0..100)) }
    x.compare!
  end
Which would you think is faster? The one that allocates a new object each iteration of the inner loop? Or the one that re-uses an existing object each time and doesn't allocate anything?

It's actually the one that allocates a new object each time. The cached one is 1.6x slower in an optimising implementation of Ruby.

It's faster... but the only change I made was I added an object allocation instead of the custom object caching. I went from not allocating any objects to allocating an object and it became faster. This example is so clear because of the last factor I mentioned - scalar replacement.

If you came along and 'optimised' my code based on a cargo cult idea of 'object allocation disastrously slow' you wouldn't be helping would you?


> You realize this person that you are calling an expert

I didn't claim he's an expert, but he did write TruffleRuby, which is still about twice as fast as any other implementation of Ruby six years later: https://pragtob.wordpress.com/2020/08/24/the-great-rubykon-b.... So I think it's safe to say that he's at least minimally competent at performance engineering :)

> I'm not really sure how anything I'm saying is even controversial

You're applying rules of thumb you've learned in one context in a context where they are invalid, and then you're accusing people who disagree with you of being ignorant or dishonest, even when it would take you literally 30 seconds to verify that what we're saying is correct, and where one of us literally has a doctorate in the specific topic we're discussing.

> Allocating 8 bytes in 150 cycles might seem fast

150 cycles doesn't sound terribly fast to me, though it'd be pretty good for malloc/free under most circumstances. I presented benchmark results from my 9-year-old laptop where LuaJIT did an allocation every 120 clock cycles, SBCL did an allocation every 18 cycles, and OCaml did an allocation every 9.5 cycles. You can easily replicate those results on your own machine. And that isn't the time for the allocation alone—it includes the entire loop containing the allocation, which also initializes the memory, and also the garbage-collection time needed to deallocate the space allocated. (And, in the OCaml case, it includes a recursive loop.)

chrisseaton says that in modern VMs like GraalVM an allocation is about 5 instructions, which I'm guessing is about 3 cycles.

(Incidentally on my laptop with glibc 2.2.5 malloc(16) and free() in a loop only takes about 18 ns, about 50 cycles, 55 million allocations per second. It took a little effort to keep GCC from optimizing away the loop. I wasn't satisfied until I'd single-stepped GDB into __GI___libc_malloc!)

120 cycles is less than 150 cycles. 18 cycles is a lot less than 150 cycles. 9.5 cycles is extremely much smaller than 150 cycles. 3 cycles, which I haven't verified but which sounds plausible, is smaller still.

> A 12 year old CPU can ... add together well over 850 million floats to 850 million other floats in a single second on a single core ...By your own numbers, the allocation alone would take about a minute and a half.

You're completely wrong about "By your own numbers."

It's surely true that you aren't going to get SIMD-like performance by chasing pointers down a singly-linked list of floating-point numbers (is that what you're suggesting?) but not because you can only allocate 10 million list nodes per second; it's because you'll trash your cache with all those worthless cdr pointers, the CPU can't prefetch, and every cache miss costs you a 100-ns stall, which often ties up your entire socket's memory bus, as I specifically pointed out in https://news.ycombinator.com/item?id=26438596. It's true that if you use malloc you'd need a minute or two (and 30 gigabytes!) to allocate 850 million such nodes, or maybe three minutes and 60 gigabytes to allocate two such lists. But if you use the kind of allocator we're talking about in this thread, you can do that much allocation in a few seconds rather than a few minutes, although it's still not a good way to represent your matrices and vectors.
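
As a rough back-of-the-envelope check on that figure (the ~16 bytes of malloc bookkeeping and alignment per node is an assumption, not a measurement):

    #include <stdio.h>

    /* One list node per float: 8 bytes of payload, 8 bytes of cdr pointer. */
    struct node { double car; struct node *cdr; };

    int main(void) {
        double nodes    = 850e6;
        double per_node = sizeof(struct node) + 16;   /* assumed malloc overhead */
        printf("%.0f GB for the list alone\n", nodes * per_node / 1e9);
        /* prints ~27 GB - the same ballpark as the figure above - and every
           cdr you chase is a potential ~100 ns cache-miss stall */
        return 0;
    }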

> Neither of you has confronted this.

Well, no, why would we? It's just a silly error you made. Before you made it there was no way to confront it.

> Saying that lots of small allocations has no impact on speed is counter to basically all optimization advice

Allocations aren't zero-cost, even when multiple allocations in a basic block get combined as chrisseaton says GraalVM does, unlike any of the three systems I presented measurements from; at a minimum, allocating more requires the GC to run more often, and you need to store a pointer somewhere that points at the separate allocation. (Though maybe not for very long.) As I mentioned in my other comment, I've written a popular essay exploring this topic at http://canonical.org/~kragen/memory-models.

But the time required to allocate a few bytes can vary over two or three orders of magnitude, from the 3.4 ns I got with OCaml (or maybe 1 ns if several allocations get combined), up to the 43 ns I got with LuaJIT, up to the 100 ns malloc/free typically take, up to the 200 ns we're suffering in Hammer, up to the maybe 1000 ns you might get out of a bad implementation of malloc/free, up to maybe several microseconds on a PIC16 or something.

If you're trading off a 5-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should probably use the allocation, although questions like locality of reference down the road might tip the balance the other way. If you're trading off a 100-nanosecond small allocation against a 20-nanosecond L2 cache miss, you should take the cache miss and remove the allocation. If you're trading off a 7-nanosecond small allocation in SBCL against triggering a 10000-nanosecond write-barrier by way of SBCL's SIGSEGV handler, you should definitely take the small allocation.

So, whether lots of small allocations speed your code up or slow it down depends a lot on how much small allocations cost, which depends on the system you're running on top of and the tradeoffs it's designed for.

You're familiar with a particular set of architectural tradeoffs which make small allocations pretty expensive. That's fine, and those tradeoffs might even be in some sense the objectively best choice; certainly they're the best for some range of applications. But you're mistakenly assuming that those tradeoffs are universal, despite overwhelming evidence to the contrary, going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.


You get so worked up, but my point with every message is that excessive memory allocation is a performance killer. Part of the reason is cache misses, which you went into here. For all your rabbit holes and straying off topic, it's pretty clear you know a lot more about this than chrisseaton.

Ruby runs 50x to 100x slower than native software, let alone well optimized native software. It is not a context that makes sense when talking about absolute performance.

> despite overwhelming evidence to the contrary,

The overwhelming evidence is that tiny memory allocations will always carry more overhead than operating on the data they contain. Excessive memory allocations imply small memory allocations, and small memory allocations will have overhead that outweighs their contents. This actually is about as universal on modern CPUs as you can get. This is why allocating a few bytes at a time in a loop dominates performance, which is all I've really been saying.

> going so far as to put false words in my mouth in order to defend your point of view. Please stop doing that.

Relax, you might learn something


> my point with every message is that excessive memory allocation is a performance killer

Yes, it was clear that that was what you were saying, despite clear and convincing evidence that it's true in some contexts and false in others.

> For all your rabbit holes and straying off topic, it's pretty clear you know a lot more about this than chriseaton.

chrisseaton knows more about this topic than I ever will. You, by contrast, evidently don't know enough about it to understand why the things we were bringing up were relevant, miscategorizing them as "rabbit holes and straying off topic".

> Ruby runs 50x to 100x slower than native software, let alone well optimized native software. It is not a context that makes sense when talking about absolute performance.

As anyone who follows the links provided can see, some performance-critical benchmark libraries built with chrisseaton's TruffleRuby run about 1.1× slower than alternatives written in native C, although performance on larger programs is, so far, less impressive. Even the mainstream Ruby implementation MRI is typically only about 30× slower. It's true that ten years ago Ruby had performance overhead of 50× to 100×.

> Relax, you might learn something

Oh, I've learned a lot from this thread. But it wasn't by reading the GPT-2-generated Eliza-bot rants posted under the name "CyberDildonics" with no understanding of the issues; it was by reading chrisseaton's dissertation, writing and running microbenchmarks, and carefully disassembling compiler output.


I guess you aren't trolling; you're just confusing the part of programming that you know about with the whole field. But there are more things in heaven and earth, Horatio, than are dreamt of in your philosophy… and your epistemic hygiene is sorely lacking, so maybe "misosophy" is more appropriate.

You said, "If you allocate a a few bytes at a time, it will top out in the ballpark of 10 million per second per core." In my link above, I demonstrated a one-line program which, allocating a few bytes at a time, tops out at 150 million allocations per second per core; if the memory stays in use long enough to survive a minor GC, that drops to 100 million allocations per second per core. It's using the same allocator SBCL uses for everything (except large blocks), and these performance numbers include the time for deallocation. It takes into account all of the things that make memory allocation problematic for performance.† But it does an order of magnitude more allocations per second than you're saying is possible. If you were interested in whether the things you were saying were true or not, you would have tested these claims yourself instead of continuing to post bullshit; SBCL is free software, so they are easy to test.

Even LuaJIT 2.0.5 on this same laptop manages 43 ns per allocation, 23 million per second:

    function nlist(n)
        local rv = nil
        for i = 1, n do
            rv = {i, rv}
        end
        return rv
    end

    function mnlist(m, n)
        print('m='..m..' n='..n)
        for i = 1, m do nlist(n) end
    end

    mnlist(500000, 2000)
SBCL isn't the paragon here. OCaml (3.12.1, ocamlopt -g) manages 3.4 ns per allocation (290 million allocations per second). Compiling this code

    let rec nlist n = if n = 0 then [] else n::nlist (n-1)
    let rec mnlist m n = if m = 0 then [] else (ignore (nlist n); mnlist (m-1) n)
    let m = 2000*1000 and n = 500 ;;

    print_endline ("m=" ^ (string_of_int m) ^ " n=" ^ (string_of_int n)) ;
    mnlist m n
I get a program that executes in 3.4 seconds. The machine code for `nlist` looks like this:

       0x0000000000403530 <+0>:     sub    $0x8,%rsp
       0x0000000000403534 <+4>:     cmp    $0x1,%rax
       0x0000000000403538 <+8>:     jne    0x403548 <camlTimealloc__nlist_1030+24>
       0x000000000040353a <+10>:    mov    $0x1,%rax
       0x0000000000403541 <+17>:    add    $0x8,%rsp
       0x0000000000403545 <+21>:    retq   
       0x0000000000403546 <+22>:    xchg   %ax,%ax
       0x0000000000403548 <+24>:    mov    %rax,(%rsp)
       0x000000000040354c <+28>:    add    $0xfffffffffffffffe,%rax
       0x0000000000403550 <+32>:    callq  0x403530 <camlTimealloc__nlist_1030>
    => 0x0000000000403555 <+37>:    mov    %rax,%rdi
       0x0000000000403558 <+40>:    sub    $0x18,%r15
       0x000000000040355c <+44>:    mov    0x2177fd(%rip),%rax        # 0x61ad60
       0x0000000000403563 <+51>:    cmp    (%rax),%r15
       0x0000000000403566 <+54>:    jb     0x403584 <camlTimealloc__nlist_1030+84>
       0x0000000000403568 <+56>:    lea    0x8(%r15),%rax
       0x000000000040356c <+60>:    movq   $0x800,-0x8(%rax)
       0x0000000000403574 <+68>:    mov    (%rsp),%rbx
       0x0000000000403578 <+72>:    mov    %rbx,(%rax)
       0x000000000040357b <+75>:    mov    %rdi,0x8(%rax)
       0x000000000040357f <+79>:    add    $0x8,%rsp
       0x0000000000403583 <+83>:    retq   
       0x0000000000403584 <+84>:    callq  0x411898 <caml_call_gc>
       0x0000000000403589 <+89>:    jmp    0x403558 <camlTimealloc__nlist_1030+40>
As may be apparent, the open-coded allocation here consists of these five instructions; OCaml's allocation bump pointer moves downward rather than upward, and it tags its cons cells with an 8-byte in-memory 0x800 header word instead of a 7 in the low 4 bits of the pointer:

       0x0000000000403558 <+40>:    sub    $0x18,%r15
       0x000000000040355c <+44>:    mov    0x2177fd(%rip),%rax        # 0x61ad60
       0x0000000000403563 <+51>:    cmp    (%rax),%r15
       0x0000000000403566 <+54>:    jb     0x403584 <camlTimealloc__nlist_1030+84>
       0x0000000000403568 <+56>:    lea    0x8(%r15),%rax
It's true, as you say, that the way it works is similar to "adding [small numbers] to the size of a C++ vector with a large capacity on each iteration of a loop." (It's not "adding 1" because allocations of all sizes are served from the same nursery; interspersing different open-coded allocation sizes affects performance only a little.) But just doing that doesn't save you from writing a garbage collector and implementing write barriers, or alternatively doing MLKit-style static reasoning about lifetimes the way you do to allocate things on a per-frame heap in C++.

It's also true that, as you say, "memory allocation is often huge low hanging fruit for optimization." You aren't going to get this kind of performance out of HotSpot or a malloc implementation, not even mimalloc. So if you're using one of those systems, memory allocation is an order of magnitude more expensive. And, if you're concerned about worst-case performance—latency, rather than throughput, as HFT people and AAA game programmers are—probably no kind of garbage collection is a good idea, and you may even need to avoid allocation altogether, although recent versions of HotSpot make some remarkable claims about worst-case latency, claims which may be true for all I know.

Of course, even when it comes to throughput, there is no free lunch. An allocator design so heavily optimized for fast allocation necessarily makes mutation more expensive—SBCL notoriously uses segfaults for its write barrier so that writes into the nursery are as fast as possible, a cost that would be intolerable for more mutation-oriented languages like Java or C++; and heavy mutation is generally a necessary evil if latency is important. (Also, I think there are algorithms, especially in numerical computation, where the best known mutation-based algorithms have a logarithmic speedup over the best known pure functional algorithms.)

You can find a more detailed discussion of some of the issues in https://archive.fo/itW87 (Martin Cracauer's comparison of implementing memory allocation in LLVM and SBCL, including years of experience running SBCL in production and extensive discussion of the latency–throughput tradeoff I touch on above) and the mimalloc technical report, https://www.microsoft.com/en-us/research/publication/mimallo.... The mimalloc report, among other things explains how they found that, for mimalloc, BBN-LISP-style per-page free-lists (Bobrow & Murphy 1966, AD647601, AFCRL-66-774) were faster than pointer-bumping allocation!

______

† This build of SBCL does have multithreading enabled, and the allocation benchmark takes the same amount of time running in a separate thread, but on my machine it doesn't get a very good speedup if run in multiple threads, presumably due to some kind of allocator contention. (I know that SBCL's GC has to stop all mutator threads during a collection, but the original allocation benchmark here only spends 5% of its time in the GC, so this isn't enough to explain the observed multithreading slowdown, where one thread takes 6.5 ns per allocation, while two threads each take 9.2, three take 17, and four take 29, so although this is a 4-core CPU, at that point multithreading is imposing a slowdown rather than providing a speedup. I suspect that this version of SBCL isn't providing a per-thread nursery the way modern VMs do, so allocation-heavy code like this gets serialized.)


Yes, this has been done a few times. CDR-coding was a hardware-assisted method of unrolling a Lisp list (complicated somewhat by the need to support mutation of car and cdr) that appeared on Lisp machines, and there's an Appel/Reppy/Shao paper on unrolling linked lists in the context of Standard ML.

There's also some interesting work on flattened versions of arbitrary tree structures: https://engineering.purdue.edu/~milind/docs/ecoop17.pdf


You can easily see by searching for 'kmalloc' (or 'malloc') at https://github.com/torvalds/linux/blob/master/include/linux/... that it does no such thing.

Here's the logic for adding a list node:

    /*
     * Insert a new entry between two known consecutive entries.
     *
     * This is only for internal list manipulation where we know
     * the prev/next entries already!
     */
    static inline void __list_add(struct list_head *new,
                                  struct list_head *prev,
                                  struct list_head *next)
    {
            if (!__list_add_valid(new, prev, next))
                    return;

            next->prev = new;
            new->next = next;
            new->prev = prev;
            WRITE_ONCE(prev->next, new);
    }
No allocation, just mutating some fields in preexisting list_head structures. Those are by convention stored as a field in whatever struct needs to be kept in the list, which is what 'intrusive' means.
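
A self-contained sketch of the same idea, with a hypothetical 'request' object standing in for whatever the kernel actually keeps on a list; the list_head here just mimics the kernel's, to show that linking is nothing but pointer writes into memory that already exists:

    #include <stdio.h>

    /* Minimal stand-in for the kernel's struct list_head, only to show what
       "intrusive" means: the links live inside the object itself. */
    struct list_head { struct list_head *next, *prev; };

    static void list_insert(struct list_head *new,
                            struct list_head *prev, struct list_head *next) {
        next->prev = new;
        new->next  = next;
        new->prev  = prev;
        prev->next = new;        /* four pointer writes, no allocation */
    }

    /* Hypothetical object: allocated once (in the kernel, typically via
       kmalloc), after which it can be linked and unlinked freely. */
    struct request {
        int id;
        struct list_head queue;  /* the embedded links - the intrusive part */
    };

    int main(void) {
        struct list_head pending = { &pending, &pending };  /* empty list */
        struct request r = { .id = 42 };
        list_insert(&r.queue, pending.prev, &pending);      /* append */
        printf("queued request %d\n", r.id);
        return 0;
    }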


And where do you think that "new" comes from? I guess my phrasing is poor so I should have been clearer: elements in kernel LLs generally came from a kmalloc, even though the LL functions don't use a malloc themselves.


BOUND is pretty slow, and requires an odd start/end pair to be placed in memory. I don't see any reason that it would be better than the usual unsigned comparison + branch that languages with bounds checking tend to use.

Besides, much of the difficulty with memory safety in C and C++ is the existence of pointers (or wrappers around them, like iterators), which do not come with an associated length. Length checking machinery can't help with that problem whether special instructions exist or not.

In short, it's doubtful that the availability of these old instructions would make any difference to the practicality of bounds checking on modern x86 machines whatsoever.
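
For reference, the usual unsigned comparison + branch looks roughly like this (checked_get is a hypothetical helper, and abort() stands in for whatever the language does on a bounds violation):

    #include <stddef.h>
    #include <stdlib.h>

    /* One unsigned comparison covers both failure modes: a negative index
       converts to a huge size_t and so trips the same >= len check as an
       index past the end. Cost: a cmp and a normally-untaken branch. */
    double checked_get(const double *a, size_t len, ptrdiff_t i) {
        if ((size_t)i >= len)
            abort();
        return a[i];
    }

    int main(void) {
        double xs[3] = { 1.0, 2.0, 3.0 };
        return (int)checked_get(xs, 3, 2);   /* in bounds: returns 3 */
    }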


Modern approaches take advantage of 64-bit addressing modes to handle tagged pointers.

The irony is that we need a Lisp Machines solution to fix C and its derived languages.


That still exists though? add rax, rbx / jo overflow_error is how you do overflow checking on x86-64.


That doesn’t work with a lot of arithmetic, so you have to use slower arithmetic in order to use it.


Sure, but that didn't change much between x86 and x86-64. Perhaps it got a little worse because the SIMD instructions aren't overflow check friendly.
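
For reference, the add/jo idiom mentioned above can be written without inline assembly via the GCC/Clang builtin __builtin_add_overflow, which on x86-64 compiles down to essentially that pair (checked_add is a made-up name for this sketch):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* __builtin_add_overflow returns true if the signed add wrapped; the
       compiler lowers this to an add followed by a jump-on-overflow. */
    int64_t checked_add(int64_t a, int64_t b) {
        int64_t sum;
        if (__builtin_add_overflow(a, b, &sum))
            abort();                         /* the overflow_error path */
        return sum;
    }

    int main(void) {
        printf("%lld\n", (long long)checked_add(1, 2));
        return 0;
    }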


> As far as I understand it, this limitation is also the only thing preventing tail-call elimination in C/C++.

That is not the case. Guaranteed TCE requires deallocating the stack frame before jumping to the target, but that is not possible when an object is allocated in that frame and its address passed to the target function.

In C++ there is also the issue of destructors needing to run after the tail-call returns (in which case it is not really a tail-call).

C/C++ compilers can and do eliminate tail calls where possible, but there's no guarantee like you get from a compiler for a functional language.
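
A tiny made-up example of the first obstacle - the address of a stack-allocated object escaping into the call in tail position:

    #include <stdio.h>

    static void consume(const int *p) {      /* hypothetical callee */
        printf("%d\n", *p);
    }

    /* `local` lives in f's frame and its address is passed to the call in
       tail position, so the frame cannot be torn down before the jump -
       even a guaranteed-TCE compiler could not make this a genuine tail call. */
    void f(int n) {
        int local = n * 2;
        consume(&local);
    }

    int main(void) {
        f(21);
        return 0;
    }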


> that is not possible when an object is allocated in that frame and its address passed to the target function

Ah of course, I hadn't thought about live aliases to stack-allocated data.

You're right about destructors; I agree that they're not really tail-calls (since we're processing results).

> C/C++ compilers can and do eliminate tail calls where possible, but there's no guarantee like you get from a compiler for a functional language.

Yes, I tend to say "tail-call optimisation" when it's opportunistic (e.g. GCC/Clang), and "tail-call elimination" when it's guaranteed (e.g. Scheme/ML).


Interesting if somewhat opaque. I'm familiar with defunctionalisation as an alternative to closure conversion in whole program compilers and as a description of how data types are derived from the lambda calculus - never seen a category theory take on the idea. I don't seem to be able to understand the category theory part though.

The suggestion to use a combination of CPS + defunctionalisation to serialise closures is notable, since that pair of transformations gives a fairly close correspondence between a subset of the lambda calculus (plus some primitives) and abstract machine code. Some of the old-school Scheme compilers used CPS as a low-level IR for that reason.
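
For a concrete feel of defunctionalisation (a small made-up example, not taken from the linked article): CPS-convert factorial, then replace each continuation lambda with a constructor of a first-order data type plus an apply function. The resulting structure is just the call stack made explicit, which is one way to see why the pair of transformations lands so close to abstract machine code.

    #include <stdio.h>
    #include <stdlib.h>

    enum k_tag { K_DONE, K_MUL };            /* one constructor per lambda site */

    struct kont {
        enum k_tag   tag;
        long         n;                      /* free variable captured by K_MUL */
        struct kont *next;                   /* the enclosing continuation */
    };

    /* apply: "call the continuation" becomes a first-order dispatch on the
       tag.  (lambda r. k(n * r)) applied to result ==> multiply, continue. */
    static long apply(struct kont *k, long result) {
        while (k->tag == K_MUL) {
            result *= k->n;
            struct kont *outer = k->next;
            free(k);
            k = outer;
        }
        free(k);                             /* K_DONE: the identity continuation */
        return result;
    }

    /* fact_cps(n, k) = if n == 0 then apply(k, 1) else fact_cps(n - 1, K_MUL(n, k)) */
    static long fact(long n) {
        struct kont *k = malloc(sizeof *k);
        k->tag = K_DONE;
        while (n > 0) {
            struct kont *m = malloc(sizeof *m);
            m->tag = K_MUL; m->n = n; m->next = k;
            k = m;
            n--;
        }
        return apply(k, 1);
    }

    int main(void) {
        printf("%ld\n", fact(10));           /* 3628800 */
        return 0;
    }
Serialising a closure then amounts to serialising one of these plain structs, which is the point of the CPS + defunctionalisation suggestion above.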

