> in real systems a small simple loop will perform better due to icache pressure...

Gibbon1 · on June 25, 2020

The problem is that on superscalar processors the correspondence between number of instructions and speed breaks down, partly because the processor does its own optimization on the fly and can do multiple things in parallel.

A programmer should be careful about second-guessing the compiler. And a compiler should be careful about second-guessing the processor.

MaxBarraclough · on June 26, 2020

I'm not sure if you're implying this is premature optimisation. It isn't.

It's a performance-sensitive standard-library function, the kind of thing that deserves optimisation in assembly. It's also the kind of problem that can be accelerated with SIMD, but that necessarily means more complex code. That's why the standard library implementations aren't always dead simple.

Here's a pretty in-depth discussion [0]. They discuss CPU throttling, caches, and being memory-bound.

[0] https://news.ycombinator.com/item?id=18260154
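
To make the "simple loop vs. more complex code" trade-off concrete, here is a rough sketch in C. The names `copy_bytes` and `copy_words` are made up for illustration and are not taken from any real libc. It assumes the buffers don't overlap, and the 8-byte version is only a stand-in for what a real SIMD implementation would do with wider registers. Which one is actually faster depends on the compiler, the CPU, and the copy sizes involved, so treat it as an illustration rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: the "small simple loop", one byte per iteration. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical name: moves 8 bytes per iteration, then finishes the
 * remaining 0..7 bytes one at a time. More code, more branches. */
void *copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        /* Fixed-size memcpy sidesteps alignment and aliasing problems;
         * compilers typically turn each call into a single load/store. */
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }

    /* Byte-wise tail for whatever is left. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

Both functions produce the same result. The second one just trades code size (and icache footprint) for fewer loop iterations, which is the tension this thread is about.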

jeffbee · on June 25, 2020

Only personal experience. If you look at the memcpy in LLVM's libc, it was contributed by Googlers who share my experience and perspective.