
Except when loop unrolling makes things faster.


That only tends to be the case for tiny loops, where the loop iterator updates can be folded into addressing displacements, which frees up an execution port needed for something else.
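
A minimal sketch (my own example, not from the parent) of the kind of tiny loop meant here, unrolled by hand in C. The constant offsets can become addressing displacements, so only one index update is issued per four loads:

    /* Sum an array, unrolled by 4.  The +1/+2/+3 offsets can be folded into
       addressing displacements (e.g. [rdi + rcx*4 + 4] on x86-64), so the
       iterator is updated once per four loads, freeing an execution port. */
    #include <stddef.h>

    long sum_unrolled(const int *a, size_t n)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)   /* scalar tail */
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }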

If you don't know what you are doing, unrolling is just as likely to hurt performance, because the loop no longer fits in the uOP cache and you get less decode bandwidth as a result. Or you increase ICache pressure on macro benchmarks and hurt real-world performance. Modern cores are really good at hiding loop accounting overhead.
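
If anything, a hedged sketch of the "let the compiler decide" approach (the loop pragmas are real GCC/Clang ones; the kernel itself is just an example):

    /* Cap or disable unrolling explicitly instead of doing it by hand,
       and let the out-of-order core hide the loop accounting overhead. */
    void scale(float *buf, int n, float g)
    {
    #if defined(__clang__)
    #pragma clang loop unroll_count(4)   /* or unroll(disable) */
    #elif defined(__GNUC__)
    #pragma GCC unroll 4
    #endif
        for (int i = 0; i < n; i++)
            buf[i] *= g;
    }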


In a world where all electronic devices were built from x86 cores and/or had data and instruction caches, that might be true.

But there are plenty of architectures out there where space-time tradeoffs (made by optimizing compilers) are still a thing (my PoV is from the embedded industry).
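
For what it's worth, the usual shape of that tradeoff on a small target, as a sketch (GCC-specific attribute, hypothetical function name): build everything for size and only spend code bytes on proven hot spots.

    /* Whole firmware built with -Os; opt this one hot loop into -O3 with
       unrolling, accepting the flash/ICache cost only where it pays off. */
    __attribute__((optimize("O3", "unroll-loops")))
    void scale_q8(short *buf, int n, short gain)
    {
        for (int i = 0; i < n; i++)
            buf[i] = (short)((buf[i] * gain) >> 8);   /* Q8 fixed-point scale */
    }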



