If you have a lot of "data plane" code or other looping over data, you can see a...

atiedebee · 2026-04-10T15:00:54 1775833254

I once wrote a program with some inline assembly and tried compiling it with clang at -O3. Because the assembly was in a loop, the compiler probably didn't have enough information about the actual code and decided to fully unroll all 16 iterations, which made performance drop by 25% because cache locality was completely destroyed. What I'm trying to say is that loop unrolling is definitely not a guaranteed trade of binary size for faster code.
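For anyone hitting the same thing, here is a minimal sketch of asking the compiler to leave such a loop rolled. It assumes GCC or Clang. `NO_UNROLL`, `asm_step` and `mix16` are made-up names for illustration, and the empty asm statement is only a stand-in for a real block of assembly, not the code from my program. Whether it actually helps is something you'd have to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macro: request that the next loop is not unrolled.
 * Clang has "#pragma clang loop unroll(disable)"; GCC (8 and later) has
 * "#pragma GCC unroll 1". Other compilers get nothing. */
#if defined(__clang__)
#  define NO_UNROLL _Pragma("clang loop unroll(disable)")
#elif defined(__GNUC__)
#  define NO_UNROLL _Pragma("GCC unroll 1")
#else
#  define NO_UNROLL
#endif

/* Stand-in for a large inline-asm block (assumption: GCC/Clang extended asm).
 * The asm body is empty; the "+r" constraint only tells the compiler that
 * the value is read and modified in a register it cannot reason about. */
static inline uint32_t asm_step(uint32_t x)
{
    __asm__ volatile("" : "+r"(x));
    return x * 2654435761u + 1u;
}

/* Runs the step 16 times, matching the 16-iteration loop described above. */
uint32_t mix16(uint32_t seed)
{
    uint32_t acc = seed;

    NO_UNROLL
    for (int i = 0; i < 16; i++) {
        acc = asm_step(acc);
    }
    return acc;
}
```

The pragma only changes code generation, not the result, so comparing the disassembly and timings with and without `NO_UNROLL` is the way to check if unrolling is what hurts.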

pclmulqdq · 2026-04-10T23:20:41 1775863241

Large blocks of inline assembly also destroy -O3. The compiler treats the asm statement as essentially empty and makes its decisions around it. Most inline asm is a single instruction, so this is usually safe.