
For this case, the right x86 instruction would be "rep stosX" (just write), not "rep movsX" (copy). That aside: on x86 processors, "rep movsX" was the fastest option from the 8086 through the 80386, but since the 80486 it has been faster to unroll the loop in assembly to reduce the jump penalty ("rep movsX" does no unrolling, so it effectively pays a per-iteration cost, like a non-unrolled loop). On CPUs with prefetch instruction support (e.g. SSE2-class processors), you can still speed it up a bit more by fetching future cache lines before they are needed, avoiding pipeline stalls.
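A minimal sketch of the two approaches, assuming x86-64 and GCC/Clang inline-asm syntax (the function names and the 4x unroll factor are just illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Fill via "rep stosb": stores AL at [RDI], RCX times.
       No explicit loop, so no conditional jump in your code. */
    static void fill_rep_stosb(void *dst, uint8_t val, size_t n)
    {
        __asm__ volatile ("rep stosb"
                          : "+D" (dst), "+c" (n)   /* RDI, RCX are updated */
                          : "a" (val)              /* value in AL */
                          : "memory");
    }

    /* Fill via a 4x-unrolled loop: one conditional jump per
       four stores instead of one per store. */
    static void fill_unrolled(uint32_t *dst, uint32_t val, size_t words)
    {
        size_t i = 0;
        for (; i + 4 <= words; i += 4) {
            dst[i]     = val;
            dst[i + 1] = val;
            dst[i + 2] = val;
            dst[i + 3] = val;
        }
        for (; i < words; i++)   /* remainder tail */
            dst[i] = val;
    }

Which one wins depends heavily on the microarchitecture, so it is worth measuring both.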

The example shown, which just writes to RAM, is probably fast enough as-is (read cases are harder to optimize), although a bit of unrolling could reduce the jump-penalty impact, especially on non-superscalar or in-order superscalar CPUs, such as the ARM cores typically found in handheld devices.
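To make the prefetch idea concrete, here is a minimal sketch of a copy loop (the harder read case) using SSE's _mm_prefetch intrinsic; the function name and the 256-byte lookahead distance are illustrative guesses that would need tuning per CPU:

    #include <stddef.h>
    #include <stdint.h>
    #include <xmmintrin.h>   /* _mm_prefetch (SSE) */

    #define AHEAD 256   /* lookahead in bytes, ~4 cache lines; tune per CPU */

    static void copy_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 64) {   /* one 64-byte line per step */
            if (i + AHEAD < n)
                _mm_prefetch((const char *)(src + i + AHEAD), _MM_HINT_T0);
            size_t end = (i + 64 < n) ? i + 64 : n;
            for (size_t j = i; j < end; j++)
                dst[j] = src[j];
        }
    }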

While we're jumping down the architecture-pedant rabbit hole: a simple loop like that will be trivially predicted, so the branch is basically free. In addition, hardware prefetchers do a much better job of predicting linear memory access than manual prefetch instructions do. On Core 2, IIRC, if you take 2 or 3 L2 misses at fixed offsets from each other (in either direction), the hardware automatically begins prefetching so the data is there when you need it. The problem with manual prefetch instructions is their high latency. They're best for hinting to the processor that you're about to make an unpredictable load.
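A sketch of the kind of unpredictable load where a manual prefetch can actually pay off: pointer chasing, which stride-based hardware prefetchers can't predict. The struct and function here are hypothetical, and the benefit depends on having enough work per node to hide the latency.

    #include <xmmintrin.h>   /* _mm_prefetch (SSE) */

    struct node {
        struct node *next;
        long payload;
    };

    /* Issue the prefetch for the next node as soon as its address
       is known, overlapping the miss with work on the current node. */
    static long sum_list(const struct node *n)
    {
        long total = 0;
        while (n) {
            if (n->next)
                _mm_prefetch((const char *)n->next, _MM_HINT_T0);
            total += n->payload;   /* work overlapped with the prefetch */
            n = n->next;
        }
        return total;
    }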
