
For this case, the right x86 instruction would be "rep stosX" (just write), not "rep movsX" (copy). That aside: on x86 processors, "rep movsX" was the fastest option from the 8086 through the 80386, but since the 80486 it has been faster to unroll the loop in assembly to reduce the jump penalty ("rep movsX" does no unrolling, so it effectively pays a per-iteration cost, like a non-unrolled loop). On CPUs with prefetch instruction support (e.g. SSE2-class processors), you can still speed it up a bit more by fetching future cache lines before they are needed, avoiding pipeline stalls.
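A minimal sketch of the two approaches, assuming x86-64 and GCC/Clang inline-asm syntax (the function names and the 4x unroll factor are just illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Fill via "rep stosb": stores AL at [RDI], RCX times.
       No explicit loop, so no conditional jump in your code. */
    static void fill_rep_stosb(void *dst, uint8_t val, size_t n)
    {
        __asm__ volatile ("rep stosb"
                          : "+D" (dst), "+c" (n)   /* RDI, RCX are updated */
                          : "a" (val)              /* value in AL */
                          : "memory");
    }

    /* Fill via a 4x-unrolled loop: one conditional jump per
       four stores instead of one per store. */
    static void fill_unrolled(uint32_t *dst, uint32_t val, size_t words)
    {
        size_t i = 0;
        for (; i + 4 <= words; i += 4) {
            dst[i]     = val;
            dst[i + 1] = val;
            dst[i + 2] = val;
            dst[i + 3] = val;
        }
        for (; i < words; i++)   /* remainder tail */
            dst[i] = val;
    }

Which one wins depends heavily on the microarchitecture, so it is worth measuring both.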

The example shown, which just writes to RAM, is probably fast enough as-is (read cases are harder to optimize), although a bit of unrolling could reduce the jump-penalty impact, especially on non-superscalar or in-order superscalar CPUs, such as the ARM cores typically found in handheld devices.
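To make the prefetch idea concrete, here is a minimal sketch of a copy loop (the harder read case) using SSE's _mm_prefetch intrinsic; the function name and the 256-byte lookahead distance are illustrative guesses that would need tuning per CPU:

    #include <stddef.h>
    #include <stdint.h>
    #include <xmmintrin.h>   /* _mm_prefetch (SSE) */

    #define AHEAD 256   /* lookahead in bytes, ~4 cache lines; tune per CPU */

    static void copy_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 64) {   /* one 64-byte line per step */
            if (i + AHEAD < n)
                _mm_prefetch((const char *)(src + i + AHEAD), _MM_HINT_T0);
            size_t end = (i + 64 < n) ? i + 64 : n;
            for (size_t j = i; j < end; j++)
                dst[j] = src[j];
        }
    }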

While we're jumping down the architecture-pedant rabbit hole: a simple loop like that will be trivially predicted, so the branch is basically free. In addition, hardware prefetchers do a much better job of predicting linear memory access than manual prefetch instructions do. On Core 2, IIRC, if you take 2 or 3 L2 misses at fixed offsets from each other (in either direction), the hardware automatically begins prefetching so the data is there when you need it. The problem with manual prefetch instructions is their high latency. They're best for hinting to the processor that you're about to make an unpredictable load.
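A sketch of the kind of unpredictable load where a manual prefetch can actually pay off: pointer chasing, which stride-based hardware prefetchers can't predict. The struct and function here are hypothetical, and the benefit depends on having enough work per node to hide the latency.

    #include <xmmintrin.h>   /* _mm_prefetch (SSE) */

    struct node {
        struct node *next;
        long payload;
    };

    /* Issue the prefetch for the next node as soon as its address
       is known, overlapping the miss with work on the current node. */
    static long sum_list(const struct node *n)
    {
        long total = 0;
        while (n) {
            if (n->next)
                _mm_prefetch((const char *)n->next, _MM_HINT_T0);
            total += n->payload;   /* work overlapped with the prefetch */
            n = n->next;
        }
        return total;
    }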
