Your result is hardly surprising as Apple M1 is a different CPU architecture using the ARM instruction set. It is unlikely that the CPU circuitry related to this particular function would be similar to Intel x86. This might not even work with an AMD x86 CPU.
Out of curiosity, can you dump the relevant disassembly of the square_am_signal function (using `$ objdump -d mybinary`)?
Note that this instruction (`_mm_stream_si128`) is an SSE2 instruction that writes 128 bits to memory, more or less bypassing the cache (as long as you write whole cache lines). It is highly specific to the memory architecture of Intel x86 CPUs. When compiling for ARM, I expect that it will just be compiled to a regular store instruction.
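To make the difference concrete, here is a minimal sketch of a streaming fill. The helper name `fill_streaming` and the alignment requirements are my own assumptions, not code from the project. With SSE2 available it uses `_mm_stream_si128` followed by `_mm_sfence`; on any other target it falls back to a plain `memset`, which goes through the cache like any regular store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper (not from the project): fill `len` bytes at `dst`
 * with `value`. Assumes `dst` is 16-byte aligned and `len` is a multiple
 * of 16, because the non-temporal store requires aligned addresses. */
static void fill_streaming(uint8_t *dst, size_t len, uint8_t value)
{
    assert(((uintptr_t)dst % 16) == 0 && (len % 16) == 0);
#if defined(__SSE2__)
    /* x86 path: non-temporal 128-bit stores. */
    const __m128i v = _mm_set1_epi8((char)value);
    for (size_t i = 0; i < len; i += 16) {
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#else
    /* Non-x86 path (e.g. ARM): ordinary cached stores. */
    memset(dst, value, len);
#endif
}
```

Both paths leave the same bytes in the buffer; only the way the writes reach DRAM differs, which is exactly the part that matters here.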
Thanks for your interest. If you have any advice on other instructions or M1 optimizations, I'd love to hear.
My first thought is to synchronize the effort of the 8 CPU cores and maybe even the 8 GPU cores. That's +12 dB right there. We have a multithreaded implementation in the project already.
If you want to achieve something similar to `_mm_stream_xxx`, i.e. bypassing the caches and causing bursts of DRAM traffic, try creating some uncached or write-combined memory mappings and writing to them. I don't know how this can be done in user space. You could try creating memory-mapped buffers with OpenGL or Metal; with certain arguments you might get an uncached mapping.
Another option is looking at ARM instructions for memory barriers and cache flushes. ARM's selection of instructions for dealing with caches is much richer than x86's.
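A rough sketch of what that could look like, untested on an M1. The helper name `store_and_flush` is hypothetical. I am assuming `sys_dcache_flush` from `<libkern/OSCacheControl.h>` on macOS, and that `dc civac` is permitted from user space on other AArch64 systems (the OS decides that, so it may trap). The x86 branch is only there for comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__APPLE__) && defined(__aarch64__)
#include <libkern/OSCacheControl.h> /* sys_dcache_flush */
#elif defined(__SSE2__)
#include <emmintrin.h>              /* _mm_clflush, _mm_mfence */
#endif

/* Hypothetical helper: store one byte, then ask for its cache line to be
 * written back. Returns 1 if an explicit flush was issued, 0 if only a
 * compiler barrier was used. */
static int store_and_flush(volatile uint8_t *p, uint8_t value)
{
    assert(p != NULL);
    *p = value;
#if defined(__APPLE__) && defined(__aarch64__)
    /* macOS on Apple silicon: use the libkern cache-control call. */
    sys_dcache_flush((void *)p, 1);
    __asm__ volatile("dsb sy" ::: "memory");
    return 1;
#elif defined(__aarch64__)
    /* Other AArch64: clean and invalidate by address (assumes the OS
     * allows this instruction from user space), then a full barrier. */
    __asm__ volatile("dc civac, %0" : : "r"(p) : "memory");
    __asm__ volatile("dsb sy" ::: "memory");
    return 1;
#elif defined(__SSE2__)
    /* x86: flush the cache line, then fence. */
    _mm_clflush((const void *)p);
    _mm_mfence();
    return 1;
#else
    /* Unknown target: compiler barrier only. */
    __asm__ volatile("" ::: "memory");
    return 0;
#endif
}
```

Calling this in a tight loop over a buffer should force repeated traffic to DRAM instead of hits in the cache, but whether it produces a usable signal on the M1 is something you would have to measure.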