Your result is hardly surprising as Apple M1 is a different CPU architecture usi...

fulldecent2 · on Dec 17, 2020

Thanks for your interest. If you have any advice on other instructions or M1 optimizations, I'd love to hear it.

My first thought is to coordinate the effort of the 8 CPU cores and maybe even the 8 GPU cores. That's +12 dB right there (16x in total). We already have a multithreaded implementation in the project.

exDM69 · on Dec 18, 2020

No, sorry, I don't know anything about the M1.

If you want to achieve something similar to _mm_stream_xxx, i.e. bypassing the caches and causing bursts of DRAM traffic, try creating uncached or write-combined memory mappings and writing to them. I don't know how this can be done from user space. You could try creating memory-mapped buffers with OpenGL or Metal; with certain arguments you might get an uncached mapping.

Another option is looking at ARM instructions for memory barriers and cache flushes. ARM's selection of instructions for dealing with caches is much richer than x86's.
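To make that concrete, here is a rough sketch of what those ARM instructions could look like from C with GCC/Clang inline assembly. The helper names `full_barrier` and `flush_range` are made up for illustration, and the cache line size is passed in by the caller rather than detected. Whether `dc civac` is actually permitted from user space depends on the OS, and I have not checked this on the M1, so treat that part as an assumption. On non-AArch64 targets the flush is a no-op so the file still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full data memory barrier (hypothetical helper name). */
static inline void full_barrier(void) {
#if defined(__aarch64__)
    __asm__ volatile("dmb ish" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* GCC/Clang builtin fallback */
#endif
}

/* Clean and invalidate every cache line overlapping [p, p + len).
 * `line` is the cache line size in bytes and must be a power of two.
 * Assumption: the OS allows `dc civac` at user level. */
static inline void flush_range(const void *p, size_t len, size_t line) {
    assert(line != 0 && (line & (line - 1)) == 0);
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(line - 1);
    uintptr_t end = (uintptr_t)p + len;
    for (; addr < end; addr += line) {
#if defined(__aarch64__)
        __asm__ volatile("dc civac, %0" :: "r"(addr) : "memory");
#else
        (void)addr;  /* no-op on other architectures */
#endif
    }
#if defined(__aarch64__)
    /* Wait for the cache maintenance operations to complete. */
    __asm__ volatile("dsb ish" ::: "memory");
#endif
}
```

Something like `flush_range(buf, size, 64)` followed by a read of `buf` would then force the read to miss in the cache, assuming the line size matches the hardware.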

soheil · on Dec 17, 2020

I actually compiled it for x86_64 with Rosetta turned on. Let me know if you still wanna see the objdump.

exDM69 · on Dec 17, 2020

If you compile it to x86_64, that _mm intrinsic will turn into a single instruction. No need to show the disassembly.
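For anyone following along, a minimal sketch of what such a streaming store looks like in C. This is not the project's code; `stream_fill` is a made-up helper. On x86_64, each `_mm_stream_si128` call typically compiles to a single `movntdq`. On any other target the sketch falls back to a plain `memset`, which does not bypass the caches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Fill n bytes at dst with `value` (hypothetical helper).
 * dst must be 16-byte aligned and n a multiple of 16. */
static void stream_fill(void *dst, uint8_t value, size_t n) {
    assert(((uintptr_t)dst & 15) == 0 && (n & 15) == 0);
#if defined(__x86_64__)
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i++) {
        _mm_stream_si128(&p[i], v);  /* non-temporal store, bypasses the cache */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
#else
    memset(dst, value, n);  /* ordinary cached stores on other targets */
#endif
}
```

Compiling this for x86_64 and running `objdump -d` on the result should show the `movntdq` in the loop body.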

Thanks for the info.