I think this benchmark deserves a little drill-down on how the ARM vs. Intel compilers implement their SIMD output. If the M1 lacks 256-bit SIMD, what exactly is being measured here?
What's being measured is how a SIMD-optimised routine performs on AVX versus NEON, on the assumption that most of the difference comes down to register width: 256-bit (AVX) versus 128-bit (NEON). In a previous post[0], lemire confirmed that NEON was competitive with SSE (which is also 128-bit) when comparing older µarchs (Intel's Skylake versus Apple's A12).
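To make the width difference concrete, here is a rough sketch (not the code from the benchmark; `sum_u32` is a made-up name for illustration) of the same reduction written against each instruction set. The AVX2 path handles eight 32-bit lanes per add, the NEON path four, so for the same input the NEON loop runs twice as many iterations. It assumes the compiler defines `__AVX2__` (e.g. built with `-mavx2`) or `__aarch64__`, and falls back to scalar code otherwise.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical example: wrapping sum of n 32-bit integers. */
uint32_t sum_u32(const uint32_t *p, size_t n) {
    size_t i = 0;
    uint32_t total = 0;
#if defined(__AVX2__)
    /* 256-bit registers: 8 lanes of 32 bits per iteration. */
    __m256i acc = _mm256_setzero_si256();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i *)(p + i)));
    uint32_t tmp[8];
    _mm256_storeu_si256((__m256i *)tmp, acc);
    for (int k = 0; k < 8; k++)
        total += tmp[k];
#elif defined(__aarch64__)
    /* 128-bit registers: 4 lanes of 32 bits per iteration. */
    uint32x4_t acc = vdupq_n_u32(0);
    for (; i + 4 <= n; i += 4)
        acc = vaddq_u32(acc, vld1q_u32(p + i));
    total = vaddvq_u32(acc); /* horizontal add across the 4 lanes */
#endif
    /* Scalar tail (and the whole loop when no SIMD path is compiled in). */
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

Iteration count alone doesn't settle the benchmark, of course: how many of those adds and loads each core can issue per cycle matters just as much, which is the kind of thing the SSE vs. NEON comparison was probing.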