This article has a mistake. I actually ran the benchmark, and it doesn't return a valid result on arm64 at all. The posted numbers match mine if I run it under Rosetta. Perhaps the author has been running their entire terminal in Rosetta and forgot.
As I write this comment, the article's numbers are: (minify: 4.5 GB/s, validate: 5.4 GB/s). These almost exactly match my numbers under Rosetta (M1 Air, no system load):
Maybe this article is a testament to Rosetta instead, which is churning out numbers reasonable enough that you don't suspect it's running under an emulator.
Update, I re-ran with the improvements from downthread (credit messe and tedd4u):
Note that my version also uses a nanosecond precision timer `clock_gettime_nsec_np(CLOCK_UPTIME_RAW)` because I was trying to debug the earlier broken version.
That puts Intel at 1.16x and 1.07x for this specific test, not the 1.8x and 3.5x claimed in the article.
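For reference, the timing harness looked roughly like this; everything around the clock call is a made-up sketch, not the benchmark's actual code:

    #include <time.h>
    #include <cstdint>

    // Sketch only. CLOCK_UPTIME_RAW is a monotonic nanosecond clock
    // on macOS; the extra resolution made the bogus readings from the
    // broken run easier to spot.
    static double bench_gbps(void (*work)(), uint64_t bytes) {
        uint64_t t0 = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
        work();  // one minify or validate pass over the input
        uint64_t t1 = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
        return (double)bytes / (double)(t1 - t0);  // bytes per ns == GB/s
    }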
Also I took a quick glance at the generated NEON for validateUtf8 and it doesn't look very well interleaved for four execution units. I bet there's still M1 perf on the table here.
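To illustrate what I mean by interleaving, here's a hypothetical ASCII fast-path check (not simdjson's actual kernel): four independent accumulators give the out-of-order core four dependency chains to spread across its SIMD units instead of one serial chain.

    #include <arm_neon.h>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical example, not simdjson code: OR 64 bytes per
    // iteration into four independent accumulators so the loads and
    // ORs can issue to separate execution units.
    bool is_ascii(const uint8_t* p, size_t n) {
        uint8x16_t a0 = vdupq_n_u8(0), a1 = a0, a2 = a0, a3 = a0;
        size_t i = 0;
        for (; i + 64 <= n; i += 64) {
            a0 = vorrq_u8(a0, vld1q_u8(p + i));
            a1 = vorrq_u8(a1, vld1q_u8(p + i + 16));
            a2 = vorrq_u8(a2, vld1q_u8(p + i + 32));
            a3 = vorrq_u8(a3, vld1q_u8(p + i + 48));
        }
        uint8_t hi = vmaxvq_u8(vorrq_u8(vorrq_u8(a0, a1), vorrq_u8(a2, a3)));
        for (; i < n; ++i) hi |= p[i];  // scalar tail
        return hi < 0x80;  // true iff no byte had the top bit set
    }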
Under Rosetta 2:
x86_64% ./benchmark
simdjson is optimized for westmere(Intel/AMD SSE4.2)
minify : 4.44883 GB/s
validate: 5.39216 GB/s
On arm64:
arm64% ./benchmark
simdjson is optimized for fallback(Generic fallback implementation)
minify : 1.02521 GB/s
validate: inf GB/s
simdjson's mess of CPP macros wasn't properly detecting ARM64. By manually setting -DSIMDJSON_IMPLEMENTATION_ARM64=1 on the command line, I got the following results:
arm64% c++ -O3 -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
arm64% ./benchmark
simdjson is optimized for arm64(ARM NEON)
minify : 6.64657 GB/s
validate: 16.3949 GB/s
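(The define I forced is what the detection is supposed to arrive at on its own; the usual compile-time check is something like the simplified sketch below, though simdjson's real macro soup is more layered.)

    // Simplified sketch, not simdjson's exact macros: clang predefines
    // __aarch64__ when targeting arm64, so a check along these lines
    // must have failed to fire before the manual -D override.
    #if defined(__aarch64__) || defined(_M_ARM64)
      #define SIMDJSON_IMPLEMENTATION_ARM64 1
    #endif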
EDIT: Interestingly, compiling with -Os nets a slight improvement to the validate benchmark:
arm64% c++ -Os -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
arm64% ./benchmark
simdjson is optimized for arm64(ARM NEON)
minify : 6.649 GB/s
validate: 17.1456 GB/s
The original article has been updated. The M1 actually turns out to be faster at validation than Intel.
Why? We were not running the same config as the author. You have to supply twitter.json as an argument; otherwise, due to an off-by-one error in the argc/argv parsing, it uses the compiled binary itself (!) as the input.
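The bug class, as an illustrative sketch (not the benchmark's actual source):

    #include <cstdio>

    int main(int argc, char** argv) {
        // Buggy: argv[0] is the binary's own path, so without an
        // argument the program benchmarks itself:
        //   const char* path = argv[0];
        const char* path = (argc > 1) ? argv[1] : "twitter.json";
        printf("input: %s\n", path);
        return 0;
    }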
The same thing happened to me the first time I tried to benchmark my code on M1.
In my case I was building using Bazel. Bazel was running under Rosetta because they don't release a darwin_arm64 build yet. I didn't realize the resulting code was also built for x86-64.
I tried explicitly passing -march but the compiler rejected this, saying it was an unknown architecture. After some experimentation, it appears that when you exec clang from a Rosetta binary it puts it in a mode where it only knows how to build x86.
Thanks for the tips. I'm unable to replicate my previous experiment re: -arch. When I compile a wrapper program that does an exec() of clang, it is able to build arm64 or x86_64 even if the wrapper is built as x86_64.
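For anyone who wants to repeat it, the wrapper was along these lines (a reconstruction; the file names are placeholders). Build the wrapper itself as x86_64, then check which architectures the clang it execs can emit:

    #include <unistd.h>

    // Reconstruction of the experiment; file names are placeholders.
    // Build with: clang++ -arch x86_64 -o wrapper wrapper.cpp
    int main() {
        char* const args[] = {
            (char*)"clang", (char*)"-arch", (char*)"arm64",
            (char*)"-o", (char*)"hello", (char*)"hello.c", nullptr};
        execvp("clang", args);  // replaces this process with clang
        return 1;               // reached only if exec fails
    }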
> That puts Intel at 1.16x and 1.07x for this specific test
That's an absolutely amazing result, and it shows how wrong the current information in the article is. I hope the author sees what you did and updates his page as soon as possible.
So the updated values on the original page are now:
> Intel/M1 ratio: 1.2 (minify), 0.9 (validate)
> As you can see, the older Intel processor is slightly superior to the Apple M1 in the minify test.
I'd consider it bigger news that the M1 is 10% faster than Intel in one of the two tests chosen by the author (UTF-8 validation), and only 20% slower in the other (minify), which is for most purposes a difference most users won't even be able to notice. It's quite a remarkable result. I'd surely write:
"As you can also see, in the UTF-8 validate test M1 is superior to older Intel processor, and in the minify test only 20% slower, even if Intel uses more power to calculate the result!"
-----
(Additionally, I'll use the opportunity to thank u/bacon_blood again for verifying the initial claims, and u/messe for figuring out what the remaining bug in the author's sources was! Great work!)
(Edit: the 1.16 ratio is from an older native measurement, so I've also made an error in the previous version of this comment! I wrongly connected it with the Rosetta 2-produced code, and I've deleted that part of this message. Still, the difference between 1.07 and 0.9 measured on two different setups is interesting, when the other test comes out this close.)
The post with egregious errors was also put up on a Sunday afternoon. And while we're all acting conciliatory now, it's pretty remarkable how biased the post was: the author used some clearly erroneous numbers to prove their prior, baseless claim that the "M1 chip is far inferior" in some respects, when those respects were specifically SIMD, and then became strangely defensive when some people rightly pointed out that ARM64 has 128-bit NEON and a number of other advantages.
"Far inferior" becomes... actually superior in many cases, even at SIMD.
Let’s try to be charitable, shall we? Everyone makes mistakes sometimes, even leading experts in low-level algorithm optimization. Lemire was upfront about making a mistake, and not at all defensive about it; if you are reading it that way, it’s just you.
It is clearly the case that the M1 CPU/SoC has a significant performance advantage in typical branchy single-core code, but much less advantage if any for certain kinds of heavily optimized numerics. Beyond that high-level summary, it’s good to dive into the details, and spark discussions.
Everyone is just now getting their hands on these chips, learning how to work with them, and trying to figure out how to best optimize for them.
what does "inf GB/s" mean in this circumstance? I can't figure out exactly how these compare yet; the minify number makes it look like native M1 underperforms Rosetta2.
The benchmark program itself is obviously broken on ARM. Rosetta is JITting ARM code behind the scenes, so in principle you could write a program + compiler that emitted the same ARM as Rosetta does; that means it's a problem with the program, not with the M1. I'm not sure what's actually wrong with it yet.
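My guess at the "inf" (an assumption; I haven't checked the benchmark's source): the broken fallback path finishes without doing measurable work, the elapsed time comes out as zero, and IEEE-754 division by zero gives +inf instead of an error.

    #include <cstdio>

    // Assumed mechanism, not verified against the benchmark's source:
    // a zero elapsed time turns bytes/second into +inf under IEEE-754.
    int main() {
        double gigabytes = 0.6;  // size of some input
        double seconds = 0.0;    // degenerate measurement
        printf("validate: %g GB/s\n", gigabytes / seconds);  // prints "inf"
    }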