
This article has a mistake. I actually ran the benchmark, and it doesn't return a valid result on arm64 at all. The posted numbers match mine if I run it under Rosetta. Perhaps the author has been running their entire terminal in Rosetta and forgot.

As I write this comment, the article's numbers are: (minify: 4.5 GB/s, validate: 5.4 GB/s). These almost exactly match my numbers under Rosetta (M1 Air, no system load):

    % rm -f benchmark && make && file benchmark && ./benchmark
    c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
    benchmark: Mach-O 64-bit executable arm64
    minify : 1.02483 GB/s
    validate: inf GB/s

    % rm -f benchmark && arch -x86_64 make && file benchmark && ./benchmark
    c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
    benchmark: Mach-O 64-bit executable x86_64
    minify : 4.44489 GB/s
    validate: 5.3981 GB/s
Maybe this article is a testament to Rosetta instead, which is churning out numbers reasonable enough that you don't suspect you're running under an emulator.

Update: I re-ran with the improvements from downthread (credit messe and tedd4u):

    % rm -f benchmark && make && file benchmark && ./benchmark
    c++ -Oz -o benchmark benchmark.cpp simdjson.cpp -std=c++11 -DSIMDJSON_IMPLEMENTATION_ARM64=1
    benchmark: Mach-O 64-bit executable arm64
    minify : 6.7234 GB/s
    validate: 17.7723 GB/s
Note that my version also uses a nanosecond-precision timer, `clock_gettime_nsec_np(CLOCK_UPTIME_RAW)`, because I was trying to debug the earlier broken version.
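
For reference, this is roughly the shape of that timing change (my paraphrase, not the benchmark's actual code; `clock_gettime_nsec_np` is macOS-specific):

    // Rough sketch of an assumed harness, not the actual benchmark code.
    // clock_gettime_nsec_np(CLOCK_UPTIME_RAW) returns monotonic nanoseconds on macOS.
    #include <time.h>
    #include <cstdint>
    #include <cstddef>

    template <typename F>
    double gb_per_second(size_t bytes, F &&work) {
        uint64_t start = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
        work();  // run the code being measured
        uint64_t end = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
        double seconds = double(end - start) * 1e-9;
        // If the elapsed time rounds to zero, this divides by zero and prints "inf GB/s".
        return (double(bytes) / 1e9) / seconds;
    }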

That puts Intel at 1.16x and 1.07x for this specific test, not the 1.8x and 3.5x claimed in the article.

Also, I took a quick glance at the generated NEON for validateUtf8, and it doesn't look very well interleaved for the four execution units. I bet there's still M1 perf left on the table here.




Yep, looks like simdjson is defaulting to a generic fallback implementation. I added the following to the start of main:

    const simdjson::implementation *impl = simdjson::active_implementation;
    std::cout << "simdjson is optimized for " << impl->name() << "(" << impl->description() << ")" << std::endl;
When built for Intel/Rosetta, it prints:

    x86_64% ./benchmark 
    simdjson is optimized for westmere(Intel/AMD SSE4.2)
    minify : 4.44883 GB/s
    validate: 5.39216 GB/s
On arm64:

    arm64% ./benchmark
    simdjson is optimized for fallback(Generic fallback implementation)
    minify : 1.02521 GB/s
    validate: inf GB/s
simdjson's mess of CPP macros isn't properly detecting ARM64. By manually setting -DSIMDJSON_IMPLEMENTATION_ARM64=1 on the command line, I got the following results:

    arm64% c++ -O3 -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
    arm64% ./benchmark 
    simdjson is optimized for arm64(ARM NEON)
    minify : 6.64657 GB/s
    validate: 16.3949 GB/s
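For what it's worth, the reason the command-line define works is that autodetection only kicks in when the macro isn't already set, so an explicit -D always wins. The guard is presumably something along these lines (paraphrased, not the exact simdjson source):

    // Hypothetical shape of the detection guard (not simdjson's actual source).
    // __aarch64__ / _M_ARM64 are the standard compiler-defined arm64 macros.
    #ifndef SIMDJSON_IMPLEMENTATION_ARM64
      #if defined(__aarch64__) || defined(_M_ARM64)
        #define SIMDJSON_IMPLEMENTATION_ARM64 1
      #else
        #define SIMDJSON_IMPLEMENTATION_ARM64 0
      #endif
    #endif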
EDIT: Interestingly, compiling with -Os nets a slight improvement to the validate benchmark:

    arm64% c++ -Os -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
    arm64% ./benchmark
    simdjson is optimized for arm64(ARM NEON)
    minify : 6.649 GB/s
    validate: 17.1456 GB/s


Thanks for getting to the bottom of this.

Looks like -Oz bumps validate up another few percent.

  % c++ -Oz -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
  % ./benchmark
  minify : 6.73381 GB/s
  validate: 17.8548 GB/s


Still a bit slower but much more competitive. Thanks for the additional investigation/validation!


The original article has been updated. The M1 actually turns out to be faster than Intel on validation.

Why? We were not running the same config as the author. You have to supply twitter.json as an argument; otherwise it uses the compiled binary itself (!) as the input, due to an off-by-one error in the argc/argv parsing.
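
For anyone curious, the bug class looks something like this (hypothetical illustration, not the benchmark's actual code):

    // Hypothetical illustration of the off-by-one; argv[0] is the program itself,
    // the first real argument is argv[1].
    #include <cstdio>

    int main(int argc, char **argv) {
        const char *path = (argc > 0) ? argv[0] : "twitter.json";  // bug: benchmarks the binary itself
        // fix: const char *path = (argc > 1) ? argv[1] : "twitter.json";
        std::printf("benchmarking %s\n", path);
        return 0;
    }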


I agree. It looks like I made a mistake.


The same thing happened to me the first time I tried to benchmark my code on M1.

In my case I was building using Bazel. Bazel was running under Rosetta because there's no darwin_arm64 release yet, and I didn't realize the resulting code was also being built for x86-64.

I tried explicitly passing -march but the compiler rejected this, saying it was an unknown architecture. After some experimentation, it appears that when you exec clang from a Rosetta binary it puts it in a mode where it only knows how to build x86.


Both the x86 and arm builds of clang can output code for the other architecture, or a universal binary.

Pass `-arch arm64`, not `-march`.

You can also `clang -arch x86_64 -arch arm64` to build for both at once.

You can even go a step further and run clang natively from Bazel, via `arch -arm64 clang`.

Put it all together and you have: `arch -arm64 clang -arch x86_64 -arch arm64`.

It may seem like I'm joking but I'm not.
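
To make that concrete, here's a sketch of the kind of invocation I mean (standard macOS tooling; exact output may vary by OS version):

    # Build a universal binary, confirm both slices are present, and pick one to run.
    clang++ -O3 -std=c++11 -arch x86_64 -arch arm64 -o benchmark benchmark.cpp simdjson.cpp
    lipo -archs ./benchmark      # should list: x86_64 arm64
    arch -x86_64 ./benchmark     # force the Intel slice (runs under Rosetta 2)
    arch -arm64  ./benchmark     # force the native arm64 slice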


Related:

Why does my native application compiled on Apple Silicon sometimes build as arm64 and sometimes build as x86_64?

Why does my native arm64 application built using an x86_64 build system fail to be code signed unless I remove the previous executable?

[1] https://stackoverflow.com/questions/64830635/why-does-my-nat...

[2] https://stackoverflow.com/questions/64830671/why-does-my-nat...


Thanks for the tips. I'm unable to replicate my previous experiment re: -arch. When I compile a wrapper program that does an exec() of clang, it is able to build arm64 or x86_64 even if the wrapper is built as x86_64.

The "arch" command is handy, thanks for that.


You can build Bazel for the arm64 M1; here is my binary: https://github.com/erwincoumans/bazel/releases/tag/bazel-3.7...


> That puts Intel at 1.16x and 1.07x for this specific test

That's an absolutely amazing result, and it shows how wrong the current information in the article is. I hope the author sees what you did and updates his page as soon as possible.


Yes. I am working on updating it.


Thank you. The whole discussion (starting from your original post) is very useful and interesting.


So the updated values on the original page are now:

> Intel/M1 ratio: 1.2 (minify), 0.9 (validate)

> As you can see, the older Intel processor is slightly superior to the Apple M1 in the minify test.

I'd consider it bigger news that the M1 is 10% faster than Intel in one of the two tests chosen by the author (UTF-8 validation), and only 20% slower in the other (minify), a difference that for most purposes most users won't even notice. It's quite a remarkable result. I'd surely write:

"As you can also see, in the UTF-8 validate test M1 is superior to older Intel processor, and in the minify test only 20% slower, even if Intel uses more power to calculate the result!"

-----

(Additionally, I'll use the opportunity to thank u/bacon_blood again, who verified the initial claims, and u/messe, who figured out what the remaining bug in the author's sources was! Great work!)

(Edit: the 1.16 ratio is from the older native measurement, so I've also made an error in the previous version of this comment! I wrongly connected it with the Rosetta 2-produced code; I've deleted that part of this message. Still, the difference between 1.07 and 0.9 measured on two different setups is interesting, when the other test is close enough.)


Yeah, the tone of the post still reflects the original results, and the update should be at the top for anyone returning to it later.

Still, glad this was caught.


Thanks for a quick fix on a Sunday afternoon!

I’m impressed that the M1 can keep up on this SIMD-optimized code, likely at much lower temperature / power use.

And even the Rosetta numbers are pretty decent.


The post with egregious errors was also put up on a Sunday afternoon. And while we're all acting conciliatory now, it's pretty remarkable how biased the post was: the author used some clearly erroneous numbers to prove their prior, baseless claim that the "M1 chip is far inferior" in some respects, when those respects were specifically SIMD, and then became strangely defensive when some people rightly pointed out that ARM64 has 128-bit NEON and a number of other advantages.

Far inferior becomes... actually superior in many cases, even at SIMD.


Let’s try to be charitable, shall we? Everyone makes mistakes sometimes, even leading experts in low-level algorithm optimization. Lemire was upfront about making a mistake, and not at all defensive about it; if you are reading it that way, it’s just you.

It is clearly the case that the M1 CPU/SoC has a significant performance advantage in typical branchy single-core code, but much less advantage if any for certain kinds of heavily optimized numerics. Beyond that high-level summary, it’s good to dive into the details, and spark discussions.

Everyone is just now getting their hands on these chips, learning how to work with them, and trying to figure out how to best optimize for them.


what does "inf GB/s" mean in this circumstance? I can't figure out exactly how these compare yet; the minify number makes it look like native M1 underperforms Rosetta2.


The benchmark program itself is obviously broken on ARM. Rosetta is JITting ARM behind the scenes, so you could write a program + compiler that emitted the same ARM code Rosetta does; this means it's a problem with the program, not a problem with the M1. I'm not sure what's actually wrong with it yet.

Edit: messe found the issue in a sibling thread.



