This article has a mistake. I actually ran the benchmark, and it doesn't return a valid result on arm64 at all. The posted numbers match mine if I run it under Rosetta. Perhaps the author has been running their entire terminal in Rosetta and forgot.
As I write this comment, the article's numbers are: (minify: 4.5 GB/s, validate: 5.4 GB/s). These almost exactly match my numbers under Rosetta (M1 Air, no system load):
Maybe this article is a testament to Rosetta instead, which is churning out numbers reasonable enough that you don't suspect it's running under an emulator.
Update, I re-ran with the improvements from downthread (credit messe and tedd4u):
Note that my version also uses a nanosecond precision timer `clock_gettime_nsec_np(CLOCK_UPTIME_RAW)` because I was trying to debug the earlier broken version.
That puts Intel at 1.16x and 1.07x for this specific test, not the 1.8x and 3.5x claimed in the article.
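For reference, the timing harness looked roughly like this; everything around the clock call is a made-up sketch, not the benchmark's actual code:

    #include <time.h>
    #include <cstdint>

    // Sketch only. CLOCK_UPTIME_RAW is a monotonic nanosecond clock
    // on macOS; the extra resolution made the bogus readings from the
    // broken run easier to spot.
    static double bench_gbps(void (*work)(), uint64_t bytes) {
        uint64_t t0 = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
        work();  // one minify or validate pass over the input
        uint64_t t1 = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
        return (double)bytes / (double)(t1 - t0);  // bytes per ns == GB/s
    }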
Also I took a quick glance at the generated NEON for validateUtf8 and it doesn't look very well interleaved for four execution units. I bet there's still M1 perf on the table here.
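To illustrate what I mean by interleaving, here's a hypothetical ASCII fast-path check (not simdjson's actual kernel): four independent accumulators give the out-of-order core four dependency chains to spread across its SIMD units instead of one serial chain.

    #include <arm_neon.h>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical example, not simdjson code: OR 64 bytes per
    // iteration into four independent accumulators so the loads and
    // ORs can issue to separate execution units.
    bool is_ascii(const uint8_t* p, size_t n) {
        uint8x16_t a0 = vdupq_n_u8(0), a1 = a0, a2 = a0, a3 = a0;
        size_t i = 0;
        for (; i + 64 <= n; i += 64) {
            a0 = vorrq_u8(a0, vld1q_u8(p + i));
            a1 = vorrq_u8(a1, vld1q_u8(p + i + 16));
            a2 = vorrq_u8(a2, vld1q_u8(p + i + 32));
            a3 = vorrq_u8(a3, vld1q_u8(p + i + 48));
        }
        uint8_t hi = vmaxvq_u8(vorrq_u8(vorrq_u8(a0, a1), vorrq_u8(a2, a3)));
        for (; i < n; ++i) hi |= p[i];  // scalar tail
        return hi < 0x80;  // true iff no byte had the top bit set
    }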
Under Rosetta 2:
x86_64% ./benchmark
simdjson is optimized for westmere(Intel/AMD SSE4.2)
minify : 4.44883 GB/s
validate: 5.39216 GB/s
On arm64:
arm64% ./benchmark
simdjson is optimized for fallback(Generic fallback implementation)
minify : 1.02521 GB/s
validate: inf GB/s
simdjson's mess of CPP macros wasn't properly detecting ARM64. By manually setting -DSIMDJSON_IMPLEMENTATION_ARM64=1 on the command line, I got the following results:
arm64% c++ -O3 -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
arm64% ./benchmark
simdjson is optimized for arm64(ARM NEON)
minify : 6.64657 GB/s
validate: 16.3949 GB/s
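(The define I forced is what the detection is supposed to arrive at on its own; the usual compile-time check is something like the simplified sketch below, though simdjson's real macro soup is more layered.)

    // Simplified sketch, not simdjson's exact macros: clang predefines
    // __aarch64__ when targeting arm64, so a check along these lines
    // must have failed to fire before the manual -D override.
    #if defined(__aarch64__) || defined(_M_ARM64)
      #define SIMDJSON_IMPLEMENTATION_ARM64 1
    #endif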
EDIT: Interestingly, compiling with -Os nets a slight improvement to the validate benchmark:
arm64% c++ -Os -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
arm64% ./benchmark
simdjson is optimized for arm64(ARM NEON)
minify : 6.649 GB/s
validate: 17.1456 GB/s
The original article has been updated. The M1 actually turns out to be faster at validation than Intel.
Why? We were not running the same config as the author. You have to supply twitter.json as an argument; otherwise, due to an off-by-one error in the argc/argv parsing, it uses the compiled binary itself (!) as the input.
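The bug class, as an illustrative sketch (not the benchmark's actual source):

    #include <cstdio>

    int main(int argc, char** argv) {
        // Buggy: argv[0] is the binary's own path, so without an
        // argument the program benchmarks itself:
        //   const char* path = argv[0];
        const char* path = (argc > 1) ? argv[1] : "twitter.json";
        printf("input: %s\n", path);
        return 0;
    }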
The same thing happened to me the first time I tried to benchmark my code on M1.
In my case I was building using Bazel. Bazel was running under Rosetta because they don't release a darwin_arm64 build yet. I didn't realize the resulting code was also built for x86-64.
I tried explicitly passing -march but the compiler rejected this, saying it was an unknown architecture. After some experimentation, it appears that when you exec clang from a Rosetta binary it puts it in a mode where it only knows how to build x86.
Thanks for the tips. I'm unable to replicate my previous experiment re: -arch. When I compile a wrapper program that does an exec() of clang, it is able to build arm64 or x86_64 even if the wrapper is built as x86_64.
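For anyone who wants to repeat it, the wrapper was along these lines (a reconstruction; the file names are placeholders). Build the wrapper itself as x86_64, then check which architectures the clang it execs can emit:

    #include <unistd.h>

    // Reconstruction of the experiment; file names are placeholders.
    // Build with: clang++ -arch x86_64 -o wrapper wrapper.cpp
    int main() {
        char* const args[] = {
            (char*)"clang", (char*)"-arch", (char*)"arm64",
            (char*)"-o", (char*)"hello", (char*)"hello.c", nullptr};
        execvp("clang", args);  // replaces this process with clang
        return 1;               // reached only if exec fails
    }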
> That puts Intel at 1.16x and 1.07x for this specific test
That's an absolutely amazing result, and it shows how wrong the current information in the article is. I hope the author sees what you did and updates his page as soon as possible.
So the updated values on the original page are now:
> Intel/M1 ratio: 1.2 (minify), 0.9 (validate)
> As you can see, the older Intel processor is slightly superior to the Apple M1 in the minify test.
I'd consider it bigger news that the M1 is 10% faster than Intel in one of the two tests chosen by the author (UTF-8 validation), and only 20% slower in the other (minify), which is for most purposes a difference most users won't even be able to notice. It's quite a remarkable result. I'd surely write:
"As you can also see, in the UTF-8 validate test M1 is superior to older Intel processor, and in the minify test only 20% slower, even if Intel uses more power to calculate the result!"
-----
(Additionally, I'll use the opportunity to thank u/bacon_blood again for verifying the initial claims, and u/messe for figuring out what the remaining bug in the author's sources was! Great work!)
(Edit: the 1.16 ratio is from an older native measurement, so I've also made an error in the previous version of this comment! I wrongly connected it with the Rosetta 2-produced code, and I've deleted that part of this message. Still, the difference between 1.07 and 0.9 measured on two different setups is interesting, when the other test comes out this close.)
The post with egregious errors was also put up on a Sunday afternoon. And while we're all acting conciliatory now, it's pretty remarkable how biased the post was: the author used some clearly erroneous numbers to prove their prior, baseless claim that the "M1 chip is far inferior" in some respects, when those respects were specifically SIMD, and then became strangely defensive when some people rightly pointed out that ARM64 has 128-bit NEON and a number of other advantages.
"Far inferior" becomes... actually superior in many cases, even at SIMD.
Let’s try to be charitable, shall we? Everyone makes mistakes sometimes, even leading experts in low-level algorithm optimization. Lemire was upfront about making a mistake, and not at all defensive about it; if you are reading it that way, it’s just you.
It is clearly the case that the M1 CPU/SoC has a significant performance advantage in typical branchy single-core code, but much less advantage if any for certain kinds of heavily optimized numerics. Beyond that high-level summary, it’s good to dive into the details, and spark discussions.
Everyone is just now getting their hands on these chips, learning how to work with them, and trying to figure out how to best optimize for them.
what does "inf GB/s" mean in this circumstance? I can't figure out exactly how these compare yet; the minify number makes it look like native M1 underperforms Rosetta2.
The benchmark program itself is obviously broken on ARM. Rosetta is JITting ARM code behind the scenes, so in principle you could write a program + compiler that emitted the same ARM as Rosetta does; that means it's a problem with the program, not with the M1. I'm not sure what's actually wrong with it yet.
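My guess at the "inf" (an assumption; I haven't checked the benchmark's source): the broken fallback path finishes without doing measurable work, the elapsed time comes out as zero, and IEEE-754 division by zero gives +inf instead of an error.

    #include <cstdio>

    // Assumed mechanism, not verified against the benchmark's source:
    // a zero elapsed time turns bytes/second into +inf under IEEE-754.
    int main() {
        double gigabytes = 0.6;  // size of some input
        double seconds = 0.0;    // degenerate measurement
        printf("validate: %g GB/s\n", gigabytes / seconds);  // prints "inf"
    }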