> 1. ARM is inherently more efficient than x86 CPUs in most tasks
> 2. Nuvia and Apple are better CPU designers than AMD and Intel
The third possibility is that they just pick a different point on the efficiency curve. You can double power consumption in exchange for a few percent higher performance, double it again for an even smaller increase.
The max turbo on the i9 14900KS is 253 W. The power efficiency is bad. But it generally outperforms the M3, despite being on a significantly worse process node, because that's the trade off.
AMD is only on a slightly worse process node and doesn't have to do anything so aggressive, but they'll also sell you whatever you want. The 8845HS and 8840U are basically the same chip, but the former has around double the TDP. In exchange for that you get ~2% more single-thread performance and ~15% more multi-thread performance. Meanwhile the performance per watt for the 8840U is nearly that of the M3, and the remaining difference is basically the process node.
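To put rough numbers on that, here's a trivial sketch; the scores are made up and only scaled to the ~2%/~15% figures above, and the wattages are just nominal TDPs, so treat it as illustrative:

```c
/* Hypothetical scores and TDPs, purely to illustrate how little extra
 * performance the extra power buys. Not measured data. */
#include <stdio.h>

int main(void) {
    double tdp_u = 28.0, tdp_hs = 54.0;   /* nominal package power in watts */
    double st_u = 100.0, st_hs = 102.0;   /* ~2% higher single-thread score  */
    double mt_u = 100.0, mt_hs = 115.0;   /* ~15% higher multi-thread score  */

    printf("ST perf/W: 8840U %.2f vs 8845HS %.2f\n", st_u / tdp_u, st_hs / tdp_hs);
    printf("MT perf/W: 8840U %.2f vs 8845HS %.2f\n", mt_u / tdp_u, mt_hs / tdp_hs);
    return 0;
}
```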
>The third possibility is that they just pick a different point on the efficiency curve. You can double power consumption in exchange for a few percent higher performance, double it again for an even smaller increase.
This only makes sense if the Zen5 is actually faster in ST than the M3. In this case, the M3 is 1.24x faster and 3.4x more efficient in ST than Zen5.
AMD's Zen5 chip is just straight up slower anywhere on the curve.
>The max turbo on the i9 14900KS is 253 W. The power efficiency is bad. But it generally outperforms the M3, despite being on a significantly worse process node, because that's the trade off.
It's not a trade off that Intel wants. The 14900KS runs at 253 W (sometimes 400 W+) because that's the only way Intel is able to stay remotely competitive at the very high end. An M3 Max will often match a 14900KS in performance using 5-10% of the power.
> This only makes sense if the Zen5 is actually faster in ST than the M3. In this case, the M3 is 1.24x faster and 3.4x more efficient in ST than Zen5.
It makes sense if Zen5 is faster in MT, since that's when the CPUs will be power limited, and it is. For ST the performance generally isn't power-limited for either of them and then the M3 is on a newer process node.
It also depends on the benchmark. For example, Zen5 is faster in ST on Cinebench R23. It's not obvious what's going on with R24, but it's a difference in the code rather than the hardware.
The power numbers in that link also don't inspire a lot of confidence. They have two systems with the same CPU but one of them uses 119.3W and the other one uses 46.7W? Plausibly the OEMs could have configured them differently but that kind of throws out the entire premise of using the comparison to measure the efficiency of the CPUs. The number doesn't mean anything if the power consumption is being set as a configuration parameter by the OEM and the number of watts going to the display or a discrete GPU are an uncontrolled hidden variable.
> It's not a trade off that Intel wants.
It's the one they've always taken, even when they were unquestionably in the lead. They were selling up to 150W desktop processors in 2008, because people buy them, because they're faster.
Now they have to do it just to be competitive because their process isn't as good, but the process is a different thing than the ISA or the design of the CPU.
>It also depends on the benchmark. For example, Zen5 is faster in ST on Cinebench R23. It's not obvious what's going on with R24, but it's a difference in the code rather than the hardware.
Cinebench R23 uses Intel Embree underneath. It's hand-optimized for the AVX instruction set and poorly translated to NEON. It's not even clear whether it has any NEON optimization at all.
R23 doesn’t have the same SIMD optimizations available for ARM as it does for x86.
Anyone using R23 instead of R24 is putting ARM at a disadvantage. Notebookcheck is often called out for this and hasn't really addressed why they stick with R23 beyond not wanting to redo tests for older hardware. They are by far the outlier for performance numbers, and that's why the discussion around performance gets muddied.
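In practice the gap tends to look like this: a kernel gets a hand-written AVX path while everything else falls back to scalar code. This is a made-up example, not Embree's actual source:

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical kernel: x86 builds get an 8-wide SIMD path, while a
 * build without a NEON port processes one float at a time. */
void scale(float *x, float k, size_t n) {
#ifdef __AVX__
    __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vk));
    for (; i < n; i++)          /* scalar tail */
        x[i] *= k;
#else
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
#endif
}
```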
> R23 doesn’t have the same SIMD optimizations available for ARM as it does for x86.
The single-thread benchmark is SIMD-heavy?
Now it just sounds like Cinebench ST is a useless benchmark because it's putting a parallelizable SIMD workload on a single core. In real life you'd always be running those multi-threaded, whereas the reason people care about ST performance is for the serialized branch-heavy spaghetti code that inherently only runs on one core. "Run the SIMD code, but clamp it to a single thread" is a garbage proxy for that.
Yes, there's no difference between the single- and multi-core benchmarks other than how many threads get spun up.
I’m not sure why you’re trying to equate simd with parallelization. Tbh, a lot of your response seems odd to me because it’s making several incorrect assumptions.
You can't really escape parallelization with how any modern core works, even on a single core. Certain operations may be processed concurrently depending on how the core's resources are available at a given time and what is needed.
There’s still significant benefit to having SIMD on a single threaded task. There’s a lot of thread overhead to using multiple cores to do something, whereas SIMD lets you effectively batch things on a single core.
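Something like this, say, where the batching happens entirely inside one thread with no spawn/join or synchronization cost (a made-up x86/AVX example, nothing to do with Cinebench's code):

```c
#include <stddef.h>
#include <immintrin.h>

/* Sums n floats 8 at a time on a single core (n assumed to be a
 * multiple of 8 to keep the sketch short). For a buffer like this,
 * spawning and joining threads would usually cost more than it saves. */
float sum_avx(const float *x, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int i = 0; i < 8; i++)
        s += lanes[i];
    return s;
}
```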
SIMD workloads generally imply that you're doing the same operation repeatedly. It's literally in the name; single instruction, multiple data. There are occasional cases where that happens but doesn't parallelize well. TLS is probably a good example because you might have to encrypt a network packet and it's big enough to benefit from SIMD but not big enough that the overhead of splitting it across cores is worth it.
But most of the time if you're doing the same operation repeatedly you'll benefit from using more cores. Even for TLS, the client might not split the individual connection across multiple cores, but the server is going to handle multiple clients at once in parallel. Heavy workloads like video encoding make this even more apparent. In general the things that benefit from SIMD are parallel tasks that benefit from multiple cores.
Compare this with, say, a browser running JavaScript in a single tab. There is nothing to put on another core, you don't know what instructions will be executed next until you get there. This is where people actually care about single-thread performance, and where processors achieve it by using branch prediction etc. But these exercise very different parts of the CPU than SIMD-heavy workloads. The latter can easily fill the execution units of a wide processor that would be stymied by the former.
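Roughly the difference between these two kinds of functions (made-up micro-examples, not from any benchmark):

```c
#include <stddef.h>

/* SIMD-friendly: the same multiply-add over independent data.
 * Compilers can auto-vectorize this and a wide core stays busy. */
float dot(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Branch-heavy and serialized: each step depends on the previous load
 * and an unpredictable compare, so the SIMD units mostly sit idle. */
struct node { int key; struct node *left, *right; };

const struct node *find(const struct node *t, int key) {
    while (t && t->key != key)
        t = (key < t->key) ? t->left : t->right;
    return t;
}
```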
This feels like a really absurd stretch, trying to separate SIMD from stuff like standard integer and float operations.
Maybe in the early 90s, but they’re such a part of processor design that you can’t realistically avoid them.
Especially for rendering, which is matrix math heavy, you’d have to design something completely bespoke to avoid it. SIMD is a natural necessity for rendering with any kind of performance.
And because SIMD is such a part of every mainstream processor, it’s very important that benchmarks show how well they perform.
I also don’t understand why you think a JavaScript runtime wouldn’t use SIMD. V8 can make use of SIMD, whether directly targeted or indirectly via the compiler that compiled the runtime itself.
If you want to stress very specific parts of a processor, then use something like SPEC. Cinebench is meant to be a realistic reflection of production rendering.
> Especially for rendering, which is matrix math heavy, you’d have to design something completely bespoke to avoid it. SIMD is a natural necessity for rendering with any kind of performance.
Well sure, but rendering is a classically parallel operation which is regularly implemented as threaded.
> I also don’t understand why you think a JavaScript runtime wouldn’t use SIMD. V8 can make use of SIMD, whether directly targeted or indirectly via the compiler that compiled the runtime itself.
JavaScript runtimes are executing code, so they'll implement the whole gamut and their execution will depend on what kind of code it actually is. But the common JavaScript code, and the kind presumably being tested in JavaScript benchmarks because it's what people care about, isn't implementing a video encoder using SIMD. It's manipulating DOM objects and parsing short pieces of text input, which is branch-heavy code with lots of indirection and very little use of SIMD if any.
> If you want to stress very specific parts of a processor, then use something like SPEC. Cinebench is meant to be a realistic reflection of production rendering.
Which is kind of my point. Production rendering is going to be threaded and max out all the cores, which is Cinebench MT. "Cinebench ST" is measuring something that nobody does in real life and doesn't even really correlate with the things people actually do.
It doesn't represent real threaded workloads (which optimally run on many low-clocked cores, not one high-clocked one) nor real serialized workloads (which are full of conditional jumps and cache misses).
Rendering is parallel, yes, but it also makes heavy use of SIMD to accelerate the operations per thread. One does not obviate the other.
In the most trivial case, sure, the JavaScript runtimes won't compile to use SIMD, but there are lots of cases where they will as part of their JIT. I think you're trivializing how they work.
And back to the main point, it doesn’t really matter if you believe Cinebench reflects real world rendering. The fact is that R23 uses SIMD for x86, but not for ARM. R24 rectifies that. Both R23 and R24 use the same rendering code path regardless of running in single or multi threaded mode.
So using R23 as a benchmark for efficiency and performance will naturally benefit x86 significantly. There's a reason none of the people who push the "AMD is almost the same" argument use the fairer benchmark to do so. R24 really highlights the actual discrepancy when both are given a fair playing field.
> The fact is that R23 uses SIMD for x86, but not for ARM. R24 rectifies that. Both R23 and R24 use the same rendering code path regardless of running in single or multi threaded mode.
But that's not the issue. Even if Cinebench R23 isn't a valid comparison, Zen5 is also faster for Cinebench R24 MT.
Cinebench ST (R24 or R23) turns out to be a silly benchmark, because nobody in real life is going to artificially limit their renderer to one thread, but a renderer limited to one thread is also a bad proxy for real single-threaded workloads.
What it's mostly telling you is how wide the CPU is. Which only matters for real single-threaded code if the CPU can find enough instruction-level parallelism to exploit (which that test doesn't probe), and only matters for multi-threaded code to the extent that the processor can maintain that instruction density without becoming limited by thermals/cache/memory/etc. when the code is running on all the cores, which is the thing the MT benchmark tests.
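The width-vs-ILP distinction is basically the difference between these two loops (a contrived micro-example, not anything Cinebench runs):

```c
#include <stddef.h>

/* One long dependency chain: every add waits on the previous result,
 * so extra width in the core goes unused. */
double sum_chain(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: a wide out-of-order core can overlap them
 * within a single thread (n assumed to be a multiple of 4). */
double sum_split(const double *x, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```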
Again, I think your logic here is flawed. It's still valuable to know how a single core behaves for a rendering workload.
I’ve worked professionally in feature film production. We test single threaded performance all the time.
It tells us the performance characteristics of the nodes on our farm, and lets us more accurately get a picture of how jobs will scale and/or be packed.
It’s not uncommon to have a single thread rendering, for example to update part of an image while an artist works on other parts.
It’s not just a test of how wide the chip is. It also tests things like how it actually handles the various instruction sets from a real world codebase. Not all processors that are equally “wide” (not the right term but whatever) handle AVX the same, and you need to know how a single core behaves for that. It’s also useful to see how the cores actually behave on their own so you can eliminate the overhead of thread synchronization and system scheduling affecting you.
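(For what it's worth, that kind of single-core isolation can be approximated by pinning the process before running the test. Here's a Linux-only sketch using sched_setaffinity, not how any particular farm actually does it:)

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                       /* core 0, chosen arbitrarily */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run the single-threaded workload here ... */
    return 0;
}
```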
The 14900KS and Ryzen 7950 can basically turbo way over 253 W as long as the chip stays cool. Both chips have two (or more?) eco modes.
It's actually super silly, because the 125 W mode is most often good enough.