You're comparing two different things.

The compute figure you’re quoting for the M3 Ultra is the neural engine alone, not including the GPU.

I expect the GPU here will be behind a 5090 for compute, but not by anything like the numbers you’re quoting. After all, the 5090 alone draws multiple times the wattage of this entire SoC.




Using the NPU numbers grossly overstates the AI performance of the Apple Silicon hardware, so they're actually giving Apple the benefit of the doubt.

Most AI training and inference (including generative AI) is bound by large-scale matrix MACs. That's why NVIDIA fills their devices with enormous numbers of tensor cores and Apple, Qualcomm, et al. are adding NPUs, filling largely the same gap. Only NVIDIA's are not just an order of magnitude more performant, they're also massively more flexible (in data types and applications) and usable for both training and inference, while Apple's is useful only for a limited set of inference tasks (due to architecture and type limits).
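
(For the curious, the op in question is just a matrix multiply-accumulate. A minimal NumPy sketch of the thing both tensor cores and NPUs accelerate, with shapes made up:)

    import numpy as np

    # D = A @ B + C: the matrix MAC at the heart of both training and inference
    A = np.random.randn(128, 64).astype(np.float16)
    B = np.random.randn(64, 128).astype(np.float16)
    C = np.random.randn(128, 128).astype(np.float16)
    D = A @ B + C  # dedicated hardware does this as one fused pass over tiles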

Apple could put the effort in and make something actually competitive with NVIDIA, but this isn't it.


Care to share the TOPS numbers for the Apple GPUs and show how the NPU figures “grossly overstate” Apple’s AI performance?

Apple won’t compete with NVIDIA, I’m not arguing that. But your opening line only makes sense if you can back up the numbers and the GPU’s performance really is lower than the ANE’s TOPS.


Tensor/neural cores are very easy to benchmark and give a precise number because they do a single well-defined thing at a large scale. GPUs don't, so GPU TOPS numbers are less common and much more use-specific.

However the M2 Ultra GPU is estimated, with every bit of compute power working together, at about 26 TOPS.
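
(Back-of-envelope for where a figure like that comes from, assuming the commonly reported, but unofficial, M2 Ultra GPU specs of 9728 ALUs at ~1.4 GHz:)

    alus = 9728          # reported M2 Ultra GPU ALU count (unofficial)
    clock_hz = 1.4e9     # reported boost clock (unofficial)
    ops_per_fma = 2      # one fused multiply-add counts as two ops
    print(alus * clock_hz * ops_per_fma / 1e12)  # ~27 TOPS, same ballpark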


Could you provide a link for that TOPS count? (And specifically TOPS with comparable unit sizes, since NVIDIA and Apple did not use the same units until recently.)

The only similar number I can find is a TFLOPS figure, not TOPS.

Again I’m not saying the GPU will be comparable to an NVIDIA one, but that the comparison point isn’t sensible in the comments I originally replied to.


> After all, the 5090 alone is multiple times the wattage of this SoC.

FWIW, normalizing the wattages (or even underclocking the GPU) will still give you an Nvidia advantage most days. Apple's GPU designs are closer to AMD's designs than Nvidia's, which means they omit a lot of AI accelerators to focus on less-LLM-relevant raster performance.

Yes, the GPU is faster than the NPU. But Apple's GPU designs haven't traditionally put their competitors out of a job.


M2 Ultra is ~250W (averaging various reports, since Apple doesn’t publish a figure) for the entire SoC.

5090 is 575W without the CPU.

You’d have to cut the Nvidia card to a quarter of its power and then find a comparable CPU within the remaining budget to normalize the wattage for an actual comparison.
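
(The rough arithmetic, using the figures above:)

    m2_ultra_soc_w = 250   # entire SoC, estimated
    rtx_5090_w = 575       # GPU alone
    print(rtx_5090_w / m2_ultra_soc_w)  # ~2.3x, for the GPU alone vs the whole SoC
    print(rtx_5090_w / 4)  # ~144W, the "quarter" that leaves room for a CPU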

I agree that Apple GPUs aren’t putting the dedicated GPU companies in danger on the benchmarks, but they’re also not really targeting that. They’re in completely different zones on too many fronts to really compare.


Well, select your hardware of choice and see for yourself then: https://browser.geekbench.com/opencl-benchmarks

> but they’re also not really targeting it?

That's fine, but it's not an excuse to ignore the power/performance ratio.


But I’m not ignoring the power/performance ratio? If anything, you are doing that by handwaving away the difference.

Give me a comparable system build where the NVIDIA GPU + any CPU of your choice is running at the same wattage as an M2 Ultra, and outperforms it on average. You’d get 150W for the GPU and 150W for the CPU.

Again, you can’t really compare the two. They’re inherently different systems unless you only care about singular metrics.


No, I'm not. I'm comparing the TOPS of the M3 Ultra and the tensor cores of the RTX 5090.

If not, what is the TOPS of the GPU, and why isn't Apple talking about it if there is more performance hidden somewhere? Apple states 18 TOPS for the M3 Max. And why do you think Apple added the neural engine, if not to accelerate compute?

The 5090's power draw is quite a bit higher, but it's still much more efficient because its performance is so much higher.


The ANE and tensor cores are not comparable though. One is literally meant for low-cost inference while the other is meant for accelerating training.

If you squint, yeah, they look the same, but so do the microcontroller on a GPU and a full-blown CPU. They have fundamentally different purposes, architectures, and scales of use.

The ANE can’t even really be used directly. Apple heavily restricts access to it via the CoreML APIs, for inference only, and it’s only usable for smaller, lightweight models.
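
(To illustrate: via coremltools you can request the ANE, but only request it. A sketch, assuming a hypothetical pre-converted model.mlpackage:)

    import coremltools as ct

    # You can only *ask* for the ANE; CoreML decides where the model actually runs.
    model = ct.models.MLModel(
        "model.mlpackage",                        # hypothetical converted model
        compute_units=ct.ComputeUnit.CPU_AND_NE,  # "prefer CPU + Neural Engine"
    )
    # There is no public API to program the ANE or pin a layer to it.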

If you’re comparing to the tensor cores, you really need to compare against the GPU, which is what Apple’s ML frameworks such as MLX use for training and the like.
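
(For example, a toy MLX training step; on Apple Silicon this runs on the GPU by default, not the ANE. Shapes are made up:)

    import mlx.core as mx

    W = mx.random.normal((64, 64))
    x = mx.random.normal((32, 64))
    y = mx.random.normal((32, 64))

    def loss(W):
        return mx.mean((x @ W - y) ** 2)

    grads = mx.grad(loss)(W)  # gradients computed on the GPU
    mx.eval(grads)            # force MLX's lazy evaluation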

It will still be behind the NVIDIA GPU, but not by anywhere near the same numbers.


>The ANE and tensor cores are not comparable though

They're both built to do the most common computation in AI (both training and inference), which is multiply and accumulate of matrices - A * B + C. The ANE is far more limited because they decided to spend a lot less silicon space on it, focusing on low-power inference of quantized models. It is fantastically useful for a lot of on-device things like a lot of the photo features (e.g. subject detection, text extraction, etc).

And yes, you need to use CoreML to access it because it's so limited. In the future Apple will absolutely, with 100% certainty, make an ANE that is as flexible and powerful as tensor cores, and forcing you through CoreML means it can switch to using it automatically (today you submit a job to CoreML and for many workloads it will opt to use the CPU/GPU instead, or a combination thereof; it's an elegant, forward-thinking implementation). Their AI performance and credibility will greatly improve when they do.

>you really need to compare against the GPU

From a raw performance perspective, the ANE is capable of more matrix multiply/accumulates than the GPU is on Apple Silicon, it's just limited to types and contexts that make it unsuitable for training, or even for many inference tasks.


So now the TOPS are not comparable because M3 is much slower than an Nvidia GPU? That's not how comparisons work.

My numbers are correct: the M3 Ultra has around 1% of the TOPS performance of an RTX 5090.
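
(Back-of-envelope, assuming the M3 Ultra's ANE doubles the M3 Max's stated 18 TOPS, and using NVIDIA's advertised 3352 AI TOPS for the 5090. The units aren't identical, since NVIDIA measures at low-precision FP4:)

    ane_tops = 2 * 18      # assuming M3 Ultra = 2x the M3 Max's stated 18 TOPS
    rtx_5090_tops = 3352   # NVIDIA's advertised AI TOPS (low-precision FP4)
    print(ane_tops / rtx_5090_tops)  # ~0.011, i.e. around 1%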

Comparing against the GPU would look even worse for Apple. Do you think Apple added the neural engine just for fun? This is exactly what the neural engine is there for.


You’re completely missing the point. The ANE is not equivalent to the tensor cores as a component. It has nothing to do with a comparison of TOPS; it’s about what they’re intended for.

Try and use the ANE in the same way you would use the tensor cores. Hint: you can’t, because the hardware and software will actively block you.

They’re meant for fundamentally different use cases and power loads. Even Apple’s own ML frameworks do not use the ANE for anything except inference.



