The M series chips have very good memory bandwidth and capacity, which lets them read the billions of weights of a large LLM from memory quickly.
Because the bottleneck when producing a single token is typically the time taken to get the weights into the FPU, Macs perform very well at generating additional tokens.
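As a rough sanity check on why that matters, here is a minimal back-of-envelope sketch of the bandwidth-bound decode ceiling; the bandwidth and model-size numbers are illustrative assumptions, not figures from the article.

    # Rough decode-speed ceiling, assuming generation is purely memory-bandwidth bound.
    def decode_tokens_per_sec(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
        # Each generated token requires streaming all the weights to the FPU once,
        # so the ceiling is bandwidth divided by model size.
        return mem_bandwidth_gb_s / model_size_gb

    # e.g. ~500 GB/s of bandwidth and a model quantized down to ~40 GB
    print(decode_tokens_per_sec(500, 40))  # ~12.5 tokens/s upper bound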
Producing the first token means processing the entire prompt first. With the prompt you don't need to finish one token before moving on to the next, because they are all given to you at once. The weights only have to be loaded into the FPU once for the entire prompt, rather than once per token, so the bottleneck isn't the time to get the weights to the FPU, it's the time taken to actually compute over the tokens.
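For the prefill side, a similarly rough compute-bound estimate, assuming about 2 FLOPs per parameter per token and ignoring attention cost; the TFLOPS figure is an illustrative assumption, not a quoted spec.

    # Rough prefill time, assuming the whole prompt is processed in one batched pass
    # and the limit is raw FP16 compute rather than memory bandwidth.
    def prefill_seconds(prompt_tokens: int, params_billions: float, fp16_tflops: float) -> float:
        # A forward pass costs roughly 2 FLOPs per parameter per token; with the
        # prompt batched, weights are read once but FLOPs scale with prompt length.
        flops = 2.0 * params_billions * 1e9 * prompt_tokens
        return flops / (fp16_tflops * 1e12)

    # e.g. a 70B model, an 8000-token prompt, ~30 TFLOPS of FP16 compute
    print(prefill_seconds(8000, 70, 30))  # ~37 s of pure matmul time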
Macs have comparatively low compute performance (the M4 Max runs at about 1/4 the FP16 speed of the small Nvidia box in this article, which itself is roughly 1/4 the speed of a 5090 GPU).
Next-token generation is mostly bandwidth bound, while prefill/ingest can process tokens in parallel and becomes more compute heavy. Next-token generation with speculative decoding (a draft model) also becomes compute heavy, since the large model checks several proposed tokens in parallel and only rolls back on a mispredict.
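For completeness, here is a toy sketch of that speculative-decoding control flow; the model calls are stand-ins, not a real inference API.

    import random

    def draft_propose(ctx, k):
        # cheap draft model guesses the next k tokens
        return [f"tok{len(ctx) + i}" for i in range(k)]

    def target_accepts(ctx, proposed):
        # large model scores all proposed tokens in ONE parallel pass (compute heavy)
        # and reports how many it agrees with before the first mismatch
        return random.randint(0, len(proposed))

    def generate(ctx, steps=4, k=4):
        for _ in range(steps):
            proposed = draft_propose(ctx, k)
            accepted = target_accepts(ctx, proposed)
            ctx.extend(proposed[:accepted])      # keep the tokens both models agree on
            if accepted < k:
                ctx.append("corrected")          # on mispredict, keep the target model's own token
        return ctx

    print(generate([]))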