
Only those who don't care/know about prompt processing speed are buying Macs for LLM inference.


Don't know and don't care are definitely things that I could be, but it also makes sense if they want to keep lookups private.


Even 40 tokens per second is plenty for real-time usage. The average person reads at ~4 words per second, and 40 tokens per second works out to roughly 15-20 words per second.

Even useful models like gemma3 27b are hitting 22 t/s on 4-bit quants.

You aren't going to be reformatting gigabytes of PDFs or anything, but for a lot of common use cases, those speeds are fine.
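
For a rough sanity check, here's a back-of-the-envelope sketch in Python of how token throughput maps to reading speed. The ~0.45 words-per-token ratio is an assumption (it varies a lot with the tokenizer and the content), and the ~4 words-per-second reading speed is the figure quoted above:

    # Back-of-the-envelope: convert generation speed (tokens/s) into an
    # approximate reading speed (words/s). Both constants are assumptions.
    WORDS_PER_TOKEN = 0.45   # rough English average; varies by tokenizer/content
    READING_SPEED_WPS = 4.0  # average reading speed quoted above, words/second

    def words_per_second(tokens_per_second, words_per_token=WORDS_PER_TOKEN):
        return tokens_per_second * words_per_token

    for tps in (22, 40):
        wps = words_per_second(tps)
        print(f"{tps} tok/s ~= {wps:.0f} words/s "
              f"({wps / READING_SPEED_WPS:.1f}x average reading speed)")

Even at the lower 22 t/s figure, generation comfortably outpaces reading under these assumptions.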


For LLM inference, I don't think PCIe bandwidth matters much, and a GPU could greatly improve prompt processing speed.


The Strix Halo iGPU is quite special, like the Apple iGPU it has such good memory bandwidth to system RAM that it manages to improve both prompt processing and token generation compared to pure CPU inference. You really can't say that about the average iGPU or low-end dGPU: usually their memory bandwidth is way too anemic, hence the CPU wins when it comes to emitting tokens.


Only if your entire model fits in the GPU's VRAM.

To me this reads like "if you can afford those 256GB VRAM GPUs, you don't need PCIe bandwidth!"


No, that's not true. Prompt processing just needs the attention tensors in VRAM; the MLP weights aren't needed for the heavy calculations that a GPU speeds up. (After attention, you only need to pass the activations from the GPU to system RAM, which is ~40KB, so you're not very limited here.)

That's pretty small.

Even DeepSeek R1 0528 685b only has ~16GB of attention weights. Kimi K2, with 1T parameters, has 6,168,951,472 attention params, which works out to ~12GB.

It's pretty easy to do prompt processing for massive models like DeepSeek R1, Kimi K2, or Qwen 3 235b with only a single Nvidia 3090 GPU. Just pass --n-cpu-moe 99 to llama.cpp or something similar.
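
As a rough sanity check on those sizes, here is a minimal sketch using the Kimi K2 attention-parameter count quoted above; it assumes fp16/bf16 weights at 2 bytes per parameter (a 4-bit quant would be roughly 4x smaller):

    # Rough VRAM estimate for keeping only the attention weights on the GPU.
    # bytes_per_param=2 assumes fp16/bf16; a 4-bit quant would be ~4x smaller.
    def attn_weight_gb(attn_params, bytes_per_param=2):
        return attn_params * bytes_per_param / 1024**3

    kimi_k2_attn_params = 6_168_951_472  # figure quoted above
    print(f"Kimi K2 attention weights: ~{attn_weight_gb(kimi_k2_attn_params):.1f} GB")
    # ~11.5 GB -- fits comfortably in a single 24 GB card like a 3090.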


If you can't, though, your performance will likely be abysmal, so there's almost no middle ground for the LLM workload.


Yeah, I think so. Once the whole model is on the GPU (potentially slower start-up), there really isn't much traffic between the GPU and the motherboard. That's how I think about it. But I'm mostly saying this because I'm interested in being corrected if I'm wrong.


Indeed, recent Flash Attention is a pain point for non-CUDA backends.


The idea is presumably that you would "sell" at an artificially low price.


Which is illegal if you do it for tax evasion.


and legal if you’re just helping your buddy out


Not sure if the tax office agrees.


> The Soviets had the […]first woman,[…]

That is quite the claim!


Claim? It's a matter of history. Her name is Valentina Tereshkova, and she's still alive. [0]

[0]: https://en.wikipedia.org/wiki/Valentina_Tereshkova


Woosh! It's clearly a joke. "first woman" vs "first woman in space".


You forgot the "/s", or do you actually believe that it's capitalism's fault if a mother taking care of her children is "unpaid labor"?


"is hard" ≠ "sucks"


Most interesting! Would you mind sharing the prompt and the resulting CLAUDE.md file?

Thx!


IMO, it would be more interesting to have a 3-way comparison of price/performance between DeepSeek 671b running on:

1. M3 Ultra 512
2. AMD Epyc (which gen? AVX512 and DDR5 might make a difference in both performance and cost; Gen 4 or Gen 5 get 8 or 9 t/s: https://github.com/ggml-org/llama.cpp/discussions/11733)
3. AMD Epyc + 4090 or 5090 running KTransformers (over 10 t/s decode? https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...)


DeepSeek is not a model. Which model did you use (V3? R1? A distillation?) and at which quantization?

