
Only those who don't care/know about prompt processing speed are buying Macs for LLM inference.


Don't know and don't care are definitely things that I could be, but it also makes sense if they want to keep lookups private.


Even 40 tokens per second is plenty for real-time usage. The average person reads at ~4 words per second, and 40 tokens per second works out to roughly 15-20 words per second.

Even useful models like gemma3 27b are hitting 22 t/s on 4-bit quants.

You aren't going to be reformatting gigabytes of PDFs or anything, but for a lot of common use cases, those speeds are fine.
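
For a rough sanity check, here's a back-of-the-envelope sketch in Python of how token throughput maps to reading speed. The ~0.45 words-per-token ratio is an assumption (it varies a lot with the tokenizer and the content), and the ~4 words-per-second reading speed is the figure quoted above:

    # Back-of-the-envelope: convert generation speed (tokens/s) into an
    # approximate reading speed (words/s). Both constants are assumptions.
    WORDS_PER_TOKEN = 0.45   # rough English average; varies by tokenizer/content
    READING_SPEED_WPS = 4.0  # average reading speed quoted above, words/second

    def words_per_second(tokens_per_second, words_per_token=WORDS_PER_TOKEN):
        return tokens_per_second * words_per_token

    for tps in (22, 40):
        wps = words_per_second(tps)
        print(f"{tps} tok/s ~= {wps:.0f} words/s "
              f"({wps / READING_SPEED_WPS:.1f}x average reading speed)")

Even at the lower 22 t/s figure, generation comfortably outpaces reading under these assumptions.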


For LLM inference, I don't think PCIe bandwidth matters much, and a GPU could greatly improve prompt processing speed.


The Strix Halo iGPU is quite special, like the Apple iGPU it has such good memory bandwidth to system RAM that it manages to improve both prompt processing and token generation compared to pure CPU inference. You really can't say that about the average iGPU or low-end dGPU: usually their memory bandwidth is way too anemic, hence the CPU wins when it comes to emitting tokens.


Only if your entire model fits in the GPU's VRAM.

To me this reads like "if you can afford those 256GB VRAM GPUs, you don't need PCIe bandwidth!"


No, that's not true. Prompt processing just needs the attention tensors in VRAM; the MLP weights aren't needed for the heavy calculations that a GPU speeds up. (After attention, you only need to pass the activations from the GPU to system RAM, which is ~40KB, so you're not very limited here.)

That's pretty small.

Even DeepSeek R1 0528 685b only has ~16GB of attention weights. Kimi K2, with 1T parameters, has 6,168,951,472 attention params, which works out to ~12GB.

It's pretty easy to do prompt processing for massive models like DeepSeek R1, Kimi K2, or Qwen 3 235b with only a single Nvidia 3090 GPU. Just pass --n-cpu-moe 99 to llama.cpp or something similar.
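
As a rough sanity check on those sizes, here is a minimal sketch using the Kimi K2 attention-parameter count quoted above; it assumes fp16/bf16 weights at 2 bytes per parameter (a 4-bit quant would be roughly 4x smaller):

    # Rough VRAM estimate for keeping only the attention weights on the GPU.
    # bytes_per_param=2 assumes fp16/bf16; a 4-bit quant would be ~4x smaller.
    def attn_weight_gb(attn_params, bytes_per_param=2):
        return attn_params * bytes_per_param / 1024**3

    kimi_k2_attn_params = 6_168_951_472  # figure quoted above
    print(f"Kimi K2 attention weights: ~{attn_weight_gb(kimi_k2_attn_params):.1f} GB")
    # ~11.5 GB -- fits comfortably in a single 24 GB card like a 3090.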


If you can't, though, your performance will likely be abysmal, so there's almost no middle ground for the LLM workload.


Yeah, I think so. Once the whole model is on the GPU (potentially slower start-up), there really isn't much traffic between the GPU and the motherboard. That's how I think about it. But I'm mostly saying this because I'm interested in being corrected if I'm wrong.


Indeed, recent Flash Attention is a pain point for non-CUDA backends.


The idea is presumably that you would "sell" at an artificially low price.


Which is illegal if you do it for tax evasion.


and legal if you’re just helping your buddy out


Not sure if the tax office agrees.


> The Soviets had the […]first woman,[…]

That is quite the claim!


Claim? It's a matter of history. Her name is Valentina Tereshkova, and she's still alive. [0]

[0]: https://en.wikipedia.org/wiki/Valentina_Tereshkova


Woosh! It's clearly a joke. "first woman" vs "first woman in space".


You forgot the "/s", or do you actually believe that it's capitalism's fault if a mother taking care of her children is "unpaid labor"?


"is hard" ≠ "sucks"


Most interesting! Would you mind sharing the prompt and the resulting CLAUDE.md file?

Thx!


IMO, it would be more interesting to have a 3-way comparison of price/performance between DeepSeek 671b running on:

1. M3 Ultra 512
2. AMD Epyc (which gen? AVX512 and DDR5 might make a difference in both performance and cost; Gen 4 or Gen 5 get 8 or 9 t/s: https://github.com/ggml-org/llama.cpp/discussions/11733)
3. AMD Epyc + 4090 or 5090 running KTransformers (over 10 t/s decode? https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...)


DeepSeek is not a model. Which model did you use (V3? R1? A distillation?) and at which quantization?

