LLMs are primarily "memory-bound" rather than "compute-bound" during normal use.
The model weights (billions of parameters) must be loaded into memory before you can use them.
Think of it like this: Even with a very fast chef (powerful CPU/GPU), if your kitchen counter (VRAM) is too small to lay out all the ingredients, cooking becomes inefficient or impossible.
Processing power still matters for speed once everything fits in memory, but it's secondary to having enough VRAM in the first place.
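As a rough back-of-envelope sketch of that sizing step (the ~1.2 overhead factor for KV cache, activations, and runtime buffers is an assumption, not a measured number):

    def estimate_vram_gb(params_billion, bytes_per_param, overhead=1.2):
        # Weights dominate: 1B params at 1 byte/param is ~1 GB.
        weights_gb = params_billion * bytes_per_param
        # Assumed ~20% headroom for KV cache, activations, and buffers.
        return weights_gb * overhead

    print(estimate_vram_gb(70, 2.0))  # FP16 70B: ~168 GB -> multi-GPU or big unified memory
    print(estimate_vram_gb(70, 0.5))  # 4-bit 70B: ~42 GB -> 2x 24 GB cards or a high-end Mac

The takeaway is that quantization moves the line for "does it fit at all" long before raw speed enters the picture.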
It's pretty close. A 3090 or 4090 has about 1TB/s of memory bandwidth, while the top Apple chips have a bit over 800GB/s. Where you'll see a big difference is in prompt processing. Without the compute power of a pile of GPUs, chewing through long prompts, code, documents, etc. is going to be slower.
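Back-of-envelope version of why bandwidth sets the generation ceiling but not the prompt-processing speed (the "one full read of the weights per token" model and the 40 GB weight figure are assumptions that ignore KV cache traffic and batching):

    # Decode is roughly memory-bound: each new token needs ~one full pass
    # over the weights, so tokens/s ~= bandwidth / weight size.
    def decode_tokens_per_sec(bandwidth_gb_s, weights_gb):
        return bandwidth_gb_s / weights_gb

    weights_gb = 40.0  # hypothetical ~70B model quantized to 4-bit
    print(decode_tokens_per_sec(1000.0, weights_gb))  # ~1 TB/s (4090-class): ~25 tok/s
    print(decode_tokens_per_sec(800.0, weights_gb))   # ~800 GB/s (Apple Ultra): ~20 tok/s
    # Prefill (prompt processing) batches many tokens per weight read, so it
    # is compute-bound instead -- which is where discrete GPUs pull ahead.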
I chose the 3090/4090 because it seems to me that this machine could be a replacement for a workstation or a homelab rig at a similar price point, but not a $100-250k server in a datacenter. It's not really surprising or interesting that the datacenter GPUs are superior.
FWIW I went the route of "bunch of GPUs in a desktop case" because I felt having the compute oomph was worth it.
Sure, on consumer GPUs, but that's not what constrains model inference in most actual industry setups. Technically, even then you're bound by CPU-GPU memory bandwidth more than by GPU memory itself, although that is maybe splitting hairs.
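Rough illustration of that CPU-GPU bandwidth point for the offloading case, assuming ~32 GB/s for a PCIe 4.0 x16 link, ~1 TB/s for VRAM, and a hypothetical 8 GB slice of weights spilled to system RAM:

    # If part of the weights live in system RAM, every generated token drags
    # that slice across the CPU-GPU link, which is far slower than VRAM.
    # All bandwidth figures and sizes below are assumptions.
    pcie_gb_s = 32.0      # ~PCIe 4.0 x16 theoretical peak
    vram_gb_s = 1000.0    # ~consumer GPU VRAM

    offloaded_gb = 8.0    # hypothetical weights left in system RAM
    resident_gb = 32.0    # hypothetical weights resident in VRAM

    per_token_s = offloaded_gb / pcie_gb_s + resident_gb / vram_gb_s
    print(per_token_s)               # ~0.28 s/token, dominated by the PCIe term
    print(resident_gb / vram_gb_s)   # ~0.03 s/token if everything fit in VRAM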