LLMs are primarily "memory-bound" rather than "compute-bound" during normal use.

The model weights (billions of parameters) must be loaded into memory before you can use them.

Think of it like this: Even with a very fast chef (powerful CPU/GPU), if your kitchen counter (VRAM) is too small to lay out all the ingredients, cooking becomes inefficient or impossible.

Processing power still matters for speed once everything fits in memory, but it's secondary to having enough VRAM in the first place.
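
As a rough back-of-the-envelope sketch (illustrative numbers only, ignoring KV cache and runtime overhead), the weight footprint alone tells you whether a model fits at all:

    BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

    def weights_gb(n_params_billion, dtype="fp16"):
        # Approximate memory needed just to hold the weights, in GB.
        return n_params_billion * BYTES_PER_PARAM[dtype]

    for n in (7, 13, 70):
        print(f"{n}B params: ~{weights_gb(n, 'fp16'):.1f} GB fp16, "
              f"~{weights_gb(n, 'int4'):.1f} GB int4")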




Transformers are typically memory-bandwidth bound during decoding. This chip is going to have much lower memory bandwidth than the Nvidia chips.

My guess is that these chips could be compute-bound, though, given how little compute capacity they have.
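
A minimal sketch of why decode is bandwidth-bound: at batch size 1, every weight gets read roughly once per generated token, so memory bandwidth divided by weight size caps tokens/sec (illustrative sizes, KV-cache traffic ignored):

    def decode_tok_per_s_ceiling(weight_gb, mem_bw_gb_s):
        # Weights are streamed roughly once per generated token (batch size 1),
        # so bandwidth / weight size bounds tokens per second.
        return mem_bw_gb_s / weight_gb

    model_gb = 14  # e.g. a 7B model in fp16 (assumed)
    for name, bw in [("~1000 GB/s (4090-class)", 1000),
                     ("~800 GB/s (top Apple chip)", 800)]:
        print(f"{name}: <= {decode_tok_per_s_ceiling(model_gb, bw):.0f} tok/s")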


It's pretty close. A 3090 or 4090 has about 1TB/s of memory bandwidth, while the top Apple chips have a bit over 800GB/s. Where you'll see a big difference is in prompt processing. Without the compute power of a pile of GPUs, chewing through long prompts, code, documents, etc. is going to be slower.
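
To put rough numbers on the prompt-processing gap (assumed throughputs, not benchmarks): prefill costs roughly 2 * params FLOPs per prompt token, so the compute difference shows up directly in time-to-first-token:

    def prefill_seconds(n_params, prompt_tokens, tflops):
        # Dense transformer forward pass costs ~2 * n_params FLOPs per token.
        flops = 2 * n_params * prompt_tokens
        return flops / (tflops * 1e12)

    params, prompt = 7e9, 8000  # 7B model, 8k-token prompt (illustrative)
    for name, tflops in [("~150 fp16 TFLOPS (4090-class, assumed)", 150),
                         ("~30 TFLOPS (Apple-class, assumed)", 30)]:
        print(f"{name}: ~{prefill_seconds(params, prompt, tflops):.1f} s to ingest the prompt")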


Nobody in industry is using a 4090; they are using H100s, which have ~3TB/s. Apple also doesn't have any equivalent to NVLink.

I agree that compute is likely to become the bottleneck for these new Apple chips, given they only have something like 0.1% of the FLOPS.


I chose the 3090/4090 because it seems to me that this machine could be a replacement for a workstation or a homelab rig at a similar price point, but not a $100-250k server in a datacenter. It's not really surprising or interesting that the datacenter GPUs are superior.

FWIW I went the route of "bunch of GPUs in a desktop case" because I felt having the compute oomph was worth it.


4.8TB/s on H200, 8TB/s on B200, pretty insane.


That’s wild, somehow I hadn’t seen the B200 specs before now. I wish I could have even a fraction of that!


VRAM capacity is the initial gatekeeper, then bandwidth becomes the limiting factor.
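
A toy sketch of those two gates in order (made-up numbers):

    def speed_ceiling(weight_gb, vram_gb, mem_bw_gb_s):
        # Gate 1: capacity. If the weights don't fit, speed is moot.
        if weight_gb > vram_gb:
            return None
        # Gate 2: bandwidth sets the decode ceiling once the model fits.
        return mem_bw_gb_s / weight_gb

    print(speed_ceiling(140, 24, 1000))   # 70B fp16 on a 24 GB card -> None
    print(speed_ceiling(140, 192, 800))   # same model in 192 GB unified memory -> ~5.7 tok/s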


I suspect that compute might actually be the limiter for these chips before bandwidth, but I'm not certain.


> Transformers are typically memory-bandwidth bound during decoding.

Not in the case of language models, which are typically bound by memory size rather than bandwidth.


nope


I assume even this one won't run on an RTX 5090 due to constrained memory size: https://news.ycombinator.com/item?id=43270843


Sure, on consumer GPUs, but that is not what constrains model inference in most actual industry setups. Technically, even then you are bound more by CPU-GPU memory bandwidth than by GPU memory alone, although that is maybe splitting hairs.
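
Rough sketch of why the host-GPU link matters once weights spill out of VRAM (nominal link speeds, and assuming offloaded weights cross PCIe on every decode step):

    PCIE4_X16_GB_S = 32    # nominal PCIe 4.0 x16; real throughput is lower
    VRAM_GB_S = 1000       # ballpark 4090-class VRAM bandwidth

    def offload_ceiling_tok_s(weight_gb, offloaded_fraction):
        # Time per token = VRAM-resident traffic + PCIe-resident traffic.
        vram_time = weight_gb * (1 - offloaded_fraction) / VRAM_GB_S
        pcie_time = weight_gb * offloaded_fraction / PCIE4_X16_GB_S
        return 1 / (vram_time + pcie_time)

    print(offload_ceiling_tok_s(14, 0.0))   # all in VRAM: ~71 tok/s ceiling
    print(offload_ceiling_tok_s(14, 0.3))   # 30% spilled to host: ~7 tok/s ceiling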


Why are industry setups considered actual while others are not?



