LLMs are primarily "memory-bound" rather than "compute-bound" during normal use.
The model weights (billions of parameters) must be loaded into memory before you can use them.
Think of it like this: Even with a very fast chef (powerful CPU/GPU), if your kitchen counter (VRAM) is too small to lay out all the ingredients, cooking becomes inefficient or impossible.
Processing power still matters for speed once everything fits in memory, but it's secondary to having enough VRAM in the first place.
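As a rough back-of-envelope sketch of that sizing step (the ~1.2 overhead factor for KV cache, activations, and runtime buffers is an assumption, not a measured number):

    def estimate_vram_gb(params_billion, bytes_per_param, overhead=1.2):
        # Weights dominate: 1B params at 1 byte/param is ~1 GB.
        weights_gb = params_billion * bytes_per_param
        # Assumed ~20% headroom for KV cache, activations, and buffers.
        return weights_gb * overhead

    print(estimate_vram_gb(70, 2.0))  # FP16 70B: ~168 GB -> multi-GPU or big unified memory
    print(estimate_vram_gb(70, 0.5))  # 4-bit 70B: ~42 GB -> 2x 24 GB cards or a high-end Mac

The takeaway is that quantization moves the line for "does it fit at all" long before raw speed enters the picture.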
It's pretty close. A 3090 or 4090 has about 1TB/s of memory bandwidth, while the top Apple chips have a bit over 800GB/s. Where you'll see a big difference is in prompt processing. Without the compute power of a pile of GPUs, chewing through long prompts, code, documents, etc. is going to be slower.
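Back-of-envelope version of why bandwidth sets the generation ceiling but not the prompt-processing speed (the "one full read of the weights per token" model and the 40 GB weight figure are assumptions that ignore KV cache traffic and batching):

    # Decode is roughly memory-bound: each new token needs ~one full pass
    # over the weights, so tokens/s ~= bandwidth / weight size.
    def decode_tokens_per_sec(bandwidth_gb_s, weights_gb):
        return bandwidth_gb_s / weights_gb

    weights_gb = 40.0  # hypothetical ~70B model quantized to 4-bit
    print(decode_tokens_per_sec(1000.0, weights_gb))  # ~1 TB/s (4090-class): ~25 tok/s
    print(decode_tokens_per_sec(800.0, weights_gb))   # ~800 GB/s (Apple Ultra): ~20 tok/s
    # Prefill (prompt processing) batches many tokens per weight read, so it
    # is compute-bound instead -- which is where discrete GPUs pull ahead.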
I chose the 3090/4090 because it seems to me that this machine could be a replacement for a workstation or a homelab rig at a similar price point, but not a $100-250k server in a datacenter. It's not really surprising or interesting that the datacenter GPUs are superior.
FWIW I went the route of "bunch of GPUs in a desktop case" because I felt having the compute oomph was worth it.
Sure, on consumer GPUs, but that's not what constrains model inference in most actual industry setups. Technically, even then you're bound by CPU-GPU memory bandwidth more than by GPU memory itself, although that is maybe splitting hairs.
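Rough illustration of that CPU-GPU bandwidth point for the offloading case, assuming ~32 GB/s for a PCIe 4.0 x16 link, ~1 TB/s for VRAM, and a hypothetical 8 GB slice of weights spilled to system RAM:

    # If part of the weights live in system RAM, every generated token drags
    # that slice across the CPU-GPU link, which is far slower than VRAM.
    # All bandwidth figures and sizes below are assumptions.
    pcie_gb_s = 32.0      # ~PCIe 4.0 x16 theoretical peak
    vram_gb_s = 1000.0    # ~consumer GPU VRAM

    offloaded_gb = 8.0    # hypothetical weights left in system RAM
    resident_gb = 32.0    # hypothetical weights resident in VRAM

    per_token_s = offloaded_gb / pcie_gb_s + resident_gb / vram_gb_s
    print(per_token_s)               # ~0.28 s/token, dominated by the PCIe term
    print(resident_gb / vram_gb_s)   # ~0.03 s/token if everything fit in VRAM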