Hacker News

What are you talking about? Why would you multiply the bandwidth? 8 4090s is still 1000GB/s. The M2 Ultra is 800GB/s with up to 192GB of unified memory (Metal can address ~155GB of it, so you need a bit of headroom), but your comparison makes absolutely zero sense.



There are different ways to run LLMs on multiple GPUs. One of them, tensor parallelism, effectively multiplies memory bandwidth across GPUs in low-batch scenarios, because each GPU only streams its own shard of the weights. So no, 8 4090s is not 1000 GB/s.
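A minimal sketch of the idea, using NumPy on CPU to stand in for GPUs (all sizes are made up for illustration): in a column-parallel linear layer, each device holds one column shard of the weight matrix and computes its slice of the output independently, so each device reads only 1/N of the weights from its own memory per token. Aggregate weight-read bandwidth is therefore roughly the sum across devices.

```python
import numpy as np

# Toy column-parallel linear layer (the building block of tensor
# parallelism). Dimensions are illustrative only.
hidden, out_dim, n_gpus = 64, 256, 8

rng = np.random.default_rng(0)
x = rng.standard_normal((1, hidden))      # batch-1 decode activation
W = rng.standard_normal((hidden, out_dim))

# Each "GPU" owns one column shard of W and streams only that shard
# from its own memory -- bandwidth adds up across devices.
shards = np.split(W, n_gpus, axis=1)

# Every device computes x @ its shard; results are concatenated
# (an all-gather on real hardware).
y_parallel = np.concatenate([x @ s for s in shards], axis=1)

# Same result as the single-device computation.
assert np.allclose(y_parallel, x @ W)
```

The only inter-device traffic here is the small activation `x` and the gathered outputs, which is why the bottleneck stays on local memory bandwidth rather than the interconnect.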


you've heard something and are regurgitating it without fully understanding it.


I’m developing an inference engine, so I actually do understand how this works, as well as the other types of parallelism and the trade-offs each of them makes.


Let me know how the PCIe bandwidth is treating you.


Since we’re talking about small batch sizes, PCIe bandwidth isn’t that important: the intermediate hidden state is orders of magnitude smaller than the weights.
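A back-of-envelope check of that claim, using assumed 70B-class model numbers (parameter count, layer count, and hidden size are illustrative, not tied to any specific model): per decoded token, the GPUs collectively read the full weights, while the inter-GPU traffic is a couple of hidden-state all-reduces per layer.

```python
# Illustrative 70B-class model dimensions (assumptions, not exact specs).
params = 70e9          # total parameters
bytes_per_param = 2    # fp16
n_layers = 80
hidden = 8192

# Weights streamed from GPU memory per decoded token (batch size 1):
weights_read = params * bytes_per_param  # ~140 GB

# Tensor parallelism needs roughly two all-reduces of the hidden
# state per transformer layer:
comm_per_token = n_layers * 2 * hidden * bytes_per_param  # ~2.6 MB

print(f"weights read per token:   {weights_read / 1e9:.0f} GB")
print(f"inter-GPU traffic/token: ~{comm_per_token / 1e6:.1f} MB")
print(f"ratio: ~{weights_read / comm_per_token:,.0f}x")
```

Under these assumptions the memory traffic dwarfs the interconnect traffic by four to five orders of magnitude, which is the sense in which PCIe isn’t the bottleneck at batch size 1.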


When you have 8 GPUs, you can use more than 1 at a time.



