
Is batched inference for LLMs memory bound? My understanding is that sufficiently large batched matmuls become compute bound, and flash attention has mostly removed the memory bottleneck in the attention computation. If so, the value proposition here -- as well as for other memorymaxxing startups like Groq -- is primarily on the latency side. Though my personal impression is that latency isn't really a huge issue right now, especially for text. Even OpenAI's voice models can (purportedly) be served at a latency that is a low multiple of network latency, and I expect there is room for improvement here, as this is essentially the first generation of real-time voice LLMs.
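
One way to sanity-check the memory-vs-compute question is a rough roofline-style estimate for a single decode step: compare its arithmetic intensity (FLOPs per byte of weights streamed from HBM) against the hardware's ridge point. The sketch below is Python; the peak-FLOPs and bandwidth figures are approximate H100 SXM specs, the 70B bf16 model is a hypothetical, and KV-cache traffic and prefill are ignored, so treat it as a back-of-envelope illustration rather than a measurement.

    PEAK_FLOPS = 989e12      # ~989 TFLOP/s bf16 dense, H100 SXM (assumed spec)
    HBM_BW     = 3.35e12     # ~3.35 TB/s HBM3 bandwidth (assumed spec)
    PARAMS     = 70e9        # hypothetical 70B-parameter model
    BYTES_PER_PARAM = 2      # bf16 weights

    def decode_step(batch_size):
        # One autoregressive decode step: weights are streamed from HBM once,
        # and each request in the batch does ~2 FLOPs per parameter.
        flops = 2 * PARAMS * batch_size
        bytes_moved = PARAMS * BYTES_PER_PARAM      # weight traffic only; KV cache ignored
        intensity = flops / bytes_moved             # FLOPs per byte ~= batch_size
        ridge = PEAK_FLOPS / HBM_BW                 # ~295 FLOPs/byte for these specs
        bound = "compute" if intensity > ridge else "memory"
        step_time = max(flops / PEAK_FLOPS, bytes_moved / HBM_BW)
        return bound, step_time

    for b in (1, 32, 256, 1024):
        bound, t = decode_step(b)
        print(f"batch={b:5d}  {bound}-bound  step ~{t*1e3:6.1f} ms  "
              f"aggregate ~{b/t:10,.0f} tok/s  per-user ~{1/t:5.1f} tok/s")

Under these assumptions, decode stays memory bound until the batch size reaches a few hundred, which is consistent with the intuition that small-batch decode is bandwidth limited while large batched matmuls eventually hit the compute roof.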



Batched inference will increase your overall throughput, but each user will still see the original per-request throughput number. It's not a memory vs compute issue in quite the same way it is in training; as far as I understand, it's more a function of the auto-regressive nature of transformer inference, which presents its own challenges.

If you have an H100 doing 100 tokens/sec per request and you batch 1000 requests, you might get to 100K tok/sec overall, but each user's request will still be outputting 100 tokens/sec, so the speed of the response stream stays the same. If your output stream speed is slow, batching might not improve the user experience, even if you get higher chip utilization / "overall" throughput.
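
To make that arithmetic concrete, here is a trivial sketch using the same hypothetical numbers (100 tok/s per request, batch of 1000); the 500-token response length is an added assumption for illustration.

    per_request_rate = 100    # tokens/sec seen by a single user (hypothetical)
    batch_size = 1000         # concurrent requests sharing the GPU (hypothetical)
    response_len = 500        # tokens in one response (assumed for illustration)

    aggregate_rate = per_request_rate * batch_size    # chip-level throughput
    user_wait = response_len / per_request_rate       # what one user actually waits

    print(f"aggregate throughput: {aggregate_rate:,} tok/s")   # 100,000 tok/s
    print(f"per-user stream:      {per_request_rate} tok/s")   # unchanged by batching
    print(f"a {response_len}-token reply still streams for ~{user_wait:.0f} s")

The aggregate number scales with the batch, but the wait a single user experiences is set entirely by the per-request rate.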



