
No, it's not that it's especially compute-heavy; it's that the model expects to work on 30-second samples. So if you want sub-second latency, you have to run 30 seconds' worth of processing more than once a second, which multiplies the problem up. If you can't offload it to a GPU, it's painfully inefficient.
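A back-of-envelope sketch of why the fixed window hurts (the 0.5 s update interval is my assumption for "sub-second latency", not a figure from the thread):

```python
# If a model only accepts fixed 30 s windows, streaming with low latency
# means re-running the full window on every update.
window_s = 30.0  # fixed input length the model expects
update_s = 0.5   # assumed refresh interval for sub-second latency

# Seconds of audio the model must process per second of wall-clock time:
overhead = window_s / update_s
print(overhead)  # each second of audio is effectively processed ~60x
```

So even a model that is cheap per window becomes expensive once you pay for it 60 times over.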

As to why that might matter: my single 4090 is mostly occupied by a Mixtral instance, and I don't especially want to take any compute away from that.




For minimum latency you want a recurrent model that works in the time domain. A Mamba-like model could do it.
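A toy sketch of the idea (this is not real Mamba, just a plain linear recurrence over raw samples; all matrices and sizes here are made up for illustration): the state update is O(1) per sample, so latency is one sample and there is no 30-second window to re-process.

```python
import numpy as np

rng = np.random.default_rng(0)
state_size = 16  # arbitrary toy dimension

A = rng.normal(scale=0.1, size=(state_size, state_size))  # state transition
B = rng.normal(size=(state_size, 1))                      # input projection
C = rng.normal(size=(1, state_size))                      # output readout

h = np.zeros((state_size, 1))  # recurrent state carried between samples

def step(x):
    """Consume one time-domain sample, emit one output -- streaming by construction."""
    global h
    h = A @ h + B * x
    return float(C @ h)

outputs = [step(x) for x in rng.normal(size=100)]  # 100 samples, constant cost each
```

Because the state is carried forward instead of recomputed, per-sample cost stays flat no matter how long the stream runs.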



