No, it's not that it's especially compute-heavy; it's that the model expects to work on 30-second samples. So if you want sub-second latency, you have to do 30 seconds' worth of processing more than once a second, which multiplies the problem up. If you can't offload that to a GPU, it's painfully inefficient.
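To make the multiplication concrete, here's a rough back-of-the-envelope sketch. It assumes a model that always consumes a fixed 30-second window regardless of how much new audio has arrived (the Whisper-style behavior described above); the function and constant names are just illustrative.

```python
# How much audio the model has to crunch per wall-clock second
# when the full fixed window is re-run at every polling interval.

WINDOW_S = 30.0  # the model always processes a full 30 s window


def audio_seconds_per_wall_second(poll_interval_s: float) -> float:
    """Seconds of model-audio processed per wall-clock second when the
    whole window is re-processed every poll_interval_s seconds."""
    return WINDOW_S / poll_interval_s


# Polling twice a second for sub-second latency:
print(audio_seconds_per_wall_second(0.5))  # 60.0
```

So at two polls per second the box is effectively transcribing 60 seconds of audio every wall-clock second, a 60x real-time workload, which is why doing it on CPU alongside anything else hurts.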
As to why that might matter: my single 4090 is occupied with most of a Mixtral instance, and I don't especially want to take any compute away from that.