No, it's not that it's especially compute-heavy; it's that the model expects to work on 30-second samples. So if you want sub-second latency, you have to do 30 seconds' worth of processing more than once a second, which multiplies the problem up. If you can't offload that to a GPU, it's painfully inefficient.
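To make the multiplication concrete, here's a rough back-of-the-envelope sketch. It assumes a model that always consumes a fixed 30-second window regardless of how much new audio has arrived (the Whisper-style behavior described above); the function and constant names are just illustrative.

```python
# How much audio the model has to crunch per wall-clock second
# when the full fixed window is re-run at every polling interval.

WINDOW_S = 30.0  # the model always processes a full 30 s window


def audio_seconds_per_wall_second(poll_interval_s: float) -> float:
    """Seconds of model-audio processed per wall-clock second when the
    whole window is re-processed every poll_interval_s seconds."""
    return WINDOW_S / poll_interval_s


# Polling twice a second for sub-second latency:
print(audio_seconds_per_wall_second(0.5))  # 60.0
```

So at two polls per second the box is effectively transcribing 60 seconds of audio every wall-clock second, a 60x real-time workload, which is why doing it on CPU alongside anything else hurts.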
As to why that might matter: my single 4090 is occupied with most of a Mixtral instance, and I don't especially want to take any compute away from that.