Can't we have an audio codec that first sends a model of the particular voice, and then starts to send bits corresponding to the actual speech?


This is actually an old idea, minus the AI angle; it goes back to the 1930s. It's what voders and vocoders were originally designed for, before Kraftwerk et al. found out you could use them to make cool robot voices.


You need a bunch of bandwidth upfront for that, which you might not have, and enough compute available at the other end to reconstruct it, which you really might not have.


Regarding your first point, how about reserving a small percentage of the bandwidth for a model that improves incrementally?
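
In rough Python, that split might look something like this (purely a sketch: the frame size, the 5% share, and every name here are made up, not taken from any real codec):

    FRAME_BYTES = 200            # hypothetical per-frame byte budget
    MODEL_SHARE = 0.05           # reserve ~5% of every frame for model data

    def pack_frame(speech: bytes, model_stream: bytes, offset: int):
        """Fill one frame with speech plus the next slice of the voice model."""
        budget = int(FRAME_BYTES * MODEL_SHARE)
        chunk = model_stream[offset:offset + budget]
        frame = {"speech": speech, "model_chunk": chunk, "model_offset": offset}
        return frame, offset + len(chunk)

    def on_frame(frame, received_model: bytearray, decode_generic,
                 decode_personal, model_size: int):
        """Receiver: decode generically until enough of the model has arrived."""
        received_model.extend(frame["model_chunk"])
        if len(received_model) >= model_size:
            return decode_personal(frame["speech"], bytes(received_model))
        return decode_generic(frame["speech"])

The receiver keeps decoding with a generic codec until enough model data has accumulated, so the call stays intelligible from the first frame while the personalized model trickles in.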


You're adding more complexity to both the transmitter and the receiver. I'd be pretty pissed if I had to endure unintelligible speech for a few minutes until enough of the model had downloaded to actually hear my friend. I'd also be a little pissed if I had to store massive models for everyone in my call log. And both devices need to be able to run the model: if you're regularly talking over a shit signal, you're probably not going to be able to afford the latest flagship phone with the hardware necessary to run it (which is exactly what the article touches on).

The ideal codec takes up almost no bandwidth, sounds like you're sitting next to the caller, and runs on the average budget smartphone/PC. Since you aren't going to get all three, you choose the codec that best balances complexity, quality, and bandwidth for the situation. Link quality improves? Improve your voice quality by increasing the codec bitrate, or switch to a less complex codec to save battery. If both devices are capable of running an ML codec, use that to improve quality and fit within the given bandwidth.
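
Something like this for the selection logic, where the thresholds and codec choices are invented for illustration rather than taken from any real standard:

    def pick_codec(link_kbps: float, both_support_ml: bool, battery_low: bool) -> str:
        if link_kbps < 12 and both_support_ml and not battery_low:
            return "ML codec @ 3 kbit/s"    # squeeze into a bad link, if both ends can run it
        if link_kbps < 12:
            return "AMR-NB @ 7.95 kbit/s"   # narrowband fallback, cheap to decode
        if link_kbps < 32:
            return "Opus @ 16 kbit/s"       # wideband, modest CPU cost
        return "Opus @ 32 kbit/s"           # good link: spend bits, not battery

    print(pick_codec(link_kbps=9, both_support_ml=True, battery_low=False))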



