My company, Modulate, is building "voice skins" for exactly this purpose: customizing your voice in chat for online videogames!
It's a really interesting technical challenge. On the one hand, changing your timbre to another human voice is much more complicated than basic pitch shifting, so we ended up using deep neural networks in a kind of simultaneous speech-recognition-and-speech-synthesis approach. (Training this system to preserve, e.g., the emotional complexity of your input speech while still swapping out the timbre convincingly is difficult: we use adversarial training on the raw audio waveform, which is powerful but pretty much unknown territory compared to images.)
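To see why basic pitch shifting falls short: a naive resample-based shift moves the pitch, but it drags everything else (formants included) along with it, so you get a sped-up-sounding version of the same speaker rather than a new timbre. A minimal sketch of that naive approach, using a pure sine tone as a stand-in for speech:

```python
import numpy as np

def naive_pitch_shift(wave, semitones):
    """Resample-based pitch shift: moves the pitch but also stretches
    formants and duration, so a voice keeps its original identity."""
    factor = 2 ** (semitones / 12)
    idx = np.arange(0, len(wave), factor)
    # linear interpolation at the resampled positions
    return np.interp(idx, np.arange(len(wave)), wave)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)    # 220 Hz "voice", 1 second
up = naive_pitch_shift(tone, 12)      # one octave up -> ~440 Hz, half as long
```

Note the output is also half the length: pitch and time are coupled in this method, which is exactly the kind of artifact a learned synthesis approach avoids.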
On the other hand, it's important to run with very low latency on your device while you're playing, which means that we can't simply "throw the biggest network we can" at the problem. So we have a tradeoff between model latency, which is easy to characterize, and audio quality / voice skin plausibility, which is pretty ambiguous and subjective.
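The latency half of that tradeoff really is easy to characterize with back-of-the-envelope arithmetic: a bigger network tends to want more future context and more compute per frame, and both add directly to the delay the player hears. A toy budget (every number below is an illustrative assumption, not a real product figure):

```python
# Back-of-the-envelope latency budget for streaming voice conversion.
# All numbers are illustrative assumptions.
sr = 48_000           # samples per second
hop = 480             # 10 ms of audio per processing frame
lookahead_frames = 2  # frames of future context the network waits for
inference_ms = 4.0    # assumed per-frame model compute time

# You must buffer the current frame plus any lookahead before you can emit.
buffering_ms = 1000 * hop * (1 + lookahead_frames) / sr
total_ms = buffering_ms + inference_ms

print(f"added latency: {total_ms:.0f} ms per frame")
```

Even with these forgiving assumptions, two extra frames of lookahead triple the buffering delay, which is why "throw the biggest network we can" isn't an option for live chat.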
Finally, as this kind of tech improves, the potential for misuse becomes an important problem, so we need to build in protections (like watermarking the audio) that can help prevent fraud while not compromising the speed of the algorithm or the quality of the output audio.
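As a rough illustration of the watermarking idea (not Modulate's actual scheme): a classic spread-spectrum watermark adds a low-amplitude pseudorandom pattern derived from a secret key, which a detector holding the same key can recover by correlation while an eavesdropper sees only noise.

```python
import numpy as np

def embed_watermark(audio, key, strength=0.002):
    """Add a keyed pseudorandom +/-1 pattern at low amplitude."""
    rng = np.random.default_rng(key)
    pattern = rng.choice([-1.0, 1.0], size=len(audio))
    return audio + strength * pattern

def detect_watermark(audio, key):
    """Normalized correlation with the keyed pattern:
    near 0 for unmarked audio, near `strength` for marked audio."""
    rng = np.random.default_rng(key)
    pattern = rng.choice([-1.0, 1.0], size=len(audio))
    return float(np.dot(audio, pattern) / len(audio))

clean = np.random.default_rng(0).normal(0.0, 0.1, 48_000)  # 1 s of "speech"
marked = embed_watermark(clean, key=42)
```

A real deployment would shape the pattern perceptually and make it survive compression and re-recording; this toy version only shows the key-and-correlate structure, and why the added signal can be kept far below audibility.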
> Finally, as this kind of tech improves, the potential for misuse becomes an important problem, so we need to build in protections (like watermarking the audio) that can help prevent fraud while not compromising the speed of the algorithm or the quality of the output audio.
Legitimate question (absolutely not rhetorical, I don't know the answer!):
What is the potential misuse of a gender voice changer? If someone wants to change their perceived gender in a video game (presumably to avoid harassment), should that be detectable? If both genders should ideally receive identical treatment anyway, is there any harm in swapping?
I should definitely expand a bit: the voice skins that we're building give you a _specific_ person's timbre (or you can do some cool "voice space" vector manipulations to combine timbres). The cool application here is being able to sound like a character or celebrity in the game, but the risk of misuse from having a specific person's vocal cords is much greater than the risk from just swapping your gender.
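Those "voice space" manipulations are easy to picture if you think of each voice as a point in a learned embedding space: blending two voices is then just interpolation between points. A sketch with made-up 4-dimensional embeddings (a real system would learn much higher-dimensional ones from data):

```python
import numpy as np

# Made-up 4-d speaker embeddings, purely for illustration.
voice_a = np.array([0.9, -0.2, 0.4, 0.1])   # e.g. a celebrity timbre
voice_b = np.array([0.1,  0.6, -0.3, 0.5])  # e.g. a game character

def blend(a, b, t):
    """Linearly interpolate in voice space: t=0 gives a, t=1 gives b."""
    return (1.0 - t) * a + t * b

halfway = blend(voice_a, voice_b, 0.5)  # a timbre "between" the two
```

The synthesis network would then condition on `halfway` instead of either original embedding, producing a voice that shares properties of both.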
That said, there _are_ some interesting things to be careful around, even for changing your gender, or age, or other basic variables of your voice. We're mostly worried about what impact this would have on communities built around these kinds of commonalities: for example, is it okay for a child to masquerade as an adult in an adults-only social group? I don't think there's a clear answer for all of those situations, but until we see more use of realistic voice skins in the real world, we're playing it safe and building in these kinds of tools!