I speak Brazilian Portuguese natively. I chose to record my voice saying a specific sentence and to "translate" it to Brazilian Portuguese using the exact same sentence. I was very pleased to find out that I became a Mineiro from the countryside, one of the coolest accents in Brazil!
The Brazilian Portuguese model is a bit of an extreme showcase (and thus really cool!), as it was trained on a single speaker (entirely recorded by the main author of the paper, Edresson Casanova, who's Brazilian).
The fact that it can do multi-lingual voice cloning at all in that case is already surprising. You can find more details in the project page [0] and paper [1]. And here's the corpus. [2]
There was a thread a while back about the need for "accent correction," meaning that native speakers with one accent could more easily consume content in the same accent. It looks like the technology exists now! This is worth money. If you find an accent that many people really dislike, the odds are that it's also very difficult for people to understand that accent (until they are accustomed to it).
The way it works is that a model is trained on a large number of voices. Then your specific voice is projected into that model's latent space.
That's why it can mimic your voice with only a few seconds of audio. It's not making a model, but rather using an existing model.
It may seem like a pedantic distinction, but it's why the model isn't as worrisome as it seems. It can't target you specifically, just the average voice near yours.
It's closer to a really talented parrot than a model that can impersonate you on command. I suspect if you try it out, you'll be surprised it's so far off from your actual voice.
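To make the "projected into latent space" point concrete, here is a toy sketch. It is a linear analogy only, not how this model works internally: the basis vectors and feature numbers are invented for illustration, and real systems use a learned neural speaker encoder with far more dimensions. What it shows is that nothing is trained on your audio; your voice is just mapped to the closest point the existing model can represent, and whatever falls outside that space is lost.

```python
import math

# Hypothetical "pretrained" latent basis: two orthonormal directions in a
# 3-dimensional voice-feature space. These numbers are made up; a real model
# learns its representation from many speakers.
BASIS = [
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def embed(features):
    """Project a feature vector onto the fixed basis (no training happens here)."""
    return [dot(features, b) for b in BASIS]

def reconstruct(embedding):
    """Map latent coordinates back into feature space."""
    dims = len(BASIS[0])
    return [sum(c * b[i] for c, b in zip(embedding, BASIS)) for i in range(dims)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

my_voice = (0.8, 0.3, 0.5)          # invented features from a short recording
z = embed(my_voice)                 # latent coordinates for "my" voice
approx = reconstruct(z)             # the nearest voice the model can produce
error = distance(my_voice, approx)  # the part of my voice the model cannot express

print(z, approx, error)
```

The third feature has no counterpart in the basis, so it is dropped, which is the toy version of "the average voice near yours".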
The English->Portuguese sample sounded nothing like me at all except for one syllable where it sounded like it was playing back a brief snip of what I had recorded.
The English->French version did a little better; it sounded like the voice had been influenced by mine in some small way.
English->English (saying a very different sentence to what I recorded) was pretty impressive though.
It's a little hopeless in the long term. Eventually these models will get so good that they could work with only a small recording of text vs the huge amount of transcribed audio currently needed.
I don't think so. At ANZ bank, you authorise every call to the bank by saying the phrase: "My voice confirms my identity". (You repeat this multiple times when registering)
Sadly no. It is becoming more and more common at banks and investing companies. I had the option to do it when dealing with a small company that a previous employer used to manage employee 401k accounts.
It's a valid concern to not want to give a random website a workable voice model. Just because you've talked on the phone or used speech-to-text before, doesn't make that concern invalid.
Upon reflection, I'm not so sure that our usage of our voice in real life can be dismissed so easily as a concern, even if the comment was intended to dismiss concerns about this website.
Probably the folks who could best use this nefariously are the folks we already know, who have much greater availability to our voice. Those folks are in the best situation to capitalize on a working voice model to, say, call our manager, bank branch, or local emergency operator. A random website would have to go to some effort to accumulate the needed information to use our voice for much, whereas someone who already knows us could have us fired for the contents of a phone call to the manager, up on charges for prank-calling 911, or worse.
But this website is not even asking for identifying information..
Yes, it could figure things out, but as privacy-conscious HN readers we have VPNs and such, right? =p
Just need a facebook pixel or a google font? Some company (already known for dark practices) could hide behind a "random website" to get more data. /conspiracy
And if they are using browser fingerprinting or other techniques big tech uses to de-anonymize users, then suddenly they have something to tie your voice to. Maybe the risk is low, but I prefer to err on the side of caution.
Tomorrow a paid tool or a costly, hidden company will allow anyone to get a statement in your voice (based on a sample). How are you going to prove that it is not you?
Fake calls to your relatives in your voice, or even fake video with your face and voice, asking for money or for illegal activities!
A few years later a company will come along and say: we can detect whether it's fake or not, pay $10,000 for the solution or get ready to go to prison. Oops, the legal system doesn't accept this as proof. Now what? Welcome to prison.
Both companies are making money, and you are paying with your money and your life.
I can see the government banning the use of voice as a password. I can't see it banning the tech. The criminals will use the tech regardless of whether it's banned. Looks like we'll need person-to-person authentication for our relatives soon.
It doesn't have to be illegal but I think some defensive regulation here is smart. Things people are concerned about may already be illegal. Imitation, identity theft, slander and so on. Think about the new layer it adds to domestic disputes and criminal investigations.
Perhaps a solution is a sound fingerprint requirement for voice imitation software so that it's easily identifiable in court if it's an imitation voice.
It's somewhat of a new frontier. Imagine that during a divorce proceeding your ex-partner fabricates voice recordings of you threatening the kids so you don't get custody. How do you protect yourself against that? How do you prove that's what happened? Soon enough it'll just be an app on their phone that they use to record your voice during a discussion, and that later spits out a sound file of you saying whatever they want you to say. That's clearly a socially dangerous tool.
Pretty interesting! I tried this both English -> French and French -> English.
English -> French seemed to work best, with the AI output having a very similar timbre to my real voice. Not hyperrealistic for me, but decent enough given I gave it a ~20s sample.
French -> English was less good in terms of the timbre and pitch of the voice---way higher than my real voice. It did have a bit of a Canadian accent, though, which is funny because I speak French with a Quebec accent. Maybe that's what I would sound like if I had a Canadian accent in English?
Funnily, I (native American English speaker who learned French in QC, and whose accent in French indicates this) tried it both ways. I think the accent is basically built in both ways, which makes sense, although it would be more interesting if it based your accent in the output off the phonology in the input.
I am French and I did try it, recording my voice in English (I have a thick French accent to English-speaking ears, OK for French ones). The result back in French was kind of good, even if it did sound almost like me with a slight American English accent.
Have you investigated whether this is useful for language learning? Presumably it ought to be easier to try to emulate (and compare and contrast) speech in "your own voice" (with a native accent) than someone else's. Another useful feature to this end might be to emulate how your voice sounds to you (rather than other people); not sure how difficult that is.
Indeed! I tried it with some French and was impressed. After recording in English and synthesizing a short sentence I tried to record and speak using the same intonation/speed as the generated French audio. It matches almost perfectly. Except of course for the bg music I don’t think anyone could discern which one was real and which one was fake. It didn’t work for all sentences, and there were some obvious glitches, but for the pieces where it did it was quite freaky. Also, hearing the French sentence in my own voice made it quite easy to pronounce it correctly. When I try this using for example the Google Translate TTS it’s much much harder.
This is one of those ideas that seems obvious when you hear it, and I'm also pissed I didn't think of it. It also seems like a key component of a universal translator. This + VTT + a phone sounds like it'd put UN translators out of business (:) yeah I know, nuance probably matters there).
This is amazing. I can't wait until this is used to dub TV shows, so we get the original actors' voices, especially for shows like Squid Game that had such terrible dubs.
It also misses that vocal inflection and timing is part of what makes a solid dub. Even if the translation was amazing, with all the subtleties of the language conferred somehow, you still have to get that right for it to be convincing. Otherwise you could end up with solid dialog such as Pride and Prejudice, as delivered by Tommy Wiseau or Christopher Walken or something ("Those. Who do not. ComPLAIN. Are never pitied.")
Cool, it's impressive how much it can do with a short sample, although this seems like an easy way for end users to deepfake their friends / enemies saying something.
Maybe the solution is to have a randomly generated paragraph of text to read which expires in short amount of time. So you can't predict it and you don't have enough time to splice together a fake reading from something else.
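A rough sketch of the expiring random paragraph idea, in case it helps. Everything here is an assumption for illustration: the word list, the 60-second lifetime, and the function names are made up, and a real service would want a much larger vocabulary and server-side storage of issued challenges.

```python
import secrets
import time

# Hypothetical word list; a real deployment would use a far larger one.
WORDS = ["river", "candle", "orange", "window", "planet", "silver", "garden", "marble"]
TTL_SECONDS = 60  # assumed lifetime of a challenge

def issue_challenge(now=None, n_words=6):
    """Create an unpredictable phrase plus the time after which it is rejected."""
    now = time.time() if now is None else now
    text = " ".join(secrets.choice(WORDS) for _ in range(n_words))
    return {"text": text, "expires_at": now + TTL_SECONDS}

def is_fresh(challenge, now=None):
    """A recording is only accepted while the challenge has not expired."""
    now = time.time() if now is None else now
    return now <= challenge["expires_at"]

challenge = issue_challenge()
print(challenge["text"], is_fresh(challenge))
```

Using `secrets` rather than `random` keeps the phrase hard to predict, and the short lifetime is what limits the time available to splice a fake reading together.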
The problem with any anti-abuse measure is that someone can create another project which does not have any of this. There are a handful of projects which can do pretty good voice synthesis right now. It would be about as easy as getting a consensus for all photo editing tools to place a watermark on the image to prevent abuse.
As someone who actually speaks two languages - gave it a voice sample in Polish, then used it to synthesize the voice in English - sounds absolutely nothing like me. Meh.
My 26 second training input perhaps wasn't enough. The result sounded like someone else. Is the result some kind of merger of my voice and a native speaker's?
Similarity depends on many factors: recording quality, which language you're synthesizing in (models trained on more speakers do better), and diversity of prosody in your recording. Try recording for a bit longer and "acting out" a bit in your tone, that tends to give me interesting results :)
I appreciate the effort here, but it almost feels hopeless: so many groups are able to build voice synthesis right now that the tech has fallen into the common person's hands, and some of them won't make any effort to stop abuse.
Maybe if we can get watermarked stuff out first and the average person gets up to speed with what tech can do, we can all adjust our expectations before the real wave of abuse hits.
Same here. It even mispronounced basic French words, and inserted some background music similar to what you can hear on the CDs with exercises that come with those "foreign language for beginners" textbooks.
To help prevent malicious use, consider presenting the user with specific (randomly-generated) text to read aloud, and check (with speech to text) that they actually read that, instead of allowing them to say whatever they want.
That will help ensure that this is only being used by the person visiting the web page.
(That will only help with the hosted version, of course, not if you make the model code/weights available. I didn't generate this idea myself but also can't remember where I saw it. I think it was from someone offering a similar service.)
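The read-aloud check could look roughly like this. It is only a sketch under assumptions: the transcript is taken to come from whatever speech-to-text system the site already has (not shown), the function names are hypothetical, and the 0.8 threshold is an arbitrary starting point that would need tuning against real recognition errors.

```python
import difflib
import re

def normalize(text):
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def matches_prompt(prompt, transcript, threshold=0.8):
    """Return True if the transcript is close enough to the prompted text.

    Comparison is word-level, so small speech-to-text mistakes are tolerated
    while a recording of an unrelated sentence is rejected.
    """
    ratio = difflib.SequenceMatcher(
        None, normalize(prompt), normalize(transcript)
    ).ratio()
    return ratio >= threshold

print(matches_prompt("The quick brown fox jumps", "the quick brown fox jumps."))
print(matches_prompt("The quick brown fox jumps", "I authorize this transfer"))
```

A fuzzy match is used instead of strict equality because speech recognition rarely returns the prompt word for word.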
This will be great for foreign movies. While I still prefer subtitles, for those who watch with dubs, it’ll be amazing to hear the actor’s “real” voice.
I spoke for about a minute in English, having no idea what is the ideal length for it to properly figure out my voice. The result sounded like someone else completely. There was also some strange music in the background, which made me think that it was playing back a recording of a real person speaking! A real person who's not me.
An interesting reflection is how quickly research around TTS/STT has progressed. I remember reading [0] thinking we were a long ways away. And things will get way better with multi-task learning and multi-modal learning in the coming years (or months really).
In fact, just a year after this post was written, CoquiAI started their open source projects [1].
Interesting. I like the addition of music to make sure it's not just a raw voice sample. The output I get seems to be a mix of a native speaker and my voice, because my (thick) accent is being filtered out.
I suppose that if I ever take proper English pronunciation classes, I now know what to strive for.
Off topic, but this reminded me: What ever happened to that thing that Google demoed where its robots would call restaurants and make reservations for you? Did that ever find its way into Android, or another product?
I put my voice in in English and asked for English back. The output mostly sounded the same, but the interesting thing was that it had some music playing in the background, like the ambient kind you might hear in a YT video while a narrator talks.
I recorded my voice in English and had it converted into French,
but after the conversion I heard music in the audio once my voice had played. What was it?
Btw, the idea is really cool; it's like hearing how you would speak in the same tone in other languages.
Very cool! If I were looking for a side project, I'd extend this, add a DeepL integration for automating translations, add some voice models for other languages/people and wrap it as a mobile app where people could pay to unlock the voice models.
Discourages low-hanging, hit-and-run usage that's likely to get their site shut down.
If someone wants to fake a statement there are already 100 ways to do it. Not making their servers the ones doing the deed puts a meaningful barrier in place for more casual misuse. And for serious cases like impersonation on a large scale, the resources are there to likely do better than this instant feedback model can.
You're free to enter any input sentence you want in the text box.
The input sentence generally should be in the language you selected from the dropdown. For example, if the dropdown has "French" selected you could enter the text "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
Clicking "Submit" then generates a TTS reading of the sentence you input in the language selected from the dropdown.
For fun you can mix and match. In other words, select a language from the drop down and enter text in the text box not in the language selected from the dropdown. (For example, the dropdown could have "French" selected and the sentence could be "O say can you see, by the dawn's early light". This gives interesting results, it sounds as if a native French speaker is speaking English.)