Show HN: Clone your voice and speak a foreign language

SwiftyBug · on Jan 3, 2022

I speak Brazilian Portuguese natively. I chose to record my voice saying a specific sentence and to "translate" it to Brazilian Portuguese using the exact same sentence. I was very pleased to find out that I became a Mineiro from the countryside, one of the coolest accents in Brazil!

reubenmorais · on Jan 3, 2022

The Brazilian Portuguese model is a bit of an extreme showcase (and thus really cool!), as it was trained on a single speaker (entirely recorded by the main author of the paper, Edresson Casanova, who's Brazilian).

The fact that it can do multi-lingual voice cloning at all in that case is already surprising. You can find more details in the project page [0] and paper [1]. And here's the corpus. [2]

[0] https://edresson.github.io/YourTTS/

[1] https://arxiv.org/abs/2112.02418

[2] https://edresson.github.io/TTS-Portuguese-Corpus/

bernardom · on Jan 4, 2022

This is very cool. I recorded myself in Portuguese -> Portuguese and got the same result.

I also did Pt -> En and sounded like... me speaking English, though with some artifacts. VERY cool.

garfieldnate · on Jan 4, 2022

There was a thread a while back about the need for "accent correction," meaning that native speakers with one accent could more easily consume content in the same accent. It looks like the technology exists now! This is worth money. If you find an accent that many people really dislike, the odds are that it's also very difficult for people to understand that accent (until they are accustomed to it).

actually_a_dog · on Jan 3, 2022

You spoke Portuguese into it and it just changed your accent? That's kinda cool.

themodelplumber · on Jan 3, 2022

I can speak just enough to know how I sound, and was surprised to hear that accent too :D

tambourine_man · on Jan 3, 2022

I'm curious but a bit afraid to test it out.

The idea of having a model of my voice out there that can say whatever is written in a text box is scary.

sillysaurusx · on Jan 4, 2022

FWIW, it's not quite a model of your voice.

The way it works is, a model is trained on all possible voices. Then your specific voice is projected into latent space.

That's why it can mimic your voice with only a few seconds of audio. It's not making a model, but rather using an existing model.

It may seem like a pedantic distinction, but it's why the model isn't as worrisome as it seems. It can't target you specifically, just the average voice near yours.

It's closer to a really talented parrot than a model that can impersonate you on command. I suspect if you try it out, you'll be surprised it's so far off from your actual voice.
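
To make the "projected into latent space" part concrete, here is a toy sketch. It is not the real speaker encoder used by the demo; the 3-dimensional "features" and every name in it are made up for illustration. The point is only that a clip gets reduced to one vector in a space shared by all speakers, and nothing is trained per user.

```python
import math

def embed(frames):
    """Average per-frame feature vectors into one unit-length 'speaker embedding'.

    Assumption: real systems use a trained neural speaker encoder here;
    plain averaging is only a stand-in to show the shape of the idea.
    """
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (just their dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical acoustic features, two frames per clip.
alice_clip = [[0.9, 0.1, 0.2], [1.0, 0.0, 0.3]]
alice_other = [[0.8, 0.2, 0.2], [0.95, 0.05, 0.25]]
bob_clip = [[0.1, 0.9, 0.4], [0.0, 1.0, 0.5]]

same = cosine(embed(alice_clip), embed(alice_other))
different = cosine(embed(alice_clip), embed(bob_clip))
print(same > different)  # clips from one speaker land closer together
```

A shared synthesizer would then be conditioned on that vector, which is why two similar-sounding people can end up with nearly the same output voice.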

jrumbut · on Jan 4, 2022

I took the leap.

The English->Portuguese sample sounded nothing like me at all except for one syllable where it sounded like it was playing back a brief snip of what I had recorded.

The English->French version did a little bit better, it sounded like the voice had been influenced by mine in some small way.

English->English (saying a very different sentence to what I recorded) was pretty impressive though.

Gigachad · on Jan 4, 2022

It's a little hopeless in the long term. Eventually these models will get so good that they could work with only a small recording of text vs the huge amount of transcribed audio currently needed.

alfiedotwtf · on Jan 3, 2022

My Voice Is My Passport Verify Me

reustle · on Jan 4, 2022

Schwab Bank makes you say "At Schwab, my voice is my password" before every customer support call.

ljm · on Jan 4, 2022

Voice captchas where you repeat some corporate slogan are a dystopian future I want no part of.

Hitton · on Jan 4, 2022

MCDONALD'S! https://www.reddit.com/r/ABoringDystopia/comments/7k6v05/son...

dawnerd · on Jan 4, 2022

Seriously? I’d run away from that in a heartbeat.

yeetaccount4 · on Jan 4, 2022

Is this humor?

beanaroo · on Jan 4, 2022

I don't think so. At ANZ bank, you authorise every call to the bank by saying the phrase: "My voice confirms my identity". (You repeat this multiple times when registering)

trompetenaccoun · on Jan 4, 2022

Oh boy...

zdragnar · on Jan 4, 2022

Sadly no. It is becoming more and more common at banks and investing companies- I had the option to do it when dealing with a small company a previous employer used to manage employee 401k accounts.

At least there, it was optional.

smrtinsert · on Jan 4, 2022

That film gives me warm "I'm so glad I'm a techie" vibes. Something about the 90s bay area that is so perfect.

clan · on Jan 4, 2022

Setec Astronomy

Baeocystin · on Jan 4, 2022

Cooties Rat...

kingcharles · on Jan 4, 2022

"Computer. Activate self-destruct sequence, voice authorization Picard delta five."

chrischen · on Jan 4, 2022

If all it takes is a short sample of your voice then you're probably already screwed. May as well have fun with it!

mikotodomo · on Jan 4, 2022

I didn't think of that. Luckily I haven't tried it out yet.

colecut · on Jan 3, 2022

Should you never speak to be sure you aren't recorded?

daenz · on Jan 3, 2022

Appeal to ridicule.

It's a valid concern to not want to give a random website a workable voice model. Just because you've talked on the phone or used speech-to-text before, doesn't make that concern invalid.

syntheticnature · on Jan 4, 2022

Upon reflection, I'm not so sure that our usage of our voice in real life can be dismissed so easily as a concern, even if the comment was intended to dismiss concerns about this website.

Probably the folks who could best use this nefariously are the folks we already know, who have much greater availability to our voice. Those folks are in the best situation to capitalize on a working voice model to, say, call our manager, bank branch, or local emergency operator. A random website would have to go to some effort to accumulate the needed information to use our voice for much, whereas someone who already knows us could have us fired for the contents of a phone call to the manager, up on charges for prank-calling 911, or worse.

colecut · on Jan 4, 2022

But this website is not even asking for identifying information.. yes it could figure things out, but as privacy conscious HN readers we have VPNs and such right? =p

k_ · on Jan 4, 2022

Just need a facebook pixel or a google font? Some company (already known for dark practices) could hide behind a "random website" to get more data. /conspiracy

14 · on Jan 4, 2022

And if they are using browser fingerprinting or other techniques big tech uses to de-anonymize users, then suddenly they have something to tie your voice to. Maybe the risk is low, but I prefer to err on the side of caution.

imapeopleperson · on Jan 3, 2022

This is a perfect example of when the law shouldn’t be so far behind the tech.

creato · on Jan 3, 2022

Exactly which part of this do you think should be illegal?

shanlalit · on Jan 4, 2022

Tomorrow a paid tool or a costly, hidden company will allow anyone to get a statement in your voice (based on a sample). How are you going to prove that it is not you?

Fake calls to your relatives in your voice, or even fake video with your face and voice, asking for money, or for illegal activities!

A few years later a company will come along and say: we can detect whether it's fake, pay $10,000 for the solution or get ready to go to prison. Oops, the legal system doesn't accept this as proof. Now what? Welcome to prison.

Both companies are making money, and you are paying with your money and your life.

asiachick · on Jan 4, 2022

What's your suggested solution?

I can see the government banning using voice as a password. I can't see it banning the tech. The criminals will use the tech regardless of if it's banned. Looks like we'll need person to person authentication for our relatives soon.

creato · on Jan 4, 2022

> How are you going to prove that it is not you?

If technology like this is plausible, then the recording shouldn't be considered a statement by me in the first place.

People are just going to have to learn not to trust audio. People adapted to photoshop, they'll adapt to this.

ajuc · on Jan 4, 2022

It's been possible to forge paper signatures since forever, yet they're still used.

taneq · on Jan 4, 2022

Yeah, but history has shown that people do not, in fact, do that.

ehnto · on Jan 4, 2022

It doesn't have to be illegal but I think some defensive regulation here is smart. Things people are concerned about may already be illegal. Imitation, identity theft, slander and so on. Think about the new layer it adds to domestic disputes and criminal investigations.

Perhaps a solution is a sound fingerprint requirement for voice imitation software so that it's easily identifiable in court if it's an imitation voice.

It's somewhat of a new frontier. Imagine during a divorce proceeding your ex-partner fabricates voice recordings of you threatening the kids so you don't get custody. How do you protect yourself against that, and how do you prove that's what happened? Soon enough it'll just be an app on their phone that they use to record your voice during a discussion, then later spits out a sound file of you saying whatever they want you to say. That's clearly a socially dangerous tool.

xiii1408 · on Jan 4, 2022

Pretty interesting! I tried this both English -> French and French -> English.

English -> French seemed to work best, with the AI output having a very similar timbre to my real voice. Not hyperrealistic for me, but decent enough given I gave it a ~20s sample.

French -> English was less good in terms of the timbre and pitch of the voice---way higher than my real voice. It did have a bit of a Canadian accent, though, which is funny because I speak French with a Quebec accent. Maybe that's what I would sound like if I had a Canadian accent in English?

mod50ack · on Jan 4, 2022

Funnily, I (native American English speaker who learned French in QC, and whose accent in French indicates this) tried it both ways. I think the accent is basically built in both ways, which makes sense, although it would be more interesting if it based your accent in the output off the phonology in the input.

bengalister · on Jan 3, 2022

I am French and I did try it, recording my voice in English (I have a thick French accent to English-speaking ears, OK for French ones). And the result back in French was kind of good, even if it did sound almost like me with a slight American English accent.

patrec · on Jan 4, 2022

Have you investigated whether this is useful for language learning? Presumably it ought to be easier to try to emulate (and compare and contrast) speech in "your own voice" (with a native accent) than someone else's. Another useful feature to this end might be to emulate how your voice sounds to you (rather than other people); not sure how difficult that is.

jstsch · on Jan 4, 2022

Indeed! I tried it with some French and was impressed. After recording in English and synthesizing a short sentence I tried to record and speak using the same intonation/speed as the generated French audio. It matches almost perfectly. Except of course for the bg music I don’t think anyone could discern which one was real and which one was fake. It didn’t work for all sentences, and there were some obvious glitches, but for the pieces where it did it was quite freaky. Also, hearing the French sentence in my own voice made it quite easy to pronounce it correctly. When I try this using for example the Google Translate TTS it’s much much harder.

grogenaut · on Jan 3, 2022

This is one of those ideas that seems obvious when you hear it and also I'm pissed I didn't think about it. It also seems like a key component to a universal translator. This + VTT + a phone sounds like it'd put UN translators out of business (:) yeah I know, nuance probably matters there).

forgotmyoldacc · on Jan 3, 2022

This is called end-to-end speech translation, and has been around since 2017. Here's an article from 2019: https://www.technologyreview.com/2019/05/20/103054/google-ai...

Ninjinka · on Jan 4, 2022

This is amazing. I can't wait until this is used to dub TV shows, so we get the original actors' voices, especially for shows like Squid Game that had such terrible dubs.

var_cw · on Jan 4, 2022

Don't you think that's more of a translation problem rather than how well it was spoken?

lostcolony · on Jan 4, 2022

It also misses that vocal inflection and timing is part of what makes a solid dub. Even if the translation was amazing, with all the subtleties of the language conferred somehow, you still have to get that right for it to be convincing. Otherwise you could end up with solid dialog such as Pride and Prejudice, as delivered by Tommy Wiseau or Christopher Walken or something ("Those. Who do not. ComPLAIN. Are never pitied.")

alonmln · on Jan 3, 2022

Cool, it's impressive how much it can do with a short sample, although this seems like an easy way for end users to deep fake their friends / enemies saying something.

kdavis · on Jan 3, 2022

Currently we’re looking at possible solutions, see for example here[1]. If you have suggestions, feel free to chime in!

In the demo we specifically disallowed bulk uploads to hinder such abuses.

[1] https://github.com/coqui-ai/TTS/discussions/1036

tiborsaas · on Jan 3, 2022

I tested it with your comment: https://sndup.net/mghy/ :)

It's also a new possibility to somewhat personalize the text to speech engines. The above example is not really close to my voice.

Philip-J-Fry · on Jan 3, 2022

Maybe the solution is to have a randomly generated paragraph of text to read which expires in a short amount of time. So you can't predict it and you don't have enough time to splice together a fake reading from something else.
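
A minimal sketch of what I mean, assuming a server-side word list and a 30-second window (the word list, the lifetime, and the function names are all made up here, not anything the demo actually does):

```python
import secrets
import time

# Hypothetical word list; a real deployment would want a much larger one.
WORDS = ["amber", "river", "falcon", "copper", "meadow", "lantern",
         "orbit", "velvet", "harbor", "thistle", "quartz", "summit"]

def issue_challenge(n_words=6, ttl_seconds=30.0, now=None):
    """Pick an unpredictable phrase and stamp it with an expiry time."""
    now = time.time() if now is None else now
    phrase = " ".join(secrets.choice(WORDS) for _ in range(n_words))
    return {"phrase": phrase, "expires_at": now + ttl_seconds}

def is_live(challenge, now=None):
    """A recording is only accepted while the challenge has not expired."""
    now = time.time() if now is None else now
    return now <= challenge["expires_at"]

challenge = issue_challenge()
print(challenge["phrase"], is_live(challenge))
```

The `now` parameter is only there so the expiry logic can be checked without waiting. Using `secrets` rather than `random` keeps the phrase hard to guess in advance.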

Gigachad · on Jan 4, 2022

The problem with any anti abuse measure is someone can create another project which does not have any of this. There are a handful of projects which can do pretty good voice synthesis right now. It would be about as easy as getting a consensus for all photo editing tools to place a watermark on the image to prevent abuse.

gambiting · on Jan 4, 2022

As someone who actually speaks two languages - gave it a voice sample in Polish, then used it to synthesize the voice in English - sounds absolutely nothing like me. Meh.

sxv · on Jan 3, 2022

My 26 second training input perhaps wasn't enough. The result sounded like someone else. Is the result some kind of merger of my voice and a native speaker's?

reubenmorais · on Jan 3, 2022

Similarity depends on many factors: recording quality, which language you're synthesizing in (models trained on more speakers do better), and diversity of prosody in your recording. Try recording for a bit longer and "acting out" a bit in your tone, that tends to give me interesting results :)

fnord77 · on Jan 4, 2022

"At Schwab, my voice is my password" [1]

[1] https://www.schwab.com/voice-id

echelon · on Jan 4, 2022

That's got to be among the worst ideas I've ever seen.

https://fakeyou.com/tts/result/TR:eyfam30e255zxy69vn6a7z7yn9...

IanCal · on Jan 3, 2022

Very interesting! Is the music an intentional blended track or an artifact of generation?

_josh_meyer_ · on Jan 3, 2022

very much intentional.

Background music makes misuse/abuse less likely (both intentional and unintentional)

Read more about it in our open discussion: https://github.com/coqui-ai/TTS/discussions/1036

Gigachad · on Jan 4, 2022

I appreciate the effort here, but it almost feels like this is hopeless, as so many groups are able to build voice synthesis right now that the tech has fallen into the common person's hands, and some of them won't make any effort to stop abuse.

Maybe if we can get watermarked stuff out first and the average person gets up to speed with what tech can do, we can all adjust our expectations before the real wave of abuse hits.

marcan_42 · on Jan 4, 2022

You can probably run the output through Spleeter[1] and get rid of the background music very easily. Just throw more AI at the problem...

It's very hard to curb intentional misuse.

[1] https://github.com/deezer/spleeter

pcarolan · on Jan 3, 2022

This is incredibly impressive and does a great job of capturing my voice. Well done!

carbonx · on Jan 3, 2022

I just tried it and it sounds nothing at all like me. shrug

kubb · on Jan 3, 2022

Same here. It even mispronounced the basic French words, and inserted some background music similar to what you can hear on the CDs with exercises that come with those "foreign language for beginners" textbooks.

Ice_cream_suit · on Jan 3, 2022

Great opportunity for criminals and state actors to take identity theft to the next level.

graderjs · on Jan 4, 2022

Scoff...As if they didn't have this for 10 years already.

Ice_cream_suit · on Jan 4, 2022

Did they have your voice model so easily available, hosted on poorly secured servers, until you decided to try out this new free toy?

trompetenaccoun · on Jan 4, 2022

Professional criminals surely do have something like it already: https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-...

There is no reason to blame the creators, this is going to go mainstream one way or another.

martopix · on Jan 4, 2022

There is a video of me talking for an hour straight on youtube, for example.

matheist · on Jan 4, 2022

To help prevent malicious use, consider presenting the user with specific (randomly-generated) text to read aloud, and check (with speech to text) that they actually read that, instead of allowing them to say whatever they want.

That will help ensure that this is only being used by the person visiting the web page.

(That will only help with the hosted version, of course, not if you make the model code/weights available. I didn't generate this idea myself but also can't remember where I saw it. I think it was from someone offering a similar service.)
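
Roughly, the check could look like the sketch below. It assumes you already have a transcript string from whatever speech-to-text system you run on the upload; that step is not shown, and the 0.8 threshold and the function names are my own guesses, not anything the demo implements.

```python
import difflib
import re

def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()

def matches_prompt(prompt, transcript, threshold=0.8):
    """Return True if the transcript is close enough to the prompt.

    Compares word sequences, so a single misrecognized word does not
    fail the whole check, but an unrelated recording does.
    """
    matcher = difflib.SequenceMatcher(None, normalize(prompt), normalize(transcript))
    return matcher.ratio() >= threshold

prompt = "Amber river falcon, copper meadow lantern."
print(matches_prompt(prompt, "amber river falcon copper meadow lantern"))
print(matches_prompt(prompt, "I would like to withdraw all my money"))
```

Some tolerance is needed because speech-to-text is never perfect, but set it too low and a recording of someone else saying something vaguely similar gets through.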

throw453221 · on Jan 4, 2022

This will be great for foreign movies. While I still prefer subtitles, for those who watch with dubs, it’ll be amazing to hear the actor’s “real” voice.

perryizgr8 · on Jan 4, 2022

I spoke for about a minute in English, having no idea what is the ideal length for it to properly figure out my voice. The result sounded like someone else completely. There was also some strange music in the background, which made me think that it was playing back a recording of a real person speaking! A real person who's not me.

abel_ · on Jan 4, 2022

An interesting reflection is how quickly research around TTS/STT has progressed. I remember reading [0] thinking we were a long ways away. And things will get way better with multi-task learning and multi-modal learning in the coming years (or months really).

In fact, just a year after this post was written, CoquiAI started their open source projects [1].

[0] https://news.ycombinator.com/item?id=22869365 (https://thegradient.pub/towards-an-imagenet-moment-for-speec...)

[1] https://star-history.com/#coqui-ai/TTS&coqui-ai/STT

jeroenhd · on Jan 3, 2022

Interesting. I like the addition of music to make sure it's not just a raw voice sample. The output I get seems to be a mix of a native speaker and my voice, because my (thick) accent is being filtered out.

I suppose that if I ever take proper English pronunciation classes, I now know what to strive for.

reaperducer · on Jan 4, 2022

Off topic, but this reminded me: What ever happened to that thing that Google demoed where its robots would call restaurants and make reservations for you? Did that ever find its way into Android, or another product?

lern_too_spel · on Jan 4, 2022

I don't know who makes reservations these days, but it's available almost everywhere in the US now from Google Assistant. https://support.google.com/business/answer/7690269#zippy=%2C...

netman21 · on Jan 4, 2022

Tried it. Just a voice to text of French guy talking. Definitely not my voice.

garfieldnate · on Jan 6, 2022

I put my voice in in English and asked for English back. The output mostly sounded the same, but the interesting thing was that it had some music playing in the background, like the ambient kind you might hear in a YT video while a narrator talks.

text2db · on Jan 6, 2022

I recorded my voice in English and it converted it into French, but after the conversion I heard music in the audio once my voice had played. What was it?

Btw, the idea is really cool. It's like hearing how you would speak in the same tone in other languages.

Bichote · on Jan 4, 2022

Very cool project but just a little nitpick; maybe use a picture of a real Coqui frog on your site? https://en.m.wikipedia.org/wiki/Coqu%C3%AD

thejosh · on Jan 4, 2022

This is great, my wife actually thought it was me speaking for a moment when she first came in!

yosito · on Jan 4, 2022

Very cool! If I were looking for a side project, I'd extend this, add a DeepL integration for automating translations, add some voice models for other languages/people and wrap it as a mobile app where people could pay to unlock the voice models.

zuhayeer · on Jan 4, 2022

Pretty neat! Was poking around on the site, and under the hood the interface to upload and render the audio is powered by Gradio: https://gradio.app/

bagels · on Jan 3, 2022

Is there a static demo that I don't have to provide my own voice for?

kdavis · on Jan 3, 2022

We did not provide such a demo in part to hinder nefarious uses of the technology.

1f60c · on Jan 3, 2022

You could provide a demo with a fixed prompt that will (for example) always read the first paragraph of the Wikipedia article about avocados.

crumpled · on Jan 3, 2022

Honestly, how much of a hindrance is that? A person could just supply a recording of another person, couldn't they?

trump_tts · on Jan 3, 2022

Yes, it's pretty easy to play back a video as the input text and then generate a reasonable facsimile. Here's Trump: https://sndup.net/zkbg/

gruez · on Jan 3, 2022

but it looks like you provide the source code here? https://github.com/coqui-ai/tts. How much of a hindrance are you hoping to add?

BoorishBears · on Jan 3, 2022

Discourages low-hanging, hit-and-run usage that's likely to get their site shut down.

If someone wants to fake a statement there are already 100 ways to do it. Not making their servers the ones doing the deed puts a meaningful barrier in place for more casual misuse. And for serious cases like impersonation on a large scale, the resources are there to likely do better than this instant feedback model can.

xwdv · on Jan 3, 2022

Noble, but the genie will be out of that bottle soon enough, if not already.

reubenmorais · on Jan 3, 2022

The project page has a bunch of pre-rendered samples and ground truths: https://edresson.github.io/YourTTS/

bagels · on Jan 4, 2022

Thanks, this is what I meant.

Rhinobird · on Jan 4, 2022

My question is, how long until we have automatically dubbed anime?

var_cw · on Jan 4, 2022

Pretty soon. But won't you prefer subs over dubs for anime?

akeck · on Jan 3, 2022

Is it supposed to translate or just read with the target accent? For me, it's only reading the English input text with the target accent.

reubenmorais · on Jan 3, 2022

It doesn't translate the text, you have to put in text in the target language. But you can record audio speaking in any language you want.

rambojazz · on Jan 5, 2022

Can tools like this exist in offline mode only? Or do they require some high-power computers for the model?

dutchbrit · on Jan 4, 2022

Weird, my voice turned American...

raphman · on Jan 3, 2022

Nice! The first few seconds sound a lot like me. Afterwards, not so much.

mulholio · on Jan 4, 2022

Going to test this out on Hinge voice prompts and see what happens

milkers · on Jan 4, 2022

What kind of wizardry is this!? Congrats Coqui team!

momolo · on Jan 3, 2022

is the model available?

_josh_meyer_ · on Jan 3, 2022

Demo: https://coqui.ai

Code: https://github.com/coqui-ai/tts

Blogpost: https://coqui.ai/blog/tts/yourtts-zero-shot-text-synthesis-l...

Paper: https://arxiv.org/abs/2112.02418

echelon · on Jan 3, 2022

This is so cool! Thank you!

How do y'all intend to profit (succeed as a startup) if you're releasing so much publicly? I'd love to see you guys succeed.

Really great to see where some of the Mozilla TTS folks wound up, too.

ceva · on Jan 3, 2022

it says enter your text here ..

kdavis · on Jan 3, 2022

You're free to enter any input sentence you want in the text box.

The input sentence generally should be in the language you selected from the dropdown. For example, if the dropdown has "French" selected you could enter the text "Allons enfants de la Patrie, Le jour de gloire est arrivé!"

Clicking "Submit" then generates a TTS reading of the sentence you input in the language selected from the dropdown.

For fun you can mix and match. In other words, select a language from the drop down and enter text in the text box not in the language selected from the dropdown. (For example, the dropdown could have "French" selected and the sentence could be "O say can you see, by the dawn's early light". This gives interesting results, it sounds as if a native French speaker is speaking English.)

jijji · on Jan 4, 2022

what about a real time white english to ebonics/jive translator, when are we going to see this...

acqbu · on Jan 3, 2022

Gold!

wombatmobile · on Jan 3, 2022

Awesome!

How do I embed this?