
Interestingly, some of the robot styles take a very obvious and dramatic fake breath. I say "fake" since a robot doesn't need to breathe and it's not exactly considered a phoneme. The fake breaths don't really make the robot sound more convincing.

When you listen to the first example, labelled "Narrative", you can tell where a human speaker would have inhaled, which is something the AI could have picked up from copious training data, though the inhale itself could have been muted in post-editing. One such spot comes after the long 24-word first phrase[1] ending in "special magnificence", and another at the end of the sentence. It could just be the way the AI reads the comma, but it is very convincing.

The "News" and "Conversational" examples don't include that pause effect. In the cerulean monologue, there is no pause after "for instance" even though the script has a comma there.

However, the robot takes a deep dramatic breath after the words "I see"[2]: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet and you select I don't know that lumpy blue sweater for instance because you're trying to tell the world that you take yourself..." There is no pause at the comma around "for instance", though the script has one. I decided to check whether the robot is just copying the original film exactly, and that's not it either.[3]

Comparison:

    Robot: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet [no breath] and you select I don't know that lumpy blue sweater for instance [QUICK HALF BREATH BY ROBOT] because you're trying to tell the world [no breath] that you take yourself too seriously to care about what you put on your back but [no breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."

    Original: "Oh, okay. I see [no breath] you think this has nothing to do with you. [loud long breath] You go to your closet [breath] and you select I don't know that lumpy blue sweater for instance [no breath] because you're trying to tell the world that you [breath] take yourself too seriously to care about what you put on your back but [breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."

    Text: "Oh, okay. I see, you think this has nothing to do with you. You… go to your closet, and you select… I don’t know, that lumpy blue sweater for instance, because you’re trying to tell the world that you take yourself too seriously to care about what you put on your back, but what you don’t know is that that sweater is not just blue, it’s not turquoise, it’s not lapis, it’s actually cerulean."

I've annotated the breaths in the "conversational" robot sample vs the original film:

                     Robot                  Original                Same/different?
     I see...        [Loud breath]          [no breath]             Different
     with you...     [Loud quick breath]    [loud long breath]      Similar
     your closet...  [no breath]            [breath]                Different
     for instance... [quick half breath]    [no breath]             Different
     that you...     [no breath]            [breath]                Different
     back but...     [no breath]            [breath]                Different

The robot's loud dramatic breath is unmistakable, but it's clearly not copying the source exactly, since its breaths fall in different places.

[1] The text is here: https://www.nytimes.com/2001/11/19/books/chapters/the-lord-o...

[2] The text is here: https://artdepartmental.com/blog/devil-wears-prada-cerulean-...

[3] https://www.youtube.com/watch?v=us52N76XA28&t=1m24s



(ElevenLabs dev here) The way the generative voices sound is very much a function of the training data, sampling, and interpolation, as you also pointed out. A lot of the training data involves deep breaths, which is why the synthesized voice has them too, albeit sometimes at different times than a human would. Punctuation is the biggest influence on where those pauses happen.

Users so far have found it enjoyable to listen to, and have found the breathing and pauses accurate!


I agree - the pauses in the first sample called "Narration" are incredibly accurate and pleasant to listen to.

As a developer, can you tell the difference between "Narration" and the human speaker? What can we listen for, or what gives it away? I listened to the "Narration" clip many times and, as a native British English speaker also confirms in another comment, it seems very difficult or impossible to tell that the first clip is generated. Congratulations on such an achievement!


I noticed a breath in the demo audio in the linked article and while it stood out, I was impressed by it rather than thinking it felt forced. I'm sure if I listened to enough AI voice it would stand out more and feel forced.


Did you find the whole clip it was in convincing? I didn't even notice the breath, but the entire second and third clips felt obviously AI-generated. The first clip, though, sounded absolutely real (maybe with some compression artifacts; see my other comment).

Later, when I went back and listened carefully for why the first clip felt so "real", I noticed it had pauses. (No breaths per se, but they are sometimes removed from edited audio.) However, I then noticed that the conversational clip, which felt unnatural to me, had very obvious breaths. The overall effect of the conversational clip didn't sound like a human at all. It sounded like an AI.

Did you find the whole conversational clip "convincing"? (Did it sound like a human to you?) How about the narration clip?



