Yup, and that's going to be the case until AIs can really model human psychology.
Speech encodes a gigantic amount of emotion via prosody and rhythm -- how the speaker is feeling, how they feel about each noun and verb, what they're trying to communicate with it.
If you try to reproduce all the normal speech prosody, it'll be all over the place and SoUnD bIzArRe and won't make any sense, and be incredibly distracting, because there's no coherent psychology behind it.
So "reading off a teleprompter" is really the best we can do for now -- not necessarily affectless, but a kind of "constant affect" that varies with grammatical structures and other language patterns, but with no real human psychology behind it.
It's a gigantic difference from text, which encodes vastly less information.
(And this is one of the reasons I don't see AI replacing actors for a looong time, not even voice actors. You can map a voice onto someone else's voice preserving their prosody, but you still need a skilled human being producing the prosody in the first place.)
What if you have it read the script, then say, “hey, at this point, what is the character feeling? What are they trying to accomplish? What is their relationship to each person in the scene?”
And then you get that and prompt the model to add inflection and pacing and whatever to the text to reflect that. You feed that into the speech model.
It seems like it could definitely do the first part (“based on this text, this character might be feeling X”); the second part (“mark up the dialogue”) seems easier; the third part about speech seems doable already based on another comment.
So we are pretty close already? Whatever actors are doing can be approximated through prompting, including the director iterating with the “actors”.
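The three-step pipeline described above could be sketched roughly like this (everything here is hypothetical: `ask_llm` is a stand-in for whatever chat-completion call you'd use, and `speak` for the speech model):

```python
def ask_llm(prompt: str) -> str:
    # Stand-in: a real implementation would call an actual chat model here.
    return f"[model response to: {prompt[:40]}...]"

def analyze_character(script: str, line: str) -> str:
    # Step 1: ask what the character is feeling, what they want,
    # and their relationship to everyone else in the scene.
    return ask_llm(
        f"Script:\n{script}\n\nAt the line {line!r}, what is the character "
        "feeling, what are they trying to accomplish, and what is their "
        "relationship to each person in the scene?"
    )

def mark_up_dialogue(line: str, analysis: str) -> str:
    # Step 2: rewrite the line with inflection/pacing markup reflecting
    # that analysis.
    return ask_llm(
        f"Given this analysis:\n{analysis}\n\nMark up the line {line!r} "
        "with inflection, emphasis, and pacing cues for a speech model."
    )

def speak(marked_up_line: str) -> bytes:
    # Step 3: feed the marked-up text to a speech model (stubbed here
    # as a plain encode; a real system would return synthesized audio).
    return marked_up_line.encode("utf-8")

script = "INT. KITCHEN - NIGHT\nANNA: I never said that."
analysis = analyze_character(script, "I never said that.")
audio = speak(mark_up_dialogue("I never said that.", analysis))
```

The director-iterating-with-the-“actors” part would just be another loop around steps 1 and 2, feeding the director's notes back into the analysis prompt.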
> What if you have it read the script, then say, “hey, at this point, what is the character feeling?...
Sure, but now how do you make sure all the answers to those questions are consistent? Across clauses, sentences, paragraphs? To do that, you need to have an entire understanding of human psychology.
And I haven't seen any evidence that LLMs possess that kind of knowledge at all, except at the most rudimentary level of narrative.
Just think of how even professional directors struggle to communicate to an actor the emotional and psychological feeling they're looking for. We don't even have words or labels for most of these things, so we say "you know how you feel in a situation when <a> and <b> but <c>? You know that thing? No, not that, but when <d>. Yeah, that." Most of this operates on an intuitive, pre-verbal level of thinking in our brain. I don't think LLMs are anywhere close to being able to capture that stuff yet.