Whenever I walk my dog, I find myself wishing a conversational layer for LLMs existed in its best form. LLMs are now great at conversation, but the connective tissue between the LLM and natural dialog needs a lot of work.
Some of the problems:
- Voice systems now (including the ChatGPT mobile app) cut you off at times when a human would not, based on how long you pause. If you said, "I think I'm going to...[3 second pause]", then LLMs stop you, but a human would wait
- No ability to interrupt them with voice only
- Natural conversationalists tend to match one another's speed, but these systems' speeds are fixed
- Lots of custom instructions needed to change from what works in written text to what works in speech (no bullet points, no long formulas)
On the other side of this problem is a super smart friend you can call on your phone. That would be world-changing.
Yeah. While I like the idea of live voice chat with an LLM, it turns out I’m not so good at getting a thought across without pauses, and that gets interpreted as the LLM’s turn to respond. I’d need to be able to turn on a magic spoken word like “continue” for it to be useful.
pyryt posted https://arxiv.org/abs/2010.10874, which might be helpful here, but we'll probably end up with personalized models that learn from conversation styles. A magic stop/processing word would be the easiest to add since you already have the transcript, but it takes away the natural feel of a conversation.
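To make the "magic word" idea concrete, here is a minimal sketch of checking a running transcript for a trailing stop word. The word "over" and the function name `split_on_stop_word` are made up for illustration; nothing here comes from a real voice product.

```python
import re

STOP_WORD = "over"  # hypothetical magic word, chosen only for this sketch


def split_on_stop_word(transcript: str, stop_word: str = STOP_WORD):
    """Return (utterance, done).

    done is True only when the transcript ends with the stop word
    (optionally followed by punctuation or whitespace). The stop word
    itself is stripped from the returned utterance.
    """
    pattern = re.compile(
        r"\b" + re.escape(stop_word) + r"[\s.!?,]*$", re.IGNORECASE
    )
    match = pattern.search(transcript)
    if match is None:
        # No stop word yet: keep listening, however long the pause is.
        return transcript.strip(), False
    return transcript[: match.start()].strip(), True


print(split_on_stop_word("I think I'm going to"))
print(split_on_stop_word("I think I'm going to the park. Over."))
```

The downside mentioned above still applies: the user has to remember to say the word, which is exactly what makes it feel less like a conversation.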
Good point. Another area we are currently looking into is predicting intention: often, when talking to someone, we have a good idea of what that person might say next. That would not only help with latency but also allow us to give better answers and load the right context.
I think the Whisper models need to predict end-of-turn (EOT) based on content. And if more input still arrives after the EOT, the system can just drop the LLM generation and start over at the next EOT.
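A rough sketch of that drop-and-restart loop, with a toy word-list heuristic standing in for a real content-based EOT predictor. `looks_finished`, `TurnManager`, and the `TRAILING_OFF` list are all hypothetical names for illustration; this is not how Whisper or any existing system works.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Words that suggest the speaker is mid-thought (toy list, an assumption).
TRAILING_OFF = {"to", "and", "but", "so", "because", "the", "a", "or", "um", "uh"}


def looks_finished(text: str) -> bool:
    """Toy content-based end-of-turn check; a learned model would go here."""
    stripped = text.rstrip()
    if not stripped or stripped.endswith("..."):
        return False
    words = stripped.lower().rstrip(".!?").split()
    return bool(words) and words[-1] not in TRAILING_OFF


@dataclass
class TurnManager:
    buffer: List[str] = field(default_factory=list)
    pending: Optional[str] = None  # prompt of the in-flight generation, if any
    dropped: int = 0               # how many generations were thrown away

    def on_speech(self, chunk: str) -> None:
        if self.pending is not None:
            # Input arrived after a predicted EOT: drop the generation.
            self.pending = None
            self.dropped += 1
        self.buffer.append(chunk)
        text = " ".join(self.buffer)
        if looks_finished(text):
            # Predicted EOT: (re)start generation on everything heard so far.
            self.pending = text


manager = TurnManager()
manager.on_speech("I think I'm going to...")  # trailing off, so no generation
manager.on_speech("take the dog out")         # looks finished, generation starts
manager.on_speech("after lunch")              # more input: drop and restart
print(manager.pending, manager.dropped)
```

The sketch never clears the buffer once a reply is actually delivered, and it ignores pause length entirely; a real system would presumably combine both signals.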