Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

>It's not possible to mimic natural conversational rhythm with a separate LLM generating text, a separate text-to-speech generating audio, and a separate VAD determining when to respond and when to interrupt.

If you load the system prompt with enough assumptions that it's a speech-impared subtitle transcription that follows a dialogue you might pull it off, but likely you might need to fine tune your model to play nicely with the TTS and rest of setup



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: