
I was blown away by how impressive it was. I honestly thought it was real. I still can't believe these realistic audio capabilities are not being used for pure evil everywhere we look.

> like thrown into every sentence

I think that's actually part of why it sounds real, because tons of people do actually talk like that.

To me what would make it even better is the ability to throw in random jokes and utilize information about their surroundings and recent events.

I have been using MeloTTS for text-to-speech and I thought that was about the best we could do right now, but apparently I was very wrong. Is there an offline model one can download today that sounds as good as this NotebookLM?



Bark can sound as good, but Google is using SoundStorm which was specifically trained on dialogs. Surprisingly Bark can even sort of match it without being trained to do so, but not reliably. (https://x.com/jonathanfly/status/1675987073893904386)

SoundStorm also has more than twice the context window of Bark, so dialogs are a tight fit in Bark.


I just tried the default bark.cpp example from the GitHub readme, and to me it still doesn't sound close to realistic, and the audio quality itself was a bit scratchy... maybe I'm doing something wrong.

When I tried my own text with it, it went completely off the rails... skipping completely over random words, and also switching to different voices in the middle of a sentence. Trying to run the large model also crashed entirely.


You aren't doing anything wrong - Bark out of the box uses a randomly generated voice, and I like to think it's modeling the world of random voices, which includes bad microphones/audio quality. (Even bad 'actors' - see how many Bark voices sound like they are reading a script.)

Presumably it was trained on noisy data. But it can generate and use a clean voice; they are in there. Most of the Suno default voices are not great either, but a great voice can sound perfectly clear. I haven't done much with Bark lately, but on my Twitter there are plenty of clear examples of very realistic voices. Actually, here I ran a prompt based on some copy-and-pasted text 20 times in Bark. I put a couple of the better results up front, but even in later samples you can find lots of evidence of human-sounding voices. https://sndup.net/bzhz5/

Going off the rails and hallucinating is a hard problem. It can be minimized, but it would probably have to be solved with simple brute force (check the output with speech-to-text and retry if needed).
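
A minimal sketch of that brute-force loop, in case it helps. Everything named here is hypothetical: `synthesize` and `transcribe` are placeholders for whatever TTS and speech-to-text calls you actually use (they are not Bark or any other library's API), and the 10% word-error threshold and retry count are arbitrary assumptions. It generates audio, transcribes it, compares the transcript to the input text by word error rate, and retries until a take is close enough, keeping the best take seen.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so only the words are compared."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # word missing from transcript
                           cur[j - 1] + 1,           # extra word in transcript
                           prev[j - 1] + (r != h)))  # wrong word
        prev = cur
    return prev[-1] / len(ref)


def generate_checked(text, synthesize, transcribe, max_wer=0.1, max_tries=5):
    """Retry synthesis until the transcript matches the text closely enough.

    synthesize(text) -> audio and transcribe(audio) -> str are hypothetical
    callables supplied by the caller. Returns (audio, wer) for the best take.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    best = None
    for _ in range(max_tries):
        audio = synthesize(text)
        wer = word_error_rate(text, transcribe(audio))
        if best is None or wer < best[1]:
            best = (audio, wer)
        if wer <= max_wer:
            break
    return best
```

This only catches skipped or hallucinated words. A voice that changes mid-sentence would still pass, since the transcript can be correct anyway.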

For raw audio quality, you can replace the final decoding step with something like VOCOS or MBD if you want to maximize it, though you don't need to with the best voices.



