I was blown away by how impressive it was. I honestly thought it was real. I sti...

JonathanFly · 2024-09-30T04:40:20 1727671220

Bark can sound as good, but Google is using SoundStorm which was specifically trained on dialogs. Surprisingly Bark can even sort of match it without being trained to do so, but not reliably. (https://x.com/jonathanfly/status/1675987073893904386)

And SoundStorm has more than twice the context window of Bark so dialogs are a tight fit.

ranger_danger · 2024-09-30T04:52:48 1727671968

I just tried the default bark.cpp example from the github readme, and to me it still doesn't sound close enough to realistic, and the audio quality itself was a bit scratchy... maybe I'm doing something wrong.

When I tried my own text with it, it went completely off the rails... skipping completely over random words, and also switching to different voices in the middle of a sentence. Trying to run the large model also crashed entirely.

JonathanFly · 2024-09-30T06:01:57 1727676117

You aren't doing anything wrong - Bark out of the box uses a randomly generated voice, and I like to think it's modeling the world of random voices, which includes bad microphones/audio quality. (Even bad 'actors' - see how many Bark voices sound like they are reading a script.)

Presumably it was trained on noisy data. But it can generate and use a clean voice; they are in there. Most of the Suno default voices are not great either - but a great voice can sound perfectly clear. I haven't done much with Bark lately, but on my Twitter there are plenty of clear examples of very realistic voices. Actually, here I ran a prompt based on some copied and pasted text 20 times in Bark. I put a couple of the better results up front, but even in later samples you can find lots of evidence of human-sounding voices. https://sndup.net/bzhz5/

Going off the rails and hallucinating is a hard problem. It can be minimized, but it would probably have to be solved with simple brute force (check the output with S2T and retry if needed).
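
A rough sketch of that brute-force loop, in case it helps. Nothing here is Bark's or bark.cpp's actual API: `synthesize` and `transcribe` are hypothetical callables you would wire up to your TTS and speech-to-text of choice, and the 0.9 threshold and word-level similarity measure are just assumptions to show the idea.

```python
import re
from difflib import SequenceMatcher
from typing import Any, Callable, List, Tuple


def _words(text: str) -> List[str]:
    # Lowercase and drop punctuation so "Hello, world." matches "hello world".
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(expected: str, heard: str) -> float:
    # Word-level similarity in [0, 1]; 1.0 means the word sequences are identical.
    return SequenceMatcher(None, _words(expected), _words(heard)).ratio()


def generate_checked(
    text: str,
    synthesize: Callable[[str], Any],   # hypothetical: text -> audio
    transcribe: Callable[[Any], str],   # hypothetical: audio -> text (S2T)
    threshold: float = 0.9,             # assumed cutoff, tune for your S2T model
    max_tries: int = 5,
) -> Tuple[Any, float]:
    # Regenerate until the transcript is close enough to the prompt,
    # keeping the best attempt seen so far as a fallback.
    best_audio, best_score = None, -1.0
    for _ in range(max_tries):
        audio = synthesize(text)
        score = similarity(text, transcribe(audio))
        if score > best_score:
            best_audio, best_score = audio, score
        if score >= threshold:
            break
    return best_audio, best_score
```

Comparing whole words rather than characters means a skipped word costs more than a small spelling difference in the transcript, which seems closer to the failure mode described above. If no attempt clears the threshold, you still get the closest one back along with its score.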

For raw audio you can replace the final decoding step with something like VOCOS or MBD if you want to maximize audio quality, though you don't need to with the best voices.