Anyone know why the default voice is set to be so bad?

mrob · on May 2, 2024

Why specifically do you consider it to be bad? Espeak-ng is primarily an accessibility tool, used as the voice synthesizer for screen readers. Clarity at high speed is more important than realism.

vlovich123 · on May 2, 2024

That can’t be a serious question. Go listen to the accessibility voice on Windows or Mac and compare how it sounds. Both of those are more human-like, with better pronunciation.

lomase · on May 3, 2024

To you.

rhdunn · on May 2, 2024

The default voice sounds robotic for several reasons. It has a low sample rate to conserve space. It is built using a mix of techniques that make it difficult to reconstruct the original waveform exactly. And it uses things like artificial noise for the plosives, etc.

The default voice is optimized for space and speed instead of quality of the generated audio.

vlovich123 · on May 2, 2024

I’ll suggest that’s the wrong optimization to make for an accessibility tool. Modern CPUs exceed its speed requirements by several orders of magnitude (they can decode H.265 in real time without HW acceleration, for god’s sake). The same goes for size.

It’s simply the wrong tuning tradeoff.

codedokode · on May 2, 2024

But today disk space is not an issue.

follower · on May 2, 2024

As I've learned over time (and as other people in these comments have clarified), evaluating the "quality" of text-to-speech depends somewhat on the domain in which the audio output is used. The domains overlap, obviously, but broadly:

* accessibility

* non-accessibility (e.g. voice interfaces; narration; voice over)

The qualities of the generated speech which are favoured may differ significantly between the two domains, e.g. AIUI non-accessibility focused TTS often prioritises "realism" & "naturalness", while more accessibility focused TTS often prioritises clarity at high words-per-minute speech rates (which often sounds distinctly non-"realistic").

And, AIUI espeak-ng has historically been more focused on the accessibility domain.

vlovich123 · on May 2, 2024

I don't have any disabilities, so I don't know if espeak-ng is better on the pure accessibility axis. But macOS tends to be received quite well by the accessibility crowd, & it's definitely a focus from what I observed internally. Given that macOS also has much higher realism & naturalness out of the box, I'm going to posit that it's not the linear tradeoff you've described & that espeak-ng's defaults aren't tuned well.