
Blind person here. eSpeak-ng is literally what I use on all of my devices for most of my day, every day.

I switched to it in early childhood, at a time when human-sounding synthesizers were notoriously slow and noticeably unresponsive, and just haven't found anything better since. For a while I used Vocalizer, which is what iOS and macOS ship with, but when third-party synthesizer support was added I switched right back.



How fast do you set your speech rate?

I've tried a bunch of speech synthesizers, with speed and intelligibility in mind.

eSpeak-ng is barely intelligible past ~500 words per minute and just generally unpleasant to listen to. Maybe my brain just can't acclimatize to it.

Microsoft Zira Mobile (which can be unlocked on the Windows 11 desktop via regedit) sounds much more natural and intelligible at the maximum Windows SAPI speech rate, which I estimate is around 600 wpm, equivalent to most conversational/casual spoken word at 2x speed. I wish Windows could increase playback even further; my brain can process 900-1200 words per minute, or 3x-4x normal playback speed.

On Android, Google's "United States - 1" voice sounds a little awkward but is also intelligible at 3x-4x speed.


Similar to OP: if the information is low-density, like a legal contract, I can do 1200 wpm after a few hours of getting used to it. Daily normal is 600 wpm; if the text is heavy going enough I have to drop it down to 100 wpm and put it on loop.

As usual, the limit isn't how fast human I/O is but how fast human processing works.


Yeah, 600 wpm is passive listening. 900-1200 wpm is like listening to a lecture on YouTube at 3-4x speed: skim listening for content I'm familiar with, active listening for things I just want to speed through. It's context-dependent; I find I can ramp up from 600 to 1200 and get into a flow state of listening.

>text is heavy going enough I have to drop it down to 100 wpm

What is heavy text for you? Like very dense technical text?

>put it on loop

I find this very helpful as well, but for the content I consume, which is not very technical, I listen at ~600 wpm and loop it multiple times. It's like listening to a song to death: it ingrains the content on a vocal/storytelling level.

Edit: semi-related reply to a deleted comment about processing speed that I can no longer reply to. Posting here because it's related.

Some speech synthesizers are much more intelligible at higher speeds, which aids processing at higher wpm. What I've been trying to find is the most intelligible synthesis voice for the upper limit of concentrated/burst listening, which for me is around 1200 wpm / 4x speed; many voices have weird audio artifacts past 3x. There are synthesis engines whose high-speed intelligibility improves if the text is processed with SSML markup to add longer pauses after punctuation. Just little tweaks that make processing easier. It doesn't apply to all content or all contexts, but I think some kinds of consumption are suited to it. It's something that can be trained like many mental tasks, and dedicated speech synthesis, like fancy sports equipment, improves top-end performance.
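A rough sketch of the SSML pause trick described above, in Python. This is an illustration under assumptions, not any particular engine's preprocessing: the function name `to_ssml`, the pause lengths, and the 300% default rate are hypothetical values to tune by ear. It only emits standard SSML `<break>` and `<prosody>` elements, and how much of that markup a given engine honors varies.

```python
import re
from typing import Dict, Optional
from xml.sax.saxutils import escape

# Hypothetical pause lengths (milliseconds) per punctuation mark; tune by ear.
# Keys must be single characters (they go into a regex character class).
DEFAULT_PAUSES_MS: Dict[str, int] = {
    ".": 400, "!": 400, "?": 400,
    ";": 250, ":": 250,
    ",": 150,
}


def to_ssml(text: str, rate_percent: int = 300,
            pauses_ms: Optional[Dict[str, int]] = None) -> str:
    """Wrap plain text in SSML, adding an explicit <break/> after punctuation.

    rate_percent is an SSML 1.1 relative rate: 100 means the engine default,
    300 means roughly three times as fast.
    """
    pauses = DEFAULT_PAUSES_MS if pauses_ms is None else pauses_ms
    if not pauses:
        body = escape(text)
    else:
        # A mark only counts as punctuation when followed by whitespace or
        # end of text, so the "." inside "3.14" gets no pause.
        marks = "".join(re.escape(m) for m in pauses)
        pattern = re.compile(rf"([{marks}])(?=\s|$)")
        parts = []
        last = 0
        for m in pattern.finditer(text):
            # Escape the raw text (&, <, >) segment by segment, so the ";"
            # in an entity like "&amp;" is never mistaken for punctuation.
            parts.append(escape(text[last:m.end()]))
            parts.append(f'<break time="{pauses[m.group(1)]}ms"/>')
            last = m.end()
        parts.append(escape(text[last:]))
        body = "".join(parts)
    return f'<speak><prosody rate="{rate_percent}%">{body}</prosody></speak>'


print(to_ssml("Hello, world. A & B"))
# → <speak><prosody rate="300%">Hello,<break time="150ms"/> world.<break time="400ms"/> A &amp; B</prosody></speak>
```

The output can then be fed to any engine that accepts SSML; eSpeak-ng, for example, interprets SSML input when run with its `-m` flag. Inserting explicit breaks, rather than just slowing the overall rate, keeps words dense while still giving the listener a beat at clause and sentence boundaries.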

IMO this is also something a neural model could be tuned for. Some podcasters/audiobook narrators are "easy" listening at 3x speed compared to others because they just have better enunciation/cadence at the same word density. Most voices out there, from traditional SAPI models to neural ones, are... very mid as fast "narrators". I think speech synthesis needs to be bundled with content awareness: AI that filters the content, then synthesizes speech that emphasizes/slows down on significant information and breezes past filler, just presenting information more efficiently for consumption.


Thanks for the heads-up. May I ask if you know of any websites/articles that explain daily setups for blind people? I had issues that required me not to rely on sight, and I couldn't find much.



