I wonder if someone will develop a "The quick brown fox jumped over the lazy dog" for English pronunciation: something you could read aloud that would cover all the sounds needed to build something like this.
It'd be a cool graduate project... kinda wish I was into linguistics right now.
It would probably be several paragraphs long, at the shortest. Depending on accent and cultural upbringing, people vary how they pronounce a phoneme based on nearby sounds, words, or even whole sentences.
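For a rough sense of what "covering all the sounds" would even mean, here is a minimal sketch. It assumes a local copy of the CMU Pronouncing Dictionary in its usual "WORD PH1 PH2 ..." format (the file path is hypothetical) and treats stress-free ARPAbet symbols as the phoneme inventory; it just checks which phonemes a passage actually hits:

```python
# Rough sketch: count which ARPAbet phonemes a passage covers, using a local
# copy of the CMU Pronouncing Dictionary. The path and line format are assumptions.

import re

def load_cmudict(path="cmudict.dict"):              # hypothetical local file
    pron = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if not line.strip() or line.startswith(";;;"):   # skip blanks/comments
                continue
            word, *phones = line.split()
            word = re.sub(r"\(\d+\)$", "", word)     # drop alternate-pronunciation markers
            # strip stress digits so AH0 and AH1 both count as the phoneme AH
            pron.setdefault(word.lower(), [re.sub(r"\d", "", p) for p in phones])
    return pron

def phoneme_coverage(text, pron):
    all_phones = {p for phones in pron.values() for p in phones}
    seen = set()
    for word in re.findall(r"[a-z']+", text.lower()):
        seen.update(pron.get(word, []))
    return seen, all_phones - seen

if __name__ == "__main__":
    pron = load_cmudict()
    seen, missing = phoneme_coverage(
        "The quick brown fox jumped over the lazy dog", pron)
    print(f"covered {len(seen)} phonemes, missing: {sorted(missing)}")
```

Running it on the classic pangram would show how many phonemes a letter-oriented sentence still misses, though as noted above a real "phonetic pangram" would also need to cover phonemes in varied contexts, not just once each.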
I am very annoyed by the current brute-force, heuristic approaches to synthesizing human speech. I wish the sounds were dynamically computerized by way of mechanical simulations of the anatomical parts involved in human speech articulation.
> I wish the sounds were dynamically computerized by way of mechanical simulations of the anatomical parts involved in human speech articulation.
Actually, I think that's what was attempted in the first place, back in the early '80s. I remember TV shows and museum exhibits that demonstrated this approach. One I especially remember, and which really dates the effort, used a vector display (think of the original Tempest and Asteroids arcade games) to project a silhouette of the tongue and vocal cavity, showing listeners how the current phoneme was being generated.
Of course, back then, such simulacra were limited by the lack of parallel processing power and an inadequate understanding of biophysics. That led to today's brute-force "sound sampling" approach as memory became cheaper and audio capture hardware was perfected. I do wonder if it's time to return to vocal-anatomy modeling, now that we have a better understanding of how to do biomechanical and physical modeling via massive computational parallelism.
I imagine the reason progress on this model has been slow is how extremely challenging the task is. It would require solid knowledge of linguistics, physics, computer programming, etc. The sampling model, in contrast, is a piece of cake.
The anatomical model does sound very interesting. Each phoneme would be treated as a single unit to which intonation and dynamic effects could be applied algorithmically, and much of the progress in this area could be reused by speech recognition models, probably improving their accuracy considerably.
I really hope some serious contenders step up to the plate for this.
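Nothing below is a real articulatory simulation, but a minimal source-filter sketch gives a feel for what "generate the phoneme from a model instead of a recording" looks like: a glottal-like pulse train is shaped by a few formant resonators. The formant frequencies and bandwidths are just textbook ballpark values for an /a/-like vowel; every number here is an assumption.

```python
# Minimal source-filter sketch (assumptions throughout): an impulse train
# standing in for the glottal source is passed through second-order resonators
# approximating the first three formants of an /a/-like vowel.

import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate in Hz

def resonator(signal, freq, bandwidth, sr=SR):
    """Apply a single two-pole resonator (one 'formant') to the signal."""
    r = np.exp(-np.pi * bandwidth / sr)
    theta = 2 * np.pi * freq / sr
    a = [1, -2 * r * np.cos(theta), r * r]   # denominator coefficients
    b = [1 - r]                              # rough gain normalization
    return lfilter(b, a, signal)

def vowel(duration=0.5, pitch=110, formants=((700, 130), (1200, 70), (2600, 160))):
    """Synthesize an /a/-like vowel: impulse-train source + cascaded formant filters."""
    n = int(duration * SR)
    source = np.zeros(n)
    source[::int(SR / pitch)] = 1.0          # one impulse per pitch period
    out = source
    for freq, bw in formants:
        out = resonator(out, freq, bw)
    return out / np.max(np.abs(out))         # normalize to [-1, 1]

if __name__ == "__main__":
    samples = vowel()
    print(f"synthesized {len(samples)} samples at {SR} Hz")
```

Intonation and dynamics would then be a matter of varying `pitch` and the formant parameters over time, rather than splicing recorded samples together.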
NPR's All Things Considered has a short interview with the CTO of CereProc: http://www.npr.org/templates/story/story.php?storyId=1240872... The voices aren't perfect but they're definitely better than anything else I've heard. Ebert's voice isn't demoed in the interview. I'm guessing Oprah wants to be the first to show it.
This combined with improved subvocal stuff like http://www.youtube.com/watch?v=xyN4ViZ21N0 would make silent, covert voice communication possible. No more annoying one-sided conversations from cell phones.
Isn't this mainly an aesthetic thing, though? Consumer text-to-speech is certainly adequate for vocalizing almost any conversation. One could just use that to carry out "covert voice communication."
At the end of the YouTube video, he mentions being able to think "nearest bus" and having it query the internet and speak the results to you. This would let you augment reality without pulling out your phone, unlocking it, launching Google Maps, selecting your location, etc. Sure, it sounds like the flying-car dreams of the last century, but given that they can recognize 150 words now, it has a lot of future potential.
Right after Ebert mentioned Alex, I stopped reading, selected the text of the article, went to OS X's Services menu, and listened to Alex read the rest of it.
It reminded me how far consumer voice synthesis still has to come, but it also gave me a better appreciation of some of the subtler things the Alex voice does with intonation. Despite still sounding clearly synthetic, it's obviously doing quite a bit of analysis of the sentence structure to vary the pitch in a natural way.
But that makes me wonder: Why, when complex things like structural intonation are already in consumer TTS products, do (deceptively) simple things like consonant sounds and pacing still sound so stilted?
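The Services-menu trick can also be scripted; here's a minimal sketch, assuming macOS with the built-in `say` command and the Alex voice installed:

```python
# Minimal sketch: have macOS's built-in `say` command read text aloud with the
# Alex voice (assumes macOS and that the Alex voice is installed).
import subprocess

def speak(text, voice="Alex", rate_wpm=180):
    # `-v` selects the voice, `-r` sets the speaking rate in words per minute.
    subprocess.run(["say", "-v", voice, "-r", str(rate_wpm), text], check=True)

speak("It reminded me how far consumer voice synthesis has yet to come.")
```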
Very cool technology; I can't wait to hear how it compares to his real voice. The only downside, of course, is that normal folks don't have isolated audio of themselves speaking. The company really needs to figure out how to do the isolation themselves, so home movies and voicemails could be used without the background noise affecting quality.
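How they would actually do that isn't described, but as a naive sketch of one common approach, spectral subtraction estimates the noise spectrum from a stretch of the recording assumed to contain no speech and subtracts it everywhere. The noise-only assumption and every parameter below are illustrative, not anyone's actual pipeline:

```python
# Naive spectral-subtraction sketch: estimate the background-noise spectrum
# from the first stretch of the clip (assumed to be speech-free) and subtract
# it from the whole signal.

import numpy as np
from scipy.signal import stft, istft

def denoise(audio, sr, noise_seconds=0.5, floor=0.05):
    nperseg = 1024
    f, t, spec = stft(audio, fs=sr, nperseg=nperseg)
    mag, phase = np.abs(spec), np.angle(spec)

    # Average magnitude over the frames assumed to be noise-only
    # (an assumption about how the recording starts).
    noise_frames = max(1, int(noise_seconds * sr / (nperseg // 2)))
    noise_profile = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate, keeping a small spectral floor to avoid
    # the "musical noise" artifacts of zeroing bins outright.
    cleaned = np.maximum(mag - noise_profile, floor * mag)

    _, out = istft(cleaned * np.exp(1j * phase), fs=sr, nperseg=nperseg)
    return out
```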
For me, although it sounds better than in previous years, the text-to-speech offered by this CereProc company still sounds robotic. It's an area of interest for my start-up, as we chose to use real actors rather than a text-to-speech engine.
It seems to me that the robotic sound is exacerbated by the fact that we modify how we pronounce words based on which other words bookend them. How are you escaping this with live actors?