I was thinking of using aspiring scriptwriters and playwrights to do this. Plays and films are (nominally) word-for-word with the script, but are delivered in a way that needs to be convincing to the audience.
Similarly, plays and scripts strike me as good training data for the backend, as they mimic natural speech patterns (as opposed to novels or scientific research articles). Markov once again becomes your friend here.
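To make the Markov remark concrete, here is a minimal sketch of a word-level Markov chain built from lines of dialogue. The three dialogue lines are made up for illustration, and `build_chain` and `generate` are hypothetical helper names, not part of any library; a real backend would train on a full script corpus and probably use a higher order.

```python
import random
from collections import defaultdict


def build_chain(lines, order=1):
    """Map each state (a tuple of `order` words) to the words seen right after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain


def generate(chain, start, max_words=12, seed=None):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    state = tuple(start)
    out = list(state)
    for _ in range(max_words - len(out)):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state never had a successor in the data
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


# Made-up dialogue lines standing in for a script corpus (assumption).
dialogue = [
    "where are you going tonight",
    "where are we going now",
    "you are going home tonight",
]

chain = build_chain(dialogue)
print(generate(chain, ["where"], seed=0))
```

Repeated followers stay in the list, so common word pairs are picked more often, which is roughly why speech-like text such as scripts should give speech-like output.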
The arts get a little funding and some recognition, and some company gets to collect audio gold for speech recognition and synthesis.
Couple that with a couch potato concept, and you might have something fun to hack on.
At least for machine translation, a similar approach has been tried with the Europarl parallel corpus. Google Translate scores high against this set, afaik.
Oh, but you could also make the program translate! High-quality subtitles are at least more easily available than the on-demand captions you see on live newscasts.
In a number of countries, native-tongue subtitles on English TV shows are the norm. They're usually pretty good. So what you would have is a way for the program to translate English to the native tongue on the fly.