But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak, especially if it comes from very old texts (and many public domain books are indeed quite old).
Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a fairly sophisticated setup to make the audio quality as clean as possible. That kind of setup is obviously unusual when people speak to their devices.
Third, the tone and cadence of read speech are different from those of spontaneous speech (the Common Voice dataset also has this problem, but they are coming up with ideas on how to prompt for spontaneous speech too).
But the goal of Common Voice was never to replace LibriSpeech or other open datasets (like TED talks) as training sets, but rather to complement them. You mention transfer learning. That is indeed possible. But it's also possible to simply combine several datasets and train on all of them from scratch. That is what Mozilla's DeepSpeech team has been doing since the beginning (the Hacks blog post from Reuben Morais linked above gives more context).
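To make the pooling concrete, here is a minimal sketch that merges several DeepSpeech-style CSV manifests (columns wav_filename, wav_filesize, transcript) into one training manifest. The file names are placeholders, not real dataset paths:

```python
import pandas as pd

# Hypothetical manifest paths; each is a DeepSpeech-style CSV with
# columns wav_filename, wav_filesize, transcript.
manifests = ["common_voice_train.csv", "librispeech_train.csv", "ted_train.csv"]

# Concatenate all datasets into a single training manifest.
combined = pd.concat([pd.read_csv(m) for m in manifests], ignore_index=True)

# Shuffle so batches mix samples from every corpus.
combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)
combined.to_csv("combined_train.csv", index=False)
```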
> Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a fairly sophisticated setup to make the audio quality as clean as possible. That kind of setup is obviously unusual when people speak to their devices.
It shouldn't be that hard to degrade the quality synthetically, should it? With a clean source you can synthesize different types of noise and distortion.
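For instance, a minimal sketch of the kind of degradation pipeline I mean, using numpy and scipy (the SNR, filter band, and file names are arbitrary choices for illustration, not tuned values):

```python
import numpy as np
from scipy import signal
from scipy.io import wavfile

# Load a clean mono recording (16-bit PCM assumed; file name is a placeholder).
rate, clean = wavfile.read("clean_speech.wav")
clean = clean.astype(np.float32) / 32768.0

# 1. Additive white noise at a chosen SNR (here ~10 dB).
snr_db = 10.0
noise = np.random.randn(len(clean)).astype(np.float32)
noise *= np.sqrt(np.mean(clean**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
degraded = clean + noise

# 2. Band-limit to mimic a cheap microphone (300 Hz - 3.4 kHz, telephone-like).
b, a = signal.butter(4, [300, 3400], btype="bandpass", fs=rate)
degraded = signal.lfilter(b, a, degraded)

wavfile.write("degraded_speech.wav", rate,
              (np.clip(degraded, -1, 1) * 32767).astype(np.int16))
```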
I can't speak for voice data, as I've not worked with voice, but I did my MSc on various approaches to reducing error rates in OCR. I used a mix of synthetically degraded data, ranging from applying different kinds of digital noise to physically degrading printed pages (crumpling, rubbing sand on them, water damage), and while that gave interesting comparative results between OCR engines, the types of errors I got never closely matched the errors I got from genuinely degraded old books. I've seen the same in other areas too.
My takeaway was that while synthetic degradation of inputs can be useful, and while it is "easy", the hard part is making it match real degradation closely enough to be representative. It's often very hard to replicate natural noise faithfully enough for those methods to be sufficient.
That doesn't mean it's not worth trying, but unless voice is very different, it's the kind of thing that's mostly worth doing when you can't get your hands on anything better.
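For what it's worth, the synthetic side of that experiment looked roughly like this sketch (numpy plus scipy.ndimage; the parameters are illustrative, not the values I actually used):

```python
import numpy as np
from scipy import ndimage
from PIL import Image

# Load a scanned page as grayscale (placeholder file name).
page = np.asarray(Image.open("page.png").convert("L"), dtype=np.float32)

# Gaussian blur to mimic a slightly out-of-focus scan.
degraded = ndimage.gaussian_filter(page, sigma=1.5)

# Additive Gaussian noise for sensor/print grain.
degraded += np.random.normal(0, 15, degraded.shape)

# Salt-and-pepper specks like dust or toner dropout.
mask = np.random.rand(*degraded.shape)
degraded[mask < 0.005] = 0
degraded[mask > 0.995] = 255

Image.fromarray(np.clip(degraded, 0, 255).astype(np.uint8)).save("page_degraded.png")
```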
You might say that if you could identify and simulate all cases of real-life degradation, your problem would basically be solved: just invert the simulation on your inputs.
I'm not saying OCR isn't hard. I'm saying that normalizing all those characters basically is the problem.
> With a clean source you can synthesize different types of noise and distortion.
You'd have to know what the most common types of noise are and how they interact with the signal. Collecting data this way can provide useful information about what that noise actually is.
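As a sketch of what "learning what the noise actually is" could look like: estimate the noise's power spectral density from non-speech stretches of real recordings. The silence-detection threshold and file name here are arbitrary assumptions:

```python
import numpy as np
from scipy import signal
from scipy.io import wavfile

rate, audio = wavfile.read("real_world_recording.wav")  # placeholder file name
audio = audio.astype(np.float32) / 32768.0

# Crude silence detection: frames whose RMS falls well below the average
# are assumed to contain only background noise.
frame = rate // 10  # 100 ms frames
frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
rms = np.sqrt(np.mean(frames**2, axis=1))
noise_frames = frames[rms < 0.3 * rms.mean()]  # threshold is a guess

# Average power spectral density of the noise-only frames.
freqs, psd = signal.welch(noise_frames.ravel(), fs=rate, nperseg=1024)
print("Estimated noise PSD peak at %.0f Hz" % freqs[np.argmax(psd)])
```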
> the tone and cadence of read speech are different from those of spontaneous speech
I don't think most people speak to their phone the same way they normally speak.
For example, I always speak slowly, with perfect pronunciation and intonation when talking to Siri.
> But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak,
The problem with the 'problem' you're describing is that the scope of speech recognition is being defined too narrowly.
If all you care about is creating an open source Alexa/Siri knockoff, then yes, you need to recognize conversational speech and not much else. But what if you do want to recognize scripted, rehearsed speech? What if you want a speech recognizer that can auto-transcribe movies, news broadcasts, or indeed audiobooks? Wouldn't it be nice if all audiobooks came with aligned text? That's an experience you can get right now with Kindle/Audible, but as far as I'm aware no FOSS ebook reading software supports it. If I have a public domain reading of Tom Sawyer from LibriVox and a text copy of Tom Sawyer from Project Gutenberg, how many hoops do I currently have to jump through to get the speech highlighted on my screen as the audiobook plays?
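For what it's worth, a forced aligner like aeneas gets you most of the way there today. A minimal sketch of that route (the paths are placeholders), which writes a sync map of start/end times per text fragment:

```python
# Forced alignment with the aeneas library (pip install aeneas).
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One task: align an audio file with a plain-text transcript.
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = u"/path/to/tom_sawyer.mp3"   # placeholder paths
task.text_file_path_absolute = u"/path/to/tom_sawyer.txt"
task.sync_map_file_path_absolute = u"/path/to/syncmap.json"

# Run the alignment and write out the sync map.
ExecuteTask(task).execute()
task.output_sync_map_file()
```

An ebook reader could then use the sync map's timestamps to highlight each fragment as the audio plays.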
Recognizing all forms of speech should be the goal, not just one narrow albeit trendy sliver of speech.