But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak, especially if it comes from very old texts (and many public domain books are indeed quite old).
Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a fairly sophisticated setup to make the audio quality as clean as possible. That kind of setup is obviously unusual when people speak to their devices.
Third, the tone and cadence of read speech are different from those of spontaneous speech (the Common Voice dataset also has this problem, but they are coming up with ideas on how to prompt for spontaneous speech too).
But the goal of Common Voice was never to replace LibriSpeech or other open datasets (like TED talks) as training sets, but rather to complement them. You mention transfer learning. That is indeed possible. But it's also possible to simply combine several datasets and train on all of them from scratch. That is what Mozilla's DeepSpeech team has been doing since the beginning (the Hacks blog post from Reuben Morais linked above gives more context).
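To make the pooling concrete, here is a minimal sketch that merges several DeepSpeech-style CSV manifests (columns wav_filename, wav_filesize, transcript) into one training manifest. The file names are placeholders, not real dataset paths:

```python
import pandas as pd

# Hypothetical manifest paths; each is a DeepSpeech-style CSV with
# columns wav_filename, wav_filesize, transcript.
manifests = ["common_voice_train.csv", "librispeech_train.csv", "ted_train.csv"]

# Concatenate all datasets into a single training manifest.
combined = pd.concat([pd.read_csv(m) for m in manifests], ignore_index=True)

# Shuffle so batches mix samples from every corpus.
combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)
combined.to_csv("combined_train.csv", index=False)
```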
> Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a fairly sophisticated setup to make the audio quality as clean as possible. That kind of setup is obviously unusual when people speak to their devices.
It shouldn't be that hard to degrade the quality synthetically, should it? With a clean source you can synthesize different types of noise and distortion.
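For instance, a minimal sketch of the kind of degradation pipeline I mean, using numpy and scipy (the SNR, filter band, and file names are arbitrary choices for illustration, not tuned values):

```python
import numpy as np
from scipy import signal
from scipy.io import wavfile

# Load a clean mono recording (16-bit PCM assumed; file name is a placeholder).
rate, clean = wavfile.read("clean_speech.wav")
clean = clean.astype(np.float32) / 32768.0

# 1. Additive white noise at a chosen SNR (here ~10 dB).
snr_db = 10.0
noise = np.random.randn(len(clean)).astype(np.float32)
noise *= np.sqrt(np.mean(clean**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
degraded = clean + noise

# 2. Band-limit to mimic a cheap microphone (300 Hz - 3.4 kHz, telephone-like).
b, a = signal.butter(4, [300, 3400], btype="bandpass", fs=rate)
degraded = signal.lfilter(b, a, degraded)

wavfile.write("degraded_speech.wav", rate,
              (np.clip(degraded, -1, 1) * 32767).astype(np.int16))
```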
I can't speak for voice data, as I've not worked with voice, but I did my MSc on various approaches to reducing error rates in OCR. I used a mix of synthetically degraded data, ranging from applying different kinds of digital noise to physically degrading printed pages (crumpling, rubbing sand on them, water damage), and while that gave interesting comparative results between OCR engines, the types of errors I got never closely matched the errors I got from genuinely degraded old books. I've seen the same in other areas too.
My takeaway was that while synthetic degradation of inputs can be useful, and while it is "easy", the hard part is making it match real degradation closely enough to be representative. It's often very hard to replicate natural noise faithfully enough for those methods to be sufficient.
That doesn't mean it's not worth trying, but unless voice is very different, it's the kind of thing that's mostly worth doing when you can't get your hands on anything better.
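For what it's worth, the synthetic side of that experiment looked roughly like this sketch (numpy plus scipy.ndimage; the parameters are illustrative, not the values I actually used):

```python
import numpy as np
from scipy import ndimage
from PIL import Image

# Load a scanned page as grayscale (placeholder file name).
page = np.asarray(Image.open("page.png").convert("L"), dtype=np.float32)

# Gaussian blur to mimic a slightly out-of-focus scan.
degraded = ndimage.gaussian_filter(page, sigma=1.5)

# Additive Gaussian noise for sensor/print grain.
degraded += np.random.normal(0, 15, degraded.shape)

# Salt-and-pepper specks like dust or toner dropout.
mask = np.random.rand(*degraded.shape)
degraded[mask < 0.005] = 0
degraded[mask > 0.995] = 255

Image.fromarray(np.clip(degraded, 0, 255).astype(np.uint8)).save("page_degraded.png")
```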
You might say that if you could identify and simulate all cases of real-life degradation, your problem would basically be solved: just invert the simulation on your inputs.
I'm not saying OCR isn't hard. I'm saying that normalizing all those characters basically is the problem.
> With a clean source you can synthesize different types of noise and distortion.
You'd have to know what the most common types of noise are and how they interact with the signal. Collecting data this way can provide useful information about what that noise actually is.
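As a sketch of what "learning what the noise actually is" could look like: estimate the noise's power spectral density from non-speech stretches of real recordings. The silence-detection threshold and file name here are arbitrary assumptions:

```python
import numpy as np
from scipy import signal
from scipy.io import wavfile

rate, audio = wavfile.read("real_world_recording.wav")  # placeholder file name
audio = audio.astype(np.float32) / 32768.0

# Crude silence detection: frames whose RMS falls well below the average
# are assumed to contain only background noise.
frame = rate // 10  # 100 ms frames
frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
rms = np.sqrt(np.mean(frames**2, axis=1))
noise_frames = frames[rms < 0.3 * rms.mean()]  # threshold is a guess

# Average power spectral density of the noise-only frames.
freqs, psd = signal.welch(noise_frames.ravel(), fs=rate, nperseg=1024)
print("Estimated noise PSD peak at %.0f Hz" % freqs[np.argmax(psd)])
```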
> the tone and cadence of read speech are different from those of spontaneous speech
I don't think most people speak to their phone the same way they normally speak.
For example, I always speak slowly, with perfect pronunciation and intonation when talking to Siri.
> But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak,
The problem with the 'problem' you're describing is that the scope of speech recognition is being defined too narrowly.
If all you care about is creating an open source Alexa/Siri knockoff, then yes, you need to recognize conversational speech and not much else. But what if you do want to recognize scripted, rehearsed speech? What if you want a speech recognizer that can auto-transcribe movies, news broadcasts, or indeed audiobooks? Wouldn't it be nice if all audiobooks came with aligned text? That's an experience you can get right now with Kindle/Audible, but as far as I'm aware no FOSS ebook reading software supports it. If I have a public domain reading of Tom Sawyer from LibriVox and a text copy of Tom Sawyer from Project Gutenberg, how many hoops do I currently have to jump through to get the speech highlighted on my screen as the audiobook plays?
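For what it's worth, a forced aligner like aeneas gets you most of the way there today. A minimal sketch of that route (the paths are placeholders), which writes a sync map of start/end times per text fragment:

```python
# Forced alignment with the aeneas library (pip install aeneas).
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One task: align an audio file with a plain-text transcript.
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = u"/path/to/tom_sawyer.mp3"   # placeholder paths
task.text_file_path_absolute = u"/path/to/tom_sawyer.txt"
task.sync_map_file_path_absolute = u"/path/to/syncmap.json"

# Run the alignment and write out the sync map.
ExecuteTask(task).execute()
task.output_sync_map_file()
```

An ebook reader could then use the sync map's timestamps to highlight each fragment as the audio plays.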
Recognizing all forms of speech should be the goal, not just one narrow albeit trendy sliver of speech.