
I have a (probably dumb) question: why don't we just use audiobooks for this? There are thousands and thousands of hours where the transcripts were written, and then read aloud. Some of them are now public domain. I'm sure some validation would need to be done, but it seems like there would be an endless validation set there. Am I missing something obvious?



Audiobooks are definitely possible for ASR training. Indeed the largest open ASR training dataset before Common Voice was LibriSpeech (http://www.openslr.org/12/). Also note, the first release of Mozilla's DeepSpeech models were trained and tested with LibriSpeech: https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error...

But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak, especially when it comes from very old texts (and many public domain books are indeed quite old).

Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a somewhat sophisticated setup to make the audio quality as clean as possible. This type of setup is obviously unusual when people want to speak to their devices.

Third, the tone and cadence of read speech is different than that of spontaneous speech (the Common Voice dataset also has this problem, but they are coming up with ideas on how to prompt for spontaneous speech too).

But the goal of Common Voice was never to replace LibriSpeech or other open datasets (like TED talks) as training sets, but rather to complement them. You mention transfer learning. That is indeed possible. But it's also possible to simply put several datasets together and train on all of them from scratch. That is what Mozilla's DeepSpeech team has been doing since the beginning (you can read the above hacks blog post from Reuben Morais for more context there).


> Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a somewhat sophisticated setup to make the audio quality as clean as possible. This type of setup is obviously unusual when people want to speak to their devices.

It shouldn't be that hard to degrade the quality synthetically? And with a clean source you can synthesize different types of noise/distortions.
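
For the simple cases it isn't hard. A minimal sketch, assuming numpy/scipy and that the clean signal and a noise recording are already loaded as float arrays (the function and its defaults are purely illustrative):

    import numpy as np
    from scipy.signal import butter, lfilter

    def degrade(clean, noise, sample_rate, snr_db=10.0, cutoff_hz=4000.0):
        """Mix recorded noise into a clean signal at a target SNR, then
        low-pass filter to roughly mimic a cheap microphone."""
        noise = np.resize(noise, clean.shape)   # loop/trim the noise to the clean length
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
        noisy = clean + scale * noise
        b, a = butter(4, cutoff_hz / (sample_rate / 2.0), btype="low")
        return lfilter(b, a, noisy)

The hard part is knowing which noise types, levels, and channel effects actually match how people talk to their devices.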


I can't speak for voice data, as I've not worked with voice, but I did my MSc on various approaches for reducing error rates for OCR. I used a mix of synthetically degraded data, ranging from applying different kinds of noise to physically degrading printed pages (crumpling, rubbing sand on them, water damage), and while it gave interesting comparative results between OCR engines, the types of errors I got never closely matched the errors I got from genuinely degraded old books. I've seen that in other areas too.

My takeaway from that was that while synthetic degradation of inputs can be useful, and while it is "easy", the hard part is making it match real degradation closely enough to be representative. It's often really hard to replicate natural noise closely enough for it to be sufficient to use those kinds of methods.

Doesn't mean it's not worth trying, but I'd say that unless voice is very different it's the type of thing that's mostly worth doing if you can't get your hands on anything better.


You might say that if you can identify and simulate all cases of real-life degradation, your problem is basically solved: just reverse the simulation on your inputs.

I’m not saying OCR isn’t hard. I’m saying normalizing all those characters basically is the problem.


This isn't quite true if e.g. there are degenerate cases.


> And with a clean source you can synthesize different types of noise/distortions.

You'd have to know what the most common types of noise are, how they interact with the signal, etc. This method of collecting data can provide useful info on what that noise actually is.


> tone and cadence of read speech is different than that of spontaneous speech

I don't think most people speak to their phone the same way they normally speak. For example, I always speak slowly, with perfect pronunciation and intonation when talking to Siri.


I think the goal is to let you speak to your phone normally, not to continue having to speak like your phone is hard of hearing.


> But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak,

The problem with the 'problem' you're describing is that the scope of speech recognition is being defined too narrowly.

If all you care about is creating an open source Alexa/Siri knockoff, then yes, you need to recognize conversational speech and not much else. But what if you do want to recognize scripted, rehearsed speech? What if you want a speech recognizer that can auto-transcribe movies, news broadcasts, or in fact audiobooks? Wouldn't it be nice if all audiobooks came with aligned text? That's an experience you can get right now with Kindle/Audible, but as far as I'm aware no FOSS ebook reading software supports it. If I have a public domain reading of Tom Sawyer from LibriVox and a text copy of Tom Sawyer from Project Gutenberg, how many hoops do I currently have to jump through to get the speech highlighted on my screen as the audiobook plays?
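
The alignment step itself is doable today with an off-the-shelf forced aligner; the hoops are mostly in getting any FOSS reader to consume the result. A rough, untested sketch using the aeneas library, where every path is a placeholder and the config string is only what I recall from the aeneas docs:

    # pip install aeneas   (needs ffmpeg and espeak installed)
    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # English audio, plain-text transcript, JSON sync map as output
    task = Task(config_string=u"task_language=eng|is_text_type=plain|os_task_file_format=json")
    task.audio_file_path_absolute = u"/path/to/tom_sawyer_librivox.mp3"      # placeholder
    task.text_file_path_absolute = u"/path/to/tom_sawyer_gutenberg.txt"      # placeholder
    task.sync_map_file_path_absolute = u"/path/to/tom_sawyer_syncmap.json"   # placeholder

    ExecuteTask(task).execute()   # run the aligner
    task.output_sync_map_file()   # writes text fragments with begin/end timestamps

A reader would then just highlight whichever fragment's begin/end interval contains the current playback position.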

Recognizing all forms of speech should be the goal, not just one narrow albeit trendy sliver of speech.


When training speech recognition systems you want to use data that closely matches your target domain. Models trained on audiobooks read by professionals will not perform very well for transcribing conversational or spontaneous speech or if there is background noise.


But, if I understand correctly, systems can be trained separately on "this is background noise", apply those filters first, and then work with the cleaned audio, right? I've been using krisp.ai for a few weeks and it has been fantastic at doing exactly that in real-time.

Regarding conversational speech, I get that. Books are definitely not conversational.

I guess the next question though, would be: is the objective to build a model that understands all words, or conversational speech? <novice> It seems like transfer learning on a model trained on audiobooks and then conversations would still be a good path, right? </novice>


You're right, these issues can also be tackled independently. Transfer learning can help, but my first guess would be that it's hard to get reasonable accuracy (= usable for applications) without hundreds of hours of conversational data. You could also attempt to directly modify the audiobook data by manually adding noises or performing other distortions.
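
To make the transfer-learning part concrete, the usual recipe is to pre-train an acoustic model on the large read-speech corpus and then fine-tune only the upper layers on the small conversational set. A hypothetical PyTorch-style sketch, where the checkpoint name, layer prefix, loss, and data loader are all stand-ins rather than any real toolkit's API:

    import torch

    # Stand-in for a model pre-trained on read speech (audiobooks / LibriSpeech).
    model = torch.load("asr_pretrained_read_speech.pt")

    # Freeze the lower feature-extraction layers; they tend to transfer well.
    for name, param in model.named_parameters():
        if name.startswith("encoder.lower"):       # hypothetical layer naming
            param.requires_grad = False

    # Fine-tune the remaining layers on the (much smaller) conversational set.
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
    for epoch in range(5):
        for audio, transcript in conversational_loader:   # hypothetical DataLoader
            loss = model.loss(audio, transcript)           # hypothetical CTC-style loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The read-speech data does the heavy lifting for the lower layers; the conversational data only has to adjust the parts that are domain-specific.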

In any case, for read speech in particular there are several corpora out there already, including the moderately large LibriSpeech corpus (1000hr). The state-of-the-art accuracy on read speech is also very good -- for example, domain-specific dictation systems have been commercially viable for quite some time. So while it's true that audiobooks are a large untapped source, I think that there are other large-scale and richer options like YouTube or movies (i.e. videos with speech for which subtitles are available) that would be more useful to make progress towards good speech recognition systems.


> videos with speech for which subtitles are available

The subtitles often don't match what is spoken exactly.


Self-reply with more questions/thoughts. Based on what I know, it seems like the problem could break down as:

1. We have a lot of training data for the voices of white men reading stuff.
2. We have good models that already exist for removing background noise.
3. We might be able to build good models that could identify accents, gender, and age variation.
4. We have good models for style transfer that work in the audio domain.

Could we take an audiobook read by a white guy, use a style transfer model to give him a German accent, and then feed the German-accented version back in as training data for the speech recognition model? Could you use a reverse style transfer model to turn accented audio into non-accented audio (i.e. normalize it all to the place where we have the most training data)? Could we use a combination of style transfer models to vastly expand the training data set, and then train the conversational systems?

Or are the style transfer models not good enough? Or do we not have training data for style transfers to turn the voices of white men into the voices of white men with German accents?

I don't want to trivialize, but I'm genuinely curious: how are professionals actually trying to solve this now?


> I guess the next question though, would be: is the objective to build a model that understands all words, or conversational speech? <novice> It seems like transfer learning on a model trained on audiobooks and then conversations would still be a good path, right? </novice>

Understanding all words is not the problem. I don't know if it's universal, but frequently, a speech-to-text model is actually two models: A voice model (mapping raw audio to phonemes) and a language model (which models what the language looks like, i.e. what sentences are likely and which words exist). So if you want the STT system to understand novels, include novels in the training data for the language model. You can then combine it with a voice model suitable for conversational speech/the user's accent/background noise.
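
As a toy illustration of how the two are combined at decoding time (shallow fusion / rescoring), with both scoring functions as hypothetical stand-ins for the voice model and the language model:

    def pick_transcript(candidates, acoustic_logprob, lm_logprob,
                        lm_weight=0.5, word_bonus=1.0):
        """Pick the candidate that balances what the audio sounds like
        (voice model) against what is a plausible sentence (language model)."""
        def score(text):
            return (acoustic_logprob(text)             # hypothetical: log P(audio | text)
                    + lm_weight * lm_logprob(text)     # hypothetical: log P(text)
                    + word_bonus * len(text.split()))  # offsets the LM's bias toward short outputs
        return max(candidates, key=score)

Training the language model on novels only changes lm_logprob; the voice model stays whatever suits the acoustic conditions.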


Transfer learning is not guaranteed to work well. Most learned features, even in the first layers, usually look very different when trained in a clean environment. Background noise is not just a simple stationary signal; it includes very different audio patterns like music or other voices.


Audiobooks have a very small number of accents. Often they're recorded by middle-aged or older men with a standard accent (think newscasters and most actors).

One of the advantages of Common Voice is the breadth of voices. By having more, and more varied, voices, you can build a more robust system that works for more people.

For instance, my wife is a nonnative English speaker, and had trouble for years using Siri. Alexa can’t understand small children. And of course, elevators can’t understand Scottish accents. https://m.youtube.com/watch?v=NMS2VnDveP8


> Audiobooks have a very small number of accents.

Not so much the case on LibriVox. Accents, age and levels of voice professionalism vary greatly.


Speaking and reading out loud are vastly different. Try it yourself: record yourself explaining a concept with zero preparation and then read a text explaining the concept.

Also (I am not a lawyer but) just because the source material is out of copyright doesn't mean the audiobook is out of copyright.


The parent comment was presumably talking about the massive library of unencumbered audio books based on those books, such as those created by LibriVox.


Audiobook narrators enunciate (almost) perfectly. Real people don't.


You need to line up the timestamps for each spoken sentence with each written sentence. Some data sources, like Amazon's "Whisper Sync", can do this already.


Pretty sure it won't be nearly enough, both from the language-variety point of view and from the variety of voices (age, gender, tone, etc.).


While others think audiobooks are too clean, I definitely think you could do this and then tune with noisier data.

I’m also sure you could add subtitles for the hard of hearing as a very noisy dataset.


That might work in English, but reading and speaking aren't necessarily done the same way; that's going to completely fail in French, for example.



