I do a lot of dictation on mobile devices for work, with middling, and perhaps more importantly, frustrating results (needless to say we are working on programming our way out of that hole). It is an area ripe for open source progress given the failure of larger companies with large proprietary data sets to make basic common sense decisions in their transcription algorithms, together with no way to provide impactful feedback.
If anybody is interested, there is definitely a market for a more robust dictation library that can be integrated into apps and works offline. It just needs to be professional, e.g. allow for preferences including:
• the ability to indicate a strong preference for standard language and grammar over all slang
• not forcing Title Case for anything resembling a brand name
• a training mode for words and phrases of the user's choosing
• blacklisting of certain word or phrase results which are false positives
• proper learning from user corrections during use, so the tedium of correcting the same phrase 100 times disappears
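To make the last point concrete, here is a minimal sketch (in Python, with hypothetical class and file names, not from any existing library) of what learning from corrections could look like as a post-processing layer over whatever engine produces the raw transcript:

    # Hypothetical post-processing layer that remembers user corrections and
    # applies them to future transcripts.
    import json, re
    from pathlib import Path

    class CorrectionMemory:
        def __init__(self, store=Path("corrections.json")):
            self.store = store
            self.rules = json.loads(store.read_text()) if store.exists() else {}

        def learn(self, heard, corrected):
            # Record that the engine's output `heard` should have been `corrected`.
            self.rules[heard] = corrected
            self.store.write_text(json.dumps(self.rules, indent=2))

        def apply(self, transcript):
            # Rewrite every known false positive, case-insensitively.
            for heard, corrected in self.rules.items():
                transcript = re.sub(re.escape(heard), corrected, transcript,
                                    flags=re.IGNORECASE)
            return transcript

    memory = CorrectionMemory()
    memory.learn("common voyce", "Common Voice")   # user fixed this once
    print(memory.apply("The Common Voyce dataset is out."))
    # -> "The Common Voice dataset is out."

A real engine would need to bias its decoding rather than string-replace after the fact, but even this much would remove the worst of the repetition.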
There's also a huge need for better transcription software in radiology. Existing ones are expensive and are just not good enough to be an actual time-saver.
(A radiologist friend describes switching from a human medical transcriptionist to one of these "AI" software thingies as a cost-cutting measure by his hospital, or more accurately as a cost-offload measure. Hospital offloads salary, radiologist spends more time correcting stupid transcription mistakes for no extra pay).
I thought it would be faster to correct a transcript than type it up fresh. I can type pretty damn quickly, but still slower than the average person talks. I would rather follow along to an AI transcript and correct a few words than type it all new.
Problem is that transcription errors never involve a misspelling and usually sound OK in your inner voice. You have to really pay attention to find them (reading the text backwards a sentence at a time sometimes helps with this).
Imagine a professional “assisted stenography” software package. Rather than automatic full transcription + proof-reading after the fact, it would transcribe “live” but with any words where its model gave a low-confidence output highlighted with the expectation that the human stenographer, listening to the audio at the same time the machine is, will type in the correct transcription (or just hit tab to accept, like in autocomplete.)
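The core of that workflow is tiny if the recognizer exposes per-word confidences (many do). A rough sketch in Python, with a made-up (word, confidence) input format:

    # Flag low-confidence words for the human operator to confirm or retype.
    def render_for_review(words, threshold=0.80):
        parts = []
        for word, confidence in words:
            if confidence < threshold:
                parts.append(f"[{word}?]")   # highlighted: needs operator attention
            else:
                parts.append(word)
        return " ".join(parts)

    hypothesis = [("the", 0.99), ("patient", 0.97), ("denies", 0.95),
                  ("dyspnea", 0.41), ("on", 0.98), ("exertion", 0.88)]
    print(render_for_review(hypothesis))
    # -> the patient denies [dyspnea?] on exertion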
I would love it if I could always see the top three possible interpretations of what I just said to my phone. It's interesting watching the speech recognition think about what's being said and flip words around once it has enough context.
I have a (probably dumb) question: why don't we just use audiobooks for this? There are thousands and thousands of hours where the transcripts were written, and then read aloud. Some of them are now public domain. I'm sure some validation would need to be done, but it seems like there would be an endless validation set there. Am I missing something obvious?
But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak, especially if it comes from very old texts (and many public domain books are indeed quite old).
Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a somewhat sophisticated setup to make the audio quality as clean as possible. This type of setup is obviously unusual when people want to speak to their devices.
Third, the tone and cadence of read speech is different than that of spontaneous speech (the Common Voice dataset also has this problem, but they are coming up with ideas on how to prompt for spontaneous speech too).
But the goal of Common Voice was never to replace LibriSpeech or other open datasets (like TED talks) as training sets, but rather to complement them. You mention transfer learning. That is indeed possible. But it's also possible to simply put several datasets together and train on all of them from scratch. That is what Mozilla's DeepSpeech team has been doing since the beginning (you can read the Hacks blog post from Reuben Morais linked above for more context).
> Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a somewhat sophisticated setup to make the audio quality as clean as possible. This type of setup is obviously unusual when people want to speak to their devices.
It shouldn't be that hard to degrade the quality synthetically? And with a clean source you can synthesize different types of noise/distortions.
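For example, mixing recorded noise into a clean signal at a chosen signal-to-noise ratio is only a few lines of numpy. A sketch, where the sine wave and random noise stand in for real speech and real noise recordings:

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        # Loop or trim the noise to match the clean signal's length.
        if len(noise) < len(clean):
            noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
        noise = noise[:len(clean)]
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2)
        # Scale the noise so the resulting SNR in dB equals snr_db.
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise

    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for speech
    noise = rng.normal(0, 0.1, 16000)                           # stand-in for street/babble noise
    degraded = mix_at_snr(clean, noise, snr_db=10)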
I can't speak for voice data, as I've not worked with voice, but I did my MSc on various approaches for reducing error rates for OCR. I used a mix of synthetically degraded data ranging from applying different kinds of noise to physically degrading printed pages (crumpling, rubbing sand on them, water damage), and while it gave interesting comparative results between OCR engines, the types of errors I got never closely matched the types of errors I got from finding genuine degraded old books. I've seen that in other areas too.
My takeaway from that was that while synthetic degradation of inputs can be useful, and while it is "easy", the hard part is making it match real degradation closely enough to be representative. It's often really hard to replicate natural noise closely enough for it to be sufficient to use those kinds of methods.
Doesn't mean it's not worth trying, but I'd say that unless voice is very different it's the type of thing that's mostly worth doing if you can't get your hands on anything better.
You might say that if you can identify and simulate all cases of real-life degradation, your problem is basically solved: just reverse the simulation on your inputs.
I'm not saying OCR isn't hard. I'm saying normalizing all those characters basically is the problem.
> And with a clean source you can synthesize different types of noise/distortions.
You'd have to know what the most common types of noise are, how they interact with the signal, etc. This method of collecting data can provide useful info on what that noise actually is.
> tone and cadence of read speech is different than that of spontaneous speech
I don't think most people speak to their phone the same way they normally speak.
For example, I always speak slowly, with perfect pronunciation and intonation when talking to Siri.
> But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak,
The problem with the 'problem' you're describing is that the scope of speech recognition is being defined too narrowly.
If all you care about is creating an open source Alexa/Siri knockoff, then yes, you need to recognize conversational speech and not much else. But what if you do want to recognize scripted, rehearsed speech? What if you want a speech recognizer that can auto-transcribe movies, news broadcasts, or in fact audiobooks? Wouldn't it be nice if all audiobooks came with aligned text? That's an experience you can get right now with Kindle/Audible, but as far as I'm aware no FOSS ebook reading software supports it. If I have a public domain reading of Tom Sawyer from LibriVox and a text copy of Tom Sawyer from Project Gutenberg, how many hoops do I currently have to jump through to get the speech highlighted on my screen as the audiobook plays?
Recognizing all forms of speech should be the goal, not just one narrow albeit trendy sliver of speech.
When training speech recognition systems you want to use data that closely matches your target domain. Models trained on audiobooks read by professionals will not perform very well for transcribing conversational or spontaneous speech or if there is background noise.
But, if I understand correctly, systems can be trained separately to recognize "this is background noise", apply that filtering first, and then work with the cleaned audio, right? I've been using krisp.ai for a few weeks and it has been fantastic at doing exactly that in real-time.
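Whatever krisp.ai does internally is proprietary, but the estimate-the-noise-then-remove-it idea can be illustrated with classical spectral subtraction. A rough sketch, nothing like production quality:

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtract(audio, noise_sample, fs=16000):
        # noise_sample is a speech-free segment, e.g. the first half second.
        _, _, noisy_spec = stft(audio, fs=fs, nperseg=512)
        _, _, noise_spec = stft(noise_sample, fs=fs, nperseg=512)
        # Average noise magnitude per frequency bin.
        noise_mag = np.mean(np.abs(noise_spec), axis=1, keepdims=True)

        mag = np.abs(noisy_spec)
        phase = np.angle(noisy_spec)
        cleaned_mag = np.maximum(mag - noise_mag, 0.0)   # subtract, floor at zero

        _, cleaned = istft(cleaned_mag * np.exp(1j * phase), fs=fs, nperseg=512)
        return cleaned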
Regarding conversational speech, I get that. Books are definitely not conversational.
I guess the next question though, would be: is the objective to build a model that understands all words, or conversational speech? <novice> It seems like transfer learning on a model trained on audiobooks and then conversations would still be a good path, right? </novice>
You're right, these issues can also be tackled independently. Transfer learning can help, but my first guess would be that it's hard to get reasonable accuracy (= usable for applications) without hundreds of hours of conversational data. You could also attempt to directly modify the audiobook data by manually adding noises or performing other distortions.
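In practice the transfer-learning path usually looks something like the following. This is a hypothetical PyTorch sketch; the tiny network and checkpoint name are stand-ins for a real pretrained ASR encoder:

    import torch
    import torch.nn as nn

    model = nn.Sequential(              # stand-in for a pretrained ASR encoder
        nn.Linear(80, 256), nn.ReLU(),  # lower layers: generic acoustic features
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 29),             # output layer: e.g. characters
    )
    # model.load_state_dict(torch.load("audiobook_pretrained.pt"))  # hypothetical checkpoint

    # Freeze the lower layers; fine-tune only the top on conversational data.
    for layer in list(model.children())[:4]:
        for param in layer.parameters():
            param.requires_grad = False

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
    # ...training loop over the (much smaller) conversational dataset goes here.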
In any case, for read speech in particular there are several corpora out there already, including the moderately large LibriSpeech corpus (1000hr). The state-of-the-art accuracy on read speech is also very good -- for example, domain-specific dictation systems have been commercially viable for quite some time. So while it's true that audiobooks are a large untapped source, I think that there are other large-scale and richer options like YouTube or movies (i.e. videos with speech for which subtitles are available) that would be more useful to make progress towards good speech recognition systems.
Self-reply with more questions/thoughts. Based on what I know, it seems like the problem could break down as:
1. We have a lot of training data for the voices of white men reading stuff.
2. We have good models that already exist for removing background noise.
3. We might be able to build good models that could identify accents, gender, age variation.
4. We have good models for style transfer that work in the audio domain.
Could we take an audiobook read by a white guy, and use a style transfer model to give him a German accent, and then use the German-accented version as training data back into the speech recognition model? Could you use a reverse style transfer model to turn accented audio into non-accented audio (i.e. normalize it all to the place where we have the most training data)? Could we use a combination of style transfer models to vastly expand the training data set, and then train the conversational systems?
Or are the style transfer models not good enough? Or do we not have training data for style transfers to turn the voices of white men into the voices of white men with German accents?
I don't want to trivialize, but I'm genuinely curious how professionals are actually trying to solve this now?
> I guess the next question though, would be: is the objective to build a model that understands all words, or conversational speech? <novice> It seems like transfer learning on a model trained on audiobooks and then conversations would still be a good path, right? </novice>
Understanding all words is not the problem. I don't know if it's universal, but frequently, a speech-to-text model is actually two models: A voice model (mapping raw audio to phonemes) and a language model (which models what the language looks like, i.e. what sentences are likely and which words exist). So if you want the STT system to understand novels, include novels in the training data for the language model. You can then combine it with a voice model suitable for conversational speech/the user's accent/background noise.
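A toy illustration of why the split matters: the voice model proposes candidate transcripts, and a language model trained on the target domain (novels, dictation, whatever) picks the most plausible one. Everything below is made up and absurdly small:

    import math
    from collections import Counter

    # "Domain" text for the language model -- in practice, novels, dictation, etc.
    corpus = "call me ishmael some years ago never mind how long precisely".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def lm_score(sentence):
        words = sentence.split()
        score = 0.0
        for a, b in zip(words, words[1:]):
            # Add-one smoothing so unseen pairs are unlikely, not impossible.
            p = (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
            score += math.log(p)
        # Normalize by length so shorter candidates aren't favoured.
        return score / max(1, len(words) - 1)

    # Pretend the voice model produced these with similar acoustic scores.
    candidates = ["call me ishmael", "call me is male", "colony smell"]
    print(max(candidates, key=lm_score))   # -> call me ishmael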
Transfer learning is not guaranteed to work well. Even the features learned in the first layers usually look very different if the model was trained in a clean environment. Background noise is not just a simple stationary signal, but very different audio patterns like music or other voices.
Audiobooks have a very small number of accents. Often they're recorded by middle-aged or older men with a standard accent. (Think newscasters, and most actors.)
One of the advantages of Common Voice is the breadth of voices. By having more, and more varied, voices, you can build a more robust system that works for more people.
For instance, my wife is a nonnative English speaker, and had trouble for years using Siri. Alexa can’t understand small children. And of course, elevators can’t understand Scottish accents. https://m.youtube.com/watch?v=NMS2VnDveP8
Speaking and reading out loud are vastly different. Try it yourself: record yourself explaining a concept with zero preparation and then read a text explaining the concept.
Also (I am not a lawyer but) just because the source material is out of copyright doesn't mean the audiobook is out of copyright.
The parent comment was presumably talking about the massive library of unencumbered audio books based on those books, such as those created by LibriVox.
You need to line up the timestamps for each spoken sentence to each written sentence. Some data sources, like Amazon's "Whisper Sync" can do this already.
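Absent Whisper Sync, one low-tech way to get there is to run any ASR that reports word timestamps over the audio and match its imperfect output against the book text. A sketch with difflib; the asr_words list is hard-coded here, standing in for real recognizer output:

    import difflib

    book_text = "the old lady pulled her spectacles down and looked over them".split()

    # (word, start_seconds) pairs as a recognizer with word timestamps might emit.
    asr_words = [("the", 0.0), ("old", 0.3), ("lady", 0.5), ("pulled", 0.9),
                 ("her", 1.2), ("spectacle", 1.4), ("down", 1.9), ("and", 2.1),
                 ("looked", 2.3), ("over", 2.6), ("them", 2.8)]

    matcher = difflib.SequenceMatcher(a=[w for w, _ in asr_words], b=book_text)
    alignment = {}  # book word index -> timestamp
    for block in matcher.get_matching_blocks():
        for i in range(block.size):
            alignment[block.b + i] = asr_words[block.a + i][1]

    # Each aligned word now has a time, so a reader app can highlight it as the
    # audio plays; unmatched words can be interpolated between their neighbours.
    for idx, word in enumerate(book_text):
        print(f"{word:12s} {alignment.get(idx, '~')}")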
It's great to see innovation in the space of open data.
There have recently been a number of assertions that better quality ML data will outperform better ML algorithms, and this has certainly been true in my experience as well, especially in domains like speech recognition.
There's going to be a long road to catch up to the big players, however. Even 15 years ago there were companies doing 1M minutes of labelled voice data per year.
The data gap between established players and newcomers to the market will continue to grow unless we invest in efforts like this.
The original Deep Speech 2 paper, released a few years ago, mentioned a hundred thousand or two hundred thousand hours, and the amount of data has increased significantly since then.
Still, quite a lot of languages have very tiny datasets of transcribed data.
I decided to spend 10-ish minutes validating voices, because why not.
I did not hear any lines that were flat-out spoken incorrectly, at least as far as I could tell. However, I did come across a ton of really poor samples, to the point of being somewhat difficult to understand. Things like:
• Really strong accents
• Horrible, muffled microphones
• Background noise
• Super quiet
• A couple "robotic" samples I legitimately think were generated via text-to-speech software
All of these types of samples (save the last) constitute possible real-world scenarios. But do they make for good training data? I know very little about machine learning, but it makes logical sense to me that you'd want to teach the computer with "clean" data: something with a high signal-to-noise ratio which is as close to the "average" of the real world as possible. Is this completely wrong?
Separately, they ought to provide some instruction on what to do with borderline samples. If I legitimately can't tell for sure whether a word was spoken correctly, what should I do?
Yes, you do want the bad voice samples, precisely because they correspond to actual input that a model might receive. A dataset with only clear samples would likely have a stronger "signal" overall, meaning it might be easier for an academic testing a model to get higher test accuracy training on that data. The bad data makes the model more robust to non-ideal input types.
In fact, it's not uncommon to take a dataset and deliberately, randomly distort it: for images, things like scaling, rotating, cropping, blurring, altering color balance and gamma, flipping, adding random noise...
The idea is to make the model resistant to that "bad" input and effectively enlarge your dataset for free: if you have a picture of a cat, you can automatically also get loads of pictures that you know should still be classified as a cat, rotated 15 degrees clockwise, noisy (like in low-light conditions), with the tip of its tail out of frame, with the camera's automatic white balance screwed up...
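In Pillow/numpy terms, that cat example looks roughly like this (the file name is a placeholder):

    import numpy as np
    from PIL import Image, ImageEnhance, ImageFilter, ImageOps

    original = Image.open("cat.jpg")   # placeholder path

    augmented = [
        original.rotate(15, expand=True),                               # rotated 15 degrees
        ImageOps.mirror(original),                                      # flipped horizontally
        original.resize((original.width // 2, original.height // 2)),   # scaled down
        original.filter(ImageFilter.GaussianBlur(2)),                   # blurred
        ImageEnhance.Brightness(original).enhance(0.5),                 # darker, like low light
    ]

    # Additive pixel noise, roughly simulating a noisy sensor in low light.
    arr = np.asarray(original).astype(np.int16)
    noisy = np.clip(arr + np.random.randint(-25, 26, arr.shape), 0, 255)
    augmented.append(Image.fromarray(noisy.astype(np.uint8)))
    # Every entry in `augmented` still gets the label "cat".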
Also, the robotic samples may be real human voices mangled by LPC compression. Think of a lossy VoIP call.
In reading the comments here I see that some people do use voice with their hand rectangles for other purposes than speaking to someone. I never do even though there is some Google Assistant icon staring at me. If I accidentally start recording my voice by fumbling buttons then I instinctively try to stop it so I can continue pecking into the keyboard.
Is it a generational thing and will future generations find typing into a search box as anachronistic as I find using a land line with a rotary dial?
I am also British and therefore not as loud as some people in the English-speaking world. Talking to my phone on the train would make me cringe. Clearly people like me will die off soon enough; however, is adoption a problem for these voice technology things? How do people get into changing habits from pecking at a keyboard to the evidently easier voice-driven way of doing things? Is there one use case, e.g. in the car, where the habit of speaking to a gadget is learned?
When I was one-handed for a while after surgery, I would use my phone's speech-to-text function often to send emails. Far faster than one-handed typing.
For a while I had voice commands working/enabled on my phone, and it was nice being able to read messages and reply to them while commuting, without taking my hands off the wheel or my eyes off the road. Had the accuracy been better, I'd probably still be using it.
I’ve been contributing to Common Voice for several months now. If anyone else is thinking of making contributions, it’s worth mentioning that there are a lot more speakers than validators and English currently has a 1 year validation backlog, so new validators are more useful right now than new speakers.
Interesting! I started doing some validation a while ago then stopped because I figured it would be the other way around and I wasn't prepared to speak. I will start validating again!
I'm puzzled by the "You agree to not attempt to determine the identity of speakers in the Common Voice dataset".
On some level it's a good idea to want to request this, but as the dataset is public-domain, isn't it going to get mirrored and retrieved by people who won't have to agree to anything? ...
It has no legal meaning whatsoever since, indeed, it's public domain. Or maybe it can affect the downloader, but certainly not anyone who got the data from the downloader (without such a promise). I think the point is to remind people that the project isn't cool with it, and that just because you can (CC0) doesn't mean you should.
As an update, in looking at the code it seems they hardcode the choice of MP3 for storage in the server.
Also very weird that they serve the tarball of MP3s gzipped, which seems mostly pointless to me, as it amounts to a reduction of maybe ~4% for a tremendous amount of time spent uncompressing the tarball (which itself has a bunch of useless macOS-specific headers, three apiece on each file).
Many of the MP3 files are literally just empty (zero size) or partially written (corrupt), it seems. I wonder if that issue comes from their choice to package the tarballs on macOS, or some underlying issue on the server side.
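A quick sanity check on a downloaded dump, counting empty or suspiciously small clips (the directory path is a placeholder):

    from pathlib import Path

    clips = list(Path("cv_corpus/clips").glob("*.mp3"))
    empty = [p for p in clips if p.stat().st_size == 0]
    tiny = [p for p in clips if 0 < p.stat().st_size < 1024]   # likely truncated

    print(f"{len(clips)} clips, {len(empty)} empty, {len(tiny)} under 1 KB")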
I think it's on purpose, to not give too much of an advantage to downloaders; there was a discussion a long time ago, but they still continue with MP3. There are also other not-so-nice points: https://github.com/kaldi-asr/kaldi/issues/2141
This is amazing! I've been using LJS trained models and then cross-training them to target speakers, but this looks like it may produce even higher quality results.
I've previously implemented concatenative TTS using unit selection [1]. The quality is spotty, so I'm throwing it out and going with the ML approach, which produces higher fidelity voices even in my own experiments.
My next steps are taking an end-to-end synthesis model and porting it to run cheaply on the CPU.
Thanks so much for making this data available, Mozilla! You're helping democratize this technology for individual engineers and researchers that don't have Google's resources.
I'm interested in using this for text to speech, rather than speech to text. Is wavenet still the state of the art for training on a dataset like this?
The biggest feature of this dataset is that it includes lots of different accents and non-native speakers. This should improve voice recognition in these areas, but I'm not sure if it is that useful for voice generation.
It would be super rad if someone plugged this dataset into CMU Sphinx. Google's is the only decent ASR, and a competitive open source alternative would be awesome. A big thank you to Mozilla for this dataset.