
Audiobooks can definitely be used for ASR training. Indeed, the largest open ASR training dataset before Common Voice was LibriSpeech (http://www.openslr.org/12/). Also note that the first release of Mozilla's DeepSpeech models was trained and tested on LibriSpeech: https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error...

But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak, especially when it comes from very old texts (and many public domain books are indeed quite old).

Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the narrator is often using a fairly sophisticated setup to make the audio quality as clean as possible. That setup looks nothing like the conditions in which people actually speak to their devices.

Third, the tone and cadence of read speech are different from those of spontaneous speech (the Common Voice dataset also has this problem, but they are coming up with ideas on how to prompt for spontaneous speech too).

But the goal of Common Voice was never to replace LibriSpeech or other open datasets (like the TED talks) as training sets, but rather to complement them. You mention transfer learning; that is indeed possible. But it's also possible to simply put several datasets together and train on all of them from scratch. That is what Mozilla's DeepSpeech team has been doing since the beginning (you can read the hacks blog post from Reuben Morais linked above for more context).
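
To make the "put several datasets together" part concrete, here is a minimal sketch (the file names are hypothetical, and the columns are just the wav path / size / transcript layout that DeepSpeech-style import scripts typically produce):

    import pandas as pd

    # Hypothetical training manifests, one per dataset, each with the
    # columns: wav_filename, wav_filesize, transcript
    manifests = ["librispeech_train.csv", "common_voice_train.csv", "ted_train.csv"]

    # Concatenate into a single manifest, shuffle, and write it out so the
    # trainer sees one combined dataset.
    combined = pd.concat([pd.read_csv(m) for m in manifests], ignore_index=True)
    combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)
    combined.to_csv("combined_train.csv", index=False)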


> Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the narrator is often using a fairly sophisticated setup to make the audio quality as clean as possible. That setup looks nothing like the conditions in which people actually speak to their devices.

It shouldn't be that hard to degrade the quality synthetically? And with a clean source you can synthesize different types of noise/distortions.
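
Something along these lines, for example (a rough sketch; the SNR and low-pass cutoff are made-up values, it assumes a mono 16-bit WAV, and real room/device conditions are messier than white noise plus a filter):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, lfilter

    def degrade(in_path, out_path, snr_db=10.0, cutoff_hz=3400.0, seed=0):
        """Add white noise at a target SNR and band-limit a clean mono recording."""
        sr, audio = wavfile.read(in_path)
        audio = audio.astype(np.float32)

        # White noise scaled to the requested signal-to-noise ratio.
        rng = np.random.default_rng(seed)
        signal_power = np.mean(audio ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        noisy = audio + rng.normal(0.0, np.sqrt(noise_power), audio.shape)

        # Crude low-pass filter to mimic a cheap microphone or telephone band.
        b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
        degraded = lfilter(b, a, noisy)

        wavfile.write(out_path, sr, np.clip(degraded, -32768, 32767).astype(np.int16))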


I can't speak for voice data, as I've not worked with voice, but I did my MSc on various approaches to reducing error rates in OCR. I used a mix of synthetically degraded data, ranging from applying different kinds of noise to physically degrading printed pages (crumpling, rubbing sand on them, water damage), and while it gave interesting comparative results between OCR engines, the errors I got never closely matched the errors I got from genuinely degraded old books. I've seen that in other areas too.
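
For concreteness, the synthetic side was roughly this kind of thing (a simplified sketch, not my actual pipeline; the noise probability and blur radius are arbitrary):

    import numpy as np
    from PIL import Image, ImageFilter

    def degrade_page(path, noise_prob=0.02, blur_radius=1.0, seed=0):
        """Apply salt-and-pepper noise and a mild blur to a scanned page image."""
        rng = np.random.default_rng(seed)
        img = np.asarray(Image.open(path).convert("L"), dtype=np.uint8).copy()

        # Salt-and-pepper noise: flip a small fraction of pixels to black or white.
        mask = rng.random(img.shape)
        img[mask < noise_prob / 2] = 0
        img[mask > 1 - noise_prob / 2] = 255

        # Mild Gaussian blur to mimic smudged or low-resolution print.
        return Image.fromarray(img).filter(ImageFilter.GaussianBlur(blur_radius))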

My takeaway was that while synthetic degradation of inputs can be useful, and while it is "easy", the hard part is making it match real degradation closely enough to be representative. It's often really hard to replicate natural noise closely enough for those kinds of methods to be sufficient.

Doesn't mean it's not worth trying, but I'd say that unless voice is very different it's the type of thing that's mostly worth doing if you can't get your hands on anything better.


You might say that if you can identify and simulate all cases of real-life degradation, your problem is basically solved: just reverse the simulation on your inputs.

I’m not saying OCR isn’t hard. I’m saying that normalizing all those characters basically is the problem.


This isn't quite true if e.g. there are degenerate cases.


> And with a clean source you can synthesize different types of noise/distortions.

You'd have to know what the most common types of noise are, how they interact with the signal, etc. This method of collecting data can provide useful info on what that noise actually is.


> the tone and cadence of read speech are different from those of spontaneous speech

I don't think most people speak to their phone the same way they normally speak. For example, I always speak slowly, with perfect pronunciation and intonation when talking to Siri.


I think the goal is to let you speak to your phone normally, not to continue having to speak as if your phone were hard of hearing.


> But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak,

The problem with the 'problem' you're describing is that the scope of speech recognition is being defined too narrowly.

If all you care about is creating an open source Alexa/Siri knockoff, then yes, you need to recognize conversational speech and not much else. But what if you do want to recognize scripted, rehearsed speech? What if you want a speech recognizer that can auto-transcribe movies, news broadcasts, or indeed audiobooks? Wouldn't it be nice if all audiobooks came with aligned text? That's an experience you can get right now with Kindle/Audible, but as far as I'm aware no FOSS ebook reading software supports it. If I have a public domain reading of Tom Sawyer from LibriVox and a text copy of Tom Sawyer from Project Gutenberg, how many hoops do I currently have to jump through to get the speech highlighted on my screen as the audiobook plays?

Recognizing all forms of speech should be the goal, not just one narrow albeit trendy sliver of speech.
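
To make the alignment example concrete: the closest thing to a ready-made path right now is a forced aligner such as aeneas, which (if I remember its API correctly) looks roughly like this, with placeholder file paths:

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # Force-align a LibriVox recording with the matching Gutenberg text,
    # producing a JSON sync map of (begin, end, text fragment) entries.
    config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
    task = Task(config_string=config)
    task.audio_file_path_absolute = "/path/to/tom_sawyer_ch01.mp3"
    task.text_file_path_absolute = "/path/to/tom_sawyer_ch01.txt"
    task.sync_map_file_path_absolute = "/path/to/tom_sawyer_ch01.json"

    ExecuteTask(task).execute()
    task.output_sync_map_file()

And even then you still need a reader that understands the sync map, which is exactly the piece FOSS ebook software is missing.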


The article states:

> “But Reitze counters that the complete data from that first run is already available online. According to Shoemaker, this includes the relevant time series data and the programs used, but "it's not a trivial matter to use them." Caltech even held a training workshop on how to deal with gravitational-wave data. That's a pretty far cry from asking the physics community to take its analysis on faith...”

Are you refuting this statement? To me, it looks like the data and analysis are open, but not yet independently verified due to the difficult nature of the problem space.


I used to work in particle/astrophysics. By and large, people weren't opposed to sharing their data and software. The degree of specific expertise it took to get results was staggering, though. It takes significant technical infrastructure and expertise to even just rerun an analysis on a big corpus of data. At that stage, you haven't even validated squat, just pushed some buttons to run other people's logic. Doing this properly takes person-years of effort with little to no reward.

In practice, there are usually at least two big experiments (not necessarily quite of the same generation), built by different groups, that corroborate results. This is currently the best defense against big mistakes.


I did not read the article all the way to the last paragraph. My bad. Of course, I do not refute it. So in this particular case they went open source.

However, from my experience over years of publishing in and reviewing for several APS journals, I stand by my statement. Because of 'publish or perish', nobody wants to give up any 'competitive edge' in their particular research. So there is no incentive system in place to foster an open-source-everything attitude.


This is changing. Slowly, as it seems to be mostly a generational issue.

The younger scientists, the current PhD generation and a little above, are all fed up with closed source and private data, as there are now enough bright examples of how well open source and open data can work out.


Just to add my two cents (I work for Mozilla on Common Voice): without help from linguists, Common Voice would have made some very different and very bad decisions about things like accents, dialect segmentation, corpus curation, and licensing. The linguists were absolutely instrumental. We tried to thank some of them at the end of our blog post: https://medium.com/mozilla-open-innovation/more-common-voice...


Just to note, we will never require your email address to contribute. There will always be an anonymous contribution workflow.

But adding new languages to Common Voice is a bit complicated at the moment, and we haven't built a way to do this through the website yet. So for now, we are doing this through a very manual process, and we plan to use email addresses to communicate.


Thank you for bringing this up.

Indeed, Common Voice is not for everyone. We try to make it clear in our Privacy Policy [1] what pieces of data we collect and why. We do not publish email addresses or names with the data, and we even strip speaker identification info (so that a speaker's recordings are not grouped; instead, everyone's recordings go into one giant bucket). That said, if this still makes you feel uncomfortable, we understand. And if you would like to contribute without donating your voice, you can always validate the recordings of others.

[1] https://voice.mozilla.org/en/privacy


In the early days of this project, before we shipped the website (i.e. ~March 2017), we did some explorations around Mechanical Turk. The problem with the Mech Turk approach is that for recording voices you need a lot of different people speaking (i.e. tens of thousands). But for languages other than English, Mech Turk simply doesn't have those kinds of numbers. And indeed, English is not that interesting to us, since public data already exists for English (see LibriSpeech). There are of course other micro-task platforms popular in other countries (for instance, there are a myriad of them in Indonesia), but we didn't have the time to manage jobs on all these different platforms.

However, Mech Turk is better for things like validation, since you only need a handful of people doing the majority of work.

In any case, I have some very hacky tools we used for this exploration, if you are interested: https://github.com/mikehenrty/mech-turk/


There has been some discussion around this, but no real movement yet: https://github.com/mozilla/voice-web/issues/336


Would you mind filing an issue? https://github.com/mozilla/voice-web/issues


> Also... (too lazy to check right now) - if I create an account, can I see the 'yes/no' ratings of my own submissions?

Not yet, but this is something in the works. You can explore our new experience with the evergreen link: http://bit.ly/cv-desktop-ux


We do have an issue filed to allow users to tag recordings with certain metadata, like noisy audio or male/female voice: https://github.com/mozilla/voice-web/issues/814

It is something we are still working on.

> The other thing is that it's very cool to see the "you helped us reach out x% goal" thing but it locks up all the previous / next shortcuts which means I have to switch back to the mouse after 5 entries.

That's a bug! Would you mind filing one here: https://github.com/mozilla/voice-web/issues


> The other thing is that it's very cool to see the "you helped us reach out x% goal" thing but it locks up all the previous / next shortcuts which means I have to switch back to the mouse after 5 entries.

We have a workaround in place, btw: https://github.com/mozilla/voice-web/issues/1179#issuecommen...

