Wow, not a single mention of Whisper on this entire first page of comments! I think Whisper is really cool: the large model can pull speech out of even heavily distorted (wind noise, clipping, etc.) audio. I have a story to illustrate why running Whisper locally on your own is not so easy! Much easier to sign up for the OpenAI API.
In my research I found that pre-processing the audio to reduce noise (using what is IMO the best-in-class FB research "denoiser") actually increases WER. This was surprising! From a human perspective, I assumed boosting the "signal" would increase accuracy. But it seems that, from a machine perspective, there's actually "information" to be gleaned from the heavily distorted, noisy part of the signal. To me, this is amazing because it reveals a difference in how machines vs. humans process audio. The implication is that there is speech signal hiding inside the noise, as if the voice has bounced off and interacted with the noise source (wind, fan, etc.), altered those sounds, left its impression, and that this information can then be utilized and contributes to the inference. Incredible!
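To give a flavor of how I was measuring this, here's a rough sketch in Python using the jiwer package; the transcripts and file names below are made-up placeholders, not my actual data:

    # Rough sketch: compare word error rate with and without denoising.
    # Assumes you already have two Whisper transcripts of the same clip,
    # one from the raw audio and one from the denoised audio.
    from jiwer import wer  # pip install jiwer

    reference    = "ground truth transcript of the clip goes here"
    hyp_raw      = "whisper transcript of the raw audio goes here"
    hyp_denoised = "whisper transcript of the denoised audio goes here"

    print("WER raw:     ", wer(reference, hyp_raw))
    print("WER denoised:", wer(reference, hyp_denoised))
    # In my tests the denoised number kept coming out worse, not better.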
With Whisper: I started with the standard Python models. They're kind of slow. I tried compiling the Python into a single binary using various tools. That didn't work. Then I found whisper.cpp--fantastic! A port of Whisper to C++ that is so. much. faster. Mind-blowing speed! Plus easy compilation. My use case was including transcription in a private, offline "transcribe anything" macOS app. whisper.cpp was the way to go.
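For reference, the slow Python path looks roughly like this (openai-whisper package; "recording.wav" is a placeholder, and the whisper.cpp command in the comment may differ slightly depending on the version you build):

    # The simple-but-slow Python path (pip install openai-whisper):
    import whisper

    model = whisper.load_model("large")         # or "tiny", "base", "small", "medium"
    result = model.transcribe("recording.wav")  # placeholder path
    print(result["text"])

    # The whisper.cpp equivalent is a single native binary, roughly:
    #   ./main -m models/ggml-large.bin -f recording.wav
    # (binary name and flags can vary between whisper.cpp versions)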
Then I encountered another problem: what the "Whisperists" (experts in this nascent field, I guess) call "hallucination". The model will "hallucinate". I found this hilarious! Another crossover of human-machine conceptual models, us forever anthropomorphizing everything effortlessly. :)
Basically, hallucination goes like this: feed Whisper a long period of silence, and the model is so desperate to find speech that it will infer (overfit? hallucinate?) speech out of the random background signal of silence / analog silence / background noise. Normally this presents as a loop of repeats of the previously (accurately) transcribed phrase. Or, with smaller models, some "end-of-YouTube-video" common phrases like "Thank you!" or even "Thanks for watching". I even got (from one particularly heavily distorted section, completely inaccurately) "Don't forget to like and subscribe!" Haha. The larger models produce fewer hallucinations, and fewer of the generic "oh-so-that's-what-your-dataset-was!" hallucinations. But they do still hallucinate, especially during silent sections.
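If you want to see it for yourself, a quick sketch like this (numpy/soundfile plus the Python whisper package) reproduced it reliably enough on the versions I tried:

    # Feed Whisper pure silence and watch what comes back.
    import numpy as np
    import soundfile as sf   # pip install soundfile
    import whisper

    sf.write("silence.wav", np.zeros(16000 * 30, dtype=np.float32), 16000)  # 30 s of silence
    model = whisper.load_model("base")
    print(model.transcribe("silence.wav")["text"])
    # With the smaller models this is where I'd get things like "Thanks for watching!"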
At first, I tried using ffmpeg to chop the audio into small segments, ideally partitioned on silences. Unfortunately, ffmpeg can only chop into regular-size segments, but it can output silence intervals, and you can chop around those (though not "online" / in real time, as I was trying to achieve). Removing the silent segments (even with the imperfect metric of "some %" of the average output signal magnitude (sorry for my terminology, I'm not an expert in DSP/audio)) drastically improved Whisper performance. Suddenly it went from hallucinating during silent segments to producing perfect transcripts.
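The silence-interval part looks roughly like this (a sketch; "input.wav" and the thresholds are placeholders you'd tune for your audio):

    # Sketch of the silence-detection step: ffmpeg's silencedetect filter prints
    # silence_start / silence_end timestamps to stderr, and you chop around them.
    import re
    import subprocess

    cmd = ["ffmpeg", "-i", "input.wav", "-af",
           "silencedetect=noise=-30dB:d=0.5",   # threshold and minimum duration
           "-f", "null", "-"]
    log = subprocess.run(cmd, capture_output=True, text=True).stderr

    starts = [float(m) for m in re.findall(r"silence_start: ([\d.]+)", log)]
    ends   = [float(m) for m in re.findall(r"silence_end: ([\d.]+)", log)]
    print(list(zip(starts, ends)))   # the intervals to cut out before transcribing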
The other problem with silent segments is that the model gets stuck. It gets "locked up" (spinning beach ball, blue screen of death style). I don't think it actually dies, but it spends a disproportionately long time on segments with no speech. Like I said before, it's so cute that it's so desperate to find speech everywhere: it tries really hard, and works its little legs off during silence, but to no avail.
Anyway, moving on to the next problem: the imperfect metric of silence. This caused many issues. We were chopping out quieter speech, and we were including loud background noise. Both caused problems: the first is obvious; the second is the same one we faced before, where Whisper (or whisper.cpp) would hallucinate text into these noise segments.
At last, I discovered something truly great: VAD. Voice Activity Detection is another (usually ML-based) technique for segmenting audio around voice activity. I tried a couple of Python implementations from standard speech toolkits, but none were that good. Then I found Silero VAD: an MIT-licensed (for some model versions) neural VAD model. Wonderful!
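For the curious, the Python-side usage is roughly this (API names are from the Silero VAD version I used, so they may have shifted since):

    # Silero VAD from Python, more or less as its README shows:
    import torch

    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

    wav = read_audio("input.wav", sampling_rate=16000)
    speech = get_speech_timestamps(wav, model, sampling_rate=16000)
    print(speech)   # e.g. [{'start': 1056, 'end': 36352}, ...] in samples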
The next problem: Silero VAD was also in Python, and I needed it in C++. Luckily there was a C++ example using the ONNX Runtime. (I had no idea any of these projects or tools existed mere weeks ago, and suddenly I'm knee deep!) There were a few errors, but I got rid of the bugs and ended up with a little command-line tool built from a minimal C++ build of ONNX Runtime / protobuf-lite plus the model. The last step was converting the ONNX model to ORT format. Luckily there's a handy Python script to do this inside the Python release of ONNX Runtime. And now the VAD was super fast.
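The conversion step was basically this (module path as of the onnxruntime version I had; "silero_vad.onnx" is a placeholder for wherever your model lives):

    # ONNX -> ORT conversion, driven from Python (the same thing the handy
    # script does; you can also just run the module from the command line):
    #   python -m onnxruntime.tools.convert_onnx_models_to_ort silero_vad.onnx
    import subprocess
    import sys

    subprocess.run([sys.executable, "-m",
                    "onnxruntime.tools.convert_onnx_models_to_ort",
                    "silero_vad.onnx"], check=True)
    # Writes silero_vad.ort (plus config files) next to the input model.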
So I put all these pieces together (ffmpeg, VAD, whisper.cpp) and made a macOS app (with the correct signing and entitlements, of course!) to transcribe English text from any input format, audio or video. Pretty cool, right?
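For illustration, here's the same pipeline sketched end to end in Python (the actual app does the equivalent natively with whisper.cpp and the ONNX VAD; "input.mp4" is a placeholder):

    # End-to-end sketch of the pipeline, all in Python:
    # 1. ffmpeg decodes any audio/video input to 16 kHz mono WAV
    # 2. Silero VAD finds the speech regions
    # 3. Whisper transcribes only the speech, with the silence dropped
    import subprocess
    import torch
    import whisper

    subprocess.run(["ffmpeg", "-y", "-i", "input.mp4",
                    "-ar", "16000", "-ac", "1", "audio.wav"], check=True)

    vad, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _, read_audio, _, collect_chunks = utils
    wav = read_audio("audio.wav", sampling_rate=16000)
    speech = get_speech_timestamps(wav, vad, sampling_rate=16000)

    model = whisper.load_model("base.en")
    audio = collect_chunks(speech, wav).numpy()   # keep only the speech chunks
    print(model.transcribe(audio)["text"])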
Anyway, running Whisper on your own locally is not so easy! Much easier to sign up for the OpenAI API.
Thanks! So, OK, if you write a review and provide your honest feedback about this on the App Store, I will definitely consider doing that! Sound like a bad idea? :)
Thank you, sir! I will look for it. Stay tuned for updates! I might just consider putting that in soonish :) But I might not. I don't know. Can't guarantee anything about it right now. Thank you for asking and telling me about this!
macOS app using Whisper (C++) and VAD, conveniently called WisprNote heh :) https://apps.apple.com/app/wisprnote/id1671480366