Hacker News

Transcript of 20230302 170602.m4a (at 2023-03-02 1711.02).txt

   Hey friend, your use case sounds really interesting.
   Actually that's why I created this app initially.
   I really love riding around on my bike in the city
   and doing voice memo debriefs about whatever.
   I also like to do it walking around outside.
   And as you say, the trouble with that is wind distortion.
   On a day that's not too windy, it's not too bad;
   the models can still pull the text out.
   But the more distortion you have,
   the more of a disaster it is.
   I don't know anything about the multilingual case,
   but for English I definitely find that
   small is more than enough if you have good-quality audio.
   Medium is worth using if some kind of distortion
   is causing errors with small.
   With really clean audio, even tiny is enough,
   though it will miss some rare words.
   So small is basically good enough for English.
   That aligns with what you said:
   I remember the Whisper paper showing that
   in the multilingual case the word error rate
   actually increases from the medium model to the large one,
   which is kind of interesting.
   So medium, I think, is all you really need,
   and running large locally is probably a waste of time.
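The model-size rule of thumb above can be sketched with the open-source openai-whisper package. The quality labels and thresholds in pick_model are just this transcript's heuristic, nothing official, and the transcribe call is shown commented because it needs the model weights and ffmpeg:

```python
# Rule of thumb encoded as a helper: clean English audio -> tiny/small
# is plenty; distortion-induced errors -> step up to medium; multilingual
# -> medium. (Informal heuristic only -- tune for your own audio.)

def pick_model(language: str, audio_quality: str) -> str:
    """audio_quality is one of 'clean', 'ok', 'noisy' (informal labels)."""
    if language != "en":
        return "medium"  # multilingual: medium is the sweet spot
    return {"clean": "tiny", "ok": "small", "noisy": "medium"}[audio_quality]

# Transcribing with the openai-whisper package looks roughly like this
# (requires `pip install openai-whisper` plus ffmpeg on the PATH):
#
#   import whisper
#   model = whisper.load_model(pick_model("en", "noisy"))
#   print(model.transcribe("ride_memo.m4a")["text"])

print(pick_model("en", "clean"))   # tiny
print(pick_model("en", "noisy"))   # medium
```

A real app would presumably cache the loaded model between files rather than reloading it per memo.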
   That doesn't apply to the OpenAI API case, though,
   because they're running their own special model.
   It's very fast, and they're presumably going to keep
   retraining it and improving it over time,
   which is cool.
   I did extensive research and experimentation
   on cleaning up audio before transcription,
   and there's basically no way to do it.
   If you have a medium-to-bad level of noise
   that the transcription models can still work with,
   you're fine; just go with that.
   There's no point trying to denoise the sound first:
   that just seems to reduce the signal,
   and it actually increases the word error rate.
   So give the models the raw, distorted, windy audio
   and they'll do the best they can.
   You can't actually improve on that, I found.
   I tried all kinds of different ways to process it
   and none of them helped,
   including the best denoiser I could find,
   the Facebook Research denoiser.
   So my conclusion was that I'd hit
   a sort of fundamental physical limit,
   and that denoising is really only useful for humans:
   if you want to listen to the audio again,
   you probably don't want to hear all that wind.
   For medium-to-bad, but not extreme,
   wind or other noise distortion,
   a denoiser like the Facebook Research one
   will remove all or nearly all of it.
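For reference, the Facebook Research denoiser mentioned here is the facebookresearch/denoiser repo. Based on its README, the CLI looks roughly like this; the flags may have changed since, so treat it as a hedged sketch and check the repo:

```shell
pip install denoiser
# --dns64 picks one of the pretrained models; the directory
# arguments here are placeholders for your own paths
python -m denoiser.enhance --dns64 --noisy_dir=windy_memos/ --out_dir=cleaned/
```

Per the discussion above, this is worth running for human listening, not as a preprocessing step before Whisper.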
   So I basically decided that the only way
   to get better-quality transcripts
   when you're recording outside on a windy day
   is not software enhancement,
   because in the extreme-distortion case
   nothing I tried produced results.
   It's to change your microphone setup:
   put some sort of baffle or windscreen around the mic,
   or maybe use one of those little microphones
   that clip close to your throat.
   I'm no expert, but I think you can find a way
   to get less distorted audio outside
   by changing the microphone setup,
   and that's basically all you can do.
   Once you've prevented a lot of that noise at the source,
   even if the resulting audio is still somewhat distorted,
   the models can be okay with it.

End of Transcript (created with WisprNote)



I'm not preprocessing the recordings. The thing I mentioned about considering VAD, if that's what you're referring to, is that I should somehow trim the files.

In order to minimize interaction with my phone during the bike ride, I press a button that records one minute of audio. If I know I need more time, I press it again before the minute ends; this starts a second recording in parallel, which also lasts one minute. So I just have to press a button and can forget about it. I noticed that I usually don't need more than one minute to record a thought, and if I have multiple thoughts, I can put them in multiple files.
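That one-button scheme can be sketched as a tiny bit of state; the start_recording helper mentioned in the comment is hypothetical (whatever actually captures audio in the app), everything else is plain Python:

```python
import time

class OneButtonRecorder:
    """Each press starts an independent fixed-length clip, so pressing
    again before a clip ends simply starts a second clip in parallel
    (a separate file in the real app)."""

    def __init__(self, clip_seconds=60.0):
        self.clip_seconds = clip_seconds
        self.clips = []  # list of (start, end) times

    def press(self, now=None):
        now = time.time() if now is None else now
        # a real app would call something like start_recording(new_file())
        # here -- that helper is hypothetical, not part of any library
        self.clips.append((now, now + self.clip_seconds))

    def active_clips(self, now):
        return sum(start <= now < end for start, end in self.clips)

rec = OneButtonRecorder()
rec.press(now=0.0)    # first thought starts a 60 s clip
rec.press(now=50.0)   # pressed again before the minute ends
print(rec.active_clips(55.0))  # 2: the clips overlap from t=50 to t=60
```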

But since my recordings then usually consist of about 20 seconds of audio, the 30 seconds at the end are only silence (well, wind and tire noise). Whisper splits the files into 30-second segments and apparently tries to find voice in each one, so the remaining segment with no voice causes issues, where Whisper starts hallucinating. This is why I would like to trim the files.
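One stdlib-only way to trim that trailing near-silence before handing a file to Whisper is a naive energy gate over 16-bit PCM samples. This is a sketch, not a real VAD; something like webrtcvad or Silero would hold up better against wind and tire noise:

```python
import array

def trim_trailing_silence(samples, rate, frame_ms=30, threshold=500):
    """Drop trailing frames whose peak amplitude stays under `threshold`.
    `samples` is 16-bit PCM as an array of ints. Naive energy gate, not a
    real VAD -- loud wind noise may need a higher threshold."""
    frame_len = max(1, rate * frame_ms // 1000)
    end = len(samples)
    while end > 0:
        frame = samples[max(0, end - frame_len):end]
        if max((abs(s) for s in frame), default=0) >= threshold:
            break  # found the last frame that still contains signal
        end -= frame_len
    return samples[:max(0, end)]

# 1 s of fake "speech" followed by 2 s of quiet noise, at 8 kHz
rate = 8000
speech = array.array("h", [4000, -4000] * (rate // 2))
silence = array.array("h", [50, -50] * rate)
trimmed = trim_trailing_silence(speech + silence, rate)
print(len(trimmed) / rate)  # about 1.0 s of audio kept
```

Reading the samples out of an m4a would still need ffmpeg or similar to decode to PCM first.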

I've now noticed that the service doesn't add punctuation or capitalization. The funny thing is that I took that output and pasted it into ChatGPT like this: "Correct the following: '[text from whisper]'", and it does an incredible job of fixing even words which Whisper erred on.
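That correction step can also be scripted. The prompt wording mirrors the comment above; the model name and the (2023-era) openai-python call are assumptions, so adapt them to whatever SDK version you have:

```python
# Build the same "Correct the following: '...'" prompt used above.
def build_correction_prompt(whisper_text: str) -> str:
    return f"Correct the following: '{whisper_text}'"

raw = ("ich habe gestern erste tests mit open ai whisper gemacht "
       "um nozizen zu transkribieren")
prompt = build_correction_prompt(raw)
print(prompt[:22])  # Correct the following:

# With an API key set, the actual call looked roughly like this in the
# 2023-era openai-python library (check the current SDK for the new shape):
#
#   import openai
#   reply = openai.ChatCompletion.create(
#       model="gpt-3.5-turbo",
#       messages=[{"role": "user", "content": prompt}],
#   )
#   corrected = reply.choices[0].message.content
```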

-

Whisper:

ich habe gestern erste tests mit open ai whisper gemacht um nozizen [sic!] zu transkribieren

[ Yesterday I did my first tests with open ai whisper to transcribe nozizen [sic!]. ]

es waren teilweise recht gute ergebnisse vor allem mit medium

[ there were some really good results, especially with medium ]

latsch [sic!] natürlich besser aber da sind die anforderungen zu hoch

[ latsch [sic!] better of course, but the demands are too high ]

wenn ich da einen server draus mache könnte ich mal eine zeit lang ausprobieren ob sich das lohnt

[ If I make a server out of it I could try it out for a while to see if it's worth it ]

auch für anrufe der anruf der antworten

[ also for calls the call of the answers ]

-

then ChatGPT:

Ich habe gestern erste Tests mit OpenAI's "Whisper" gemacht, um Notizen zu transkribieren. Die Ergebnisse waren teilweise recht gut, vor allem mit "Medium". "Large" funktioniert natürlich besser, aber die Anforderungen sind zu hoch. Wenn ich einen Server dafür bereitstelle, könnte ich mal für eine Zeit lang ausprobieren, ob sich das lohnt, auch für Anrufe und Antworten.

[ Yesterday I made first tests with OpenAI's "Whisper" to transcribe notes. The results were sometimes quite good, especially with "Medium". "Large" works better, of course, but the requirements are too high. If I provide a server for it, I could try it out for a while to see if it's worth it, also for calls and answers. ]

I'm sorry that this is in German, but I don't have anything in English I've been testing on.


This is great! Thank you. We're very similar, actually: I also tried getting ChatGPT to correct transcripts for errors, but when I tried about three weeks ago it couldn't manage it. I just use the Voice Memos app and let it run; I just talk and don't think about the file. So I only have one file, and my WisprNote app removes nearly all of the non-speech and passes it to Whisper. I think there's actually a Voice Memos setting on macOS that will cut silence automatically, but I don't use it.



