
It's also annoying that there appears to be a hard limit of 25 MiB on the request size, so you have to split up larger files yourself and manage the "prompt" passed to subsequent calls. And, near as I can tell, how you're expected to use that prompt value isn't documented.


You split up the audio and send it over in a loop. Pass in the transcript of the last call as the prompt for the next one. See item 2 here: https://platform.openai.com/docs/guides/speech-to-text/promp...
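A minimal sketch of that loop, assuming the audio is already split into chunk files. `transcribe_in_order` and `transcribe_fn` are hypothetical names, not part of any SDK: `transcribe_fn` stands in for whatever wrapper sends one chunk plus a prompt to the transcription endpoint and returns the text.

```python
from typing import Callable, Iterable, List


def transcribe_in_order(
    chunks: Iterable[str],
    transcribe_fn: Callable[[str, str], str],
) -> str:
    """Transcribe chunks sequentially, feeding each transcript forward as the next prompt.

    transcribe_fn(chunk, prompt) is a hypothetical wrapper around the real API call.
    """
    transcripts: List[str] = []
    prompt = ""  # the first chunk has no preceding context
    for chunk in chunks:
        text = transcribe_fn(chunk, prompt)
        transcripts.append(text)
        prompt = text  # the previous segment's transcript becomes the next prompt
    return " ".join(transcripts)
```

Keeping the API call behind a plain callable means you can dry-run the chaining logic with a stub before spending money on real requests.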


And:

> we suggest that you avoid breaking the audio up mid-sentence as this may cause some context to be lost.

That's really easy to put in a document, much harder to do in practice. Granted, it might not matter much in the real world, not sure yet.

Still, this will require more hand holding than I'd like.


I doubt it will matter that you're breaking up mid-sentence, as long as you pass in the previous transcript as a prompt and split between words rather than inside them. This is how Whisper does it internally.

It's not absolutely perfect, but splitting on the word boundary is one line of code with the same package in their docs: https://github.com/jiaaro/pydub/blob/master/API.markdown#sil...
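One catch: splitting on silence tends to give you lots of short segments, so you'd probably want to glue consecutive ones back together until they approach the limit. A rough sketch of that regrouping step follows. `pack_segments` is a hypothetical helper, not something pydub provides. It assumes each segment reports its size through `len()` (as far as I know, pydub's `AudioSegment` reports milliseconds that way) or through a `length` function you pass in.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def pack_segments(
    segments: Sequence[T],
    max_len: int,
    length: Callable[[T], int] = len,
) -> List[List[T]]:
    """Greedily group consecutive segments so each group's total length stays <= max_len.

    A single segment longer than max_len ends up alone in its own group.
    """
    groups: List[List[T]] = []
    current: List[T] = []
    current_len = 0
    for seg in segments:
        seg_len = length(seg)
        # Start a new group when adding this segment would exceed the budget.
        if current and current_len + seg_len > max_len:
            groups.append(current)
            current, current_len = [], 0
        current.append(seg)
        current_len += seg_len
    if current:
        groups.append(current)
    return groups
```

Note the API limit is in bytes, not milliseconds, so `max_len` would have to be derived from the bitrate you export at.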

25MB is also a lot. That's 30 minutes to an hour on MP3 at reasonable compression. A 2 hour movie would have three splits.
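Back-of-the-envelope check of those numbers, assuming constant-bitrate audio and ignoring container overhead. The 64 and 128 kbps figures are my guess at "reasonable compression", and the function names are just for illustration.

```python
import math

LIMIT_BYTES = 25 * 1024 * 1024  # the 25 MiB request limit mentioned upthread


def max_minutes(bitrate_kbps: float, limit_bytes: int = LIMIT_BYTES) -> float:
    """Minutes of constant-bitrate audio that fit in limit_bytes."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_bytes / bytes_per_second / 60


def chunks_needed(duration_s: float, bitrate_kbps: float, limit_bytes: int = LIMIT_BYTES) -> int:
    """Number of pieces a constant-bitrate file must be cut into to stay under the limit."""
    total_bytes = duration_s * bitrate_kbps * 1000 / 8
    return math.ceil(total_bytes / limit_bytes)


print(round(max_minutes(128), 1))   # 27.3
print(round(max_minutes(64), 1))    # 54.6
print(chunks_needed(2 * 3600, 64))  # 3
```

So a 2-hour file fits in three pieces at 64 kbps. At 128 kbps it would take five.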


In case it helps, I just wrote a script to split the audio and use the prompt parameter to provide context from the previous (n-1) segment's transcription: https://gist.github.com/patrick-samy/cf8470272d1ff23dff4e2b5...


The page includes a five line Python example of how to split audio without breaking mid-word.


I suggest you give revoldiv.com a try. We use Whisper and other models together. You can upload very large files and get an hour-long file transcribed in less than 30 seconds. We use intelligent chunking so that the model doesn't lose context. We are looking to increase the limit even more in the coming weeks. It's also free to transcribe any video/audio with word-level timestamps.


I just gave it a try, and the results are impressive! Do you also offer an API?


If you're interested in an offline / local solution: I made a Mac App that uses Whisper.cpp and Voice Activity Detection to skip silence and reduce Whisper hallucinations: https://apps.apple.com/app/wisprnote/id1671480366

If it really works for you, I can add command line params in an update, so you can use it as a "local API" for free.


Contact us at team@revoldiv.com; we are offering an API on a case-by-case basis.



