
You split the audio up and send it over in a loop, passing the transcript of the last call as the prompt for the next one. See item 2 here: https://platform.openai.com/docs/guides/speech-to-text/promp...
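A minimal sketch of that loop, with a hypothetical `transcribe` callable standing in for the actual API call (audio chunking and the OpenAI client are left out):

```python
def transcribe_chunks(chunks, transcribe):
    """Transcribe chunks in order, feeding each call the previous
    chunk's transcript as the prompt so context carries across splits.

    `chunks` is a list of audio segments; `transcribe(chunk, prompt=...)`
    is whatever wraps the actual speech-to-text request.
    """
    texts = []
    prompt = ""
    for chunk in chunks:
        text = transcribe(chunk, prompt=prompt)
        texts.append(text)
        prompt = text  # the n-1 transcript becomes the next call's prompt
    return " ".join(texts)
```

With the real API, `transcribe` would upload the chunk and pass `prompt` through to the transcription endpoint's prompt parameter.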



And:

> we suggest that you avoid breaking the audio up mid-sentence as this may cause some context to be lost.

That's really easy to put in a document but much harder to do in practice. Granted, it might not matter much in the real world; not sure yet.

Still, this will require more hand holding than I'd like.


I doubt breaking up mid-sentence will matter much if you pass in the previous transcript as the prompt and split on word boundaries. This is how Whisper handles long audio internally.

It's not absolutely perfect, but splitting on a word boundary is one line of code with the same package their docs use: https://github.com/jiaaro/pydub/blob/master/API.markdown#sil...

25 MB is also a lot: that's 30 minutes to an hour of MP3 at reasonable compression. A 2-hour movie would only split into about three chunks.


In case it helps, I just wrote a script that splits the audio and uses the prompt parameter to provide context from the n-1 segment's transcription: https://gist.github.com/patrick-samy/cf8470272d1ff23dff4e2b5...


The page includes a five-line Python example of how to split audio without breaking mid-word.



