We're using AssemblyAI too, and I agree that their transcription quality is good. But as soon as Whisper supports word-level timestamps, I think we'll seriously consider switching, as the price difference is large ($0.36 per hour vs $0.90 per hour).
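To put that gap in concrete terms, here is a rough back-of-the-envelope sketch. It reads $0.36 as the Whisper rate and $0.90 as the AssemblyAI rate, which is how the comment above reads; the function names and the 1,000-hour volume are made up for illustration.

```python
# Hourly rates quoted in the comment above (USD per hour of audio).
# Assumption: $0.36 is the Whisper price, $0.90 the AssemblyAI price.
WHISPER_RATE = 0.36
ASSEMBLYAI_RATE = 0.90


def transcription_cost(hours: float, rate_per_hour: float) -> float:
    """Cost in USD of transcribing `hours` of audio at a flat hourly rate."""
    return hours * rate_per_hour


def savings_from_switching(hours: float) -> float:
    """Difference in USD between the two quoted rates for the same volume."""
    return transcription_cost(hours, ASSEMBLYAI_RATE) - transcription_cost(
        hours, WHISPER_RATE
    )


if __name__ == "__main__":
    hours = 1000  # hypothetical monthly volume
    print(f"Whisper:    ${transcription_cost(hours, WHISPER_RATE):.2f}")
    print(f"AssemblyAI: ${transcription_cost(hours, ASSEMBLYAI_RATE):.2f}")
    print(f"Difference: ${savings_from_switching(hours):.2f}")
```

At a flat rate the difference scales linearly with volume, so it only starts to matter once you are transcribing a lot of audio.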
Both of those prices strike me as quite high, given that Whisper can be run relatively quickly on commodity hardware. It's not like the bandwidth is significant either; it's just audio.
It's pretty great from my perspective. I've been creating little supplemental ~10 minute videos for my class (using Descript; I should probably switch to OBS), and the built-in transcription is both wonderful (that it has it at all and is easy to fix) and horrible (the number of errors is very high). I'd happily pay a dime to have a higher quality starting transcription that saves me 5 minutes of fixing...
It produces high-quality transcriptions from video and audio (English only, sorry if that's not you!). It uses Whisper.cpp plus VAD (voice activity detection) to skip silent / non-speech sections, which normally introduce errors. Give it a try and let me know what you think! :)
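For anyone curious what "VAD to skip silent sections" can look like, here is a minimal sketch using a plain energy threshold. This is not the tool's actual implementation and not part of Whisper.cpp; the function name, frame length, and threshold are assumptions chosen for illustration, and real VADs are considerably more robust than this.

```python
import math
from typing import List, Sequence, Tuple


def speech_segments(
    samples: Sequence[float],
    frame_len: int = 160,      # hypothetical: 10 ms at 16 kHz
    threshold: float = 0.02,   # hypothetical RMS cutoff for "non-silent"
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose frame RMS is at or above threshold.

    Samples are expected as floats in [-1.0, 1.0]. Adjacent loud frames are
    merged into one segment; quiet frames are dropped.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = i          # a loud run begins here
        elif start is not None:
            segments.append((start, i))  # loud run ended at this quiet frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))  # audio ended mid-run
    return segments


if __name__ == "__main__":
    # Synthetic signal: silence, a constant-amplitude "loud" block, silence.
    audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 160
    print(speech_segments(audio))
```

The idea would be to pass only the returned ranges to the transcriber, so the model never sees long stretches of silence or noise.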