From the WER numbers alone it looks like a very small difference for English itself, but I've found WER to be a misleading assessment mechanism.
Having extensively tested Whisper v2 large against other 'lower WER' models and found them wanting (because of differences in their methodology for generating output), I'm super curious to get a feel for how v3 holistically behaves.
Czech pronunciation is extremely regular and straightforward (sounds close to Latin or even Italian) with no weird "which vowel was that" or "half the word is silent" features and just a few exceptions. Usually if you write a letter, you pronounce the sound, and if you hear a sound, you write the letter.
A great example is that — for most words from any language that uses a subset of the Czech alphabet — a Czech speaker can just pronounce the word instead of spelling it and another Czech speaker will be able to write it down.
e.g. "messerschmitt", "nešamas", "cadeira", "philosophy", "tastaturi", "nicchia", "kaupunki", "abordagem", "povjerilac", "primauté" are all foreign words with very unambiguous pronunciation in Czech.
I don't know Czech, but Italian is extremely consistent in the way it's written, so it's at the top of the list with about one or two orders of magnitude less data.
Czech pronunciation is actually very close to Italian (and both close to Latin). We don't do the "ce" and "ci" and "gn" things (we do a "di, ti, ni" thing instead), and we use diacritics to soften certain sounds (ž,š,č,ď,ň), but even ignoring all that and plowing right through, an Italian speaker pronouncing Czech text should be easily intelligible and even spot on for some words.
I'm more impressed by Korean! I didn't even realize it was that good in V2. But I've just seen a lot of systems perform really poorly (judged by my Korean gf, not me), and Korea is only a country of 52M (between Spain and Italy).
A funny note, if Siri is set in Korean mode and reads your texts that come in as English, they sound like a racist imitation of a Korean accent. It is absolutely hilarious.
I also find it funny that Portuguese is also better than English (a Brazilian talking here). I guess it's probably the nature of the languages, the phonetics...
It does work amazingly in PT-BR with Whisper V2; I can't even imagine it being better, and it turns out V3 promises to be better still...
It looks like it's basically whisper-2 with extra training against datasets for specific languages that brought incidental improvements to the rest. Support for some of the languages is still really bad (from real-world experience).
Does anyone know of a nice UI wrapper for something like whisper.cpp?
I need to write a lot of long texts for work and some good dictation software would be great. I know there's Dragon, but somehow I have not been able to find something that fits my need and is free.
Still doesn't look like it can do real-time unfortunately.
Edit: I understand that you can use small samples and approximate something like streaming, but the limitation here is you wind up without context for the samples, increasing WER. It would be nice if there was some streaming option.
It's inefficient, but even older gaming GPUs are fast enough for real time performance, and accuracy is good. If you were going to train a model from scratch for real time you could do something more efficient, but it works as is.
Edit: I'm not sure what you mean by "you wind up without context for the samples". You can supply context to Whisper.
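For anyone wondering what "supply context" looks like in practice, here is a minimal sketch using the openai-whisper Python package's initial_prompt parameter; the file name and prompt text are just placeholders, and "large-v2" is one of the published checkpoint names.

    import whisper

    model = whisper.load_model("large-v2")

    # initial_prompt seeds the decoder with prior text, and
    # condition_on_previous_text carries each segment's output into the next one.
    result = model.transcribe(
        "meeting_part2.wav",  # placeholder file name
        initial_prompt="Earlier we discussed the Q3 roadmap and the data migration.",
        condition_on_previous_text=True,
    )
    print(result["text"])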
Unfortunately Microsoft's store doesn't allow requesting more than 6GB of VRAM in a store listing, so that's why it says 6 and 12 in the same listing. The lowest I've tested so far is a 3060 12GB. I wouldn't expect 6GB to work, but if you're willing to give it a try I'd be interested to know what happens.
I'd like to support less VRAM. Maybe a future version will offload some of the processing to the cloud.
whisper-cpp can do about 6x-10x realtime with the older Whisper model on my 2021 M1 Macbook. I use this to transcribe multiple-hour-long podcasts. The tiny model can easily do 30x realtime on this hardware.
A few providers have done a variant of live transcription that's similar to how old-school providers do it, where they transcribe a short window (i.e. XXXms at a time) and this is definitely the easiest path. One such provider is Gladia: https://www.gladia.io/
There are other ways too, with different trade-offs; you can e-mail me at the link in my profile if you'd like to talk about how.
I've been looking for this. Thanks for the recommendation.
I like startups where you can sign up, use it in seconds, integrate it in minutes.
(I am definitely inspired as we don't currently provide such a straightforward experience in our own signup flow)
I tried it in English, French, and my broken Spanish, and all 3 came out great. One surprising thing is that if you switch languages mid-transcription with the "single language" model, it will transcribe the second language and translate it at the same time, so the entire transcription is in a single language, but the meaning is preserved.
So... you have to operate in chunks, generally 30s at a time, although the time per chunk drops if part of the chunk is just zeroes, and there are variants of this approach.
The zeroes + faster model is how Gladia (mentioned in my other comment here) achieves live transcription by simply transcribing really short chunks one after the other, I believe.
For more advanced stuff you kinda have to get your hands dirty, which I've done for my own product (not linked).
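To make the 30-second-window point concrete, this is roughly what a single-chunk pass looks like with the openai-whisper package's lower-level API; pad_or_trim is what zero-pads (or cuts) the audio to exactly 30s. The file name is a placeholder.

    import whisper

    model = whisper.load_model("base")

    audio = whisper.load_audio("short_clip.wav")   # placeholder file name
    audio = whisper.pad_or_trim(audio)             # zero-pad (or cut) to exactly 30 s
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
    print(result.text)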
Thanks for mentioning Gladia. This is not exactly how it works, however: our version of Whisper is modified from the original one to avoid hallucinations, and we are releasing a new model in a few days that is even better in this regard. Also worth mentioning are the 3 main problems that occur when it comes to real-time: endpointing, context reinjection (while avoiding hallucinations, which is a main issue with Whisper, as prompt injection generates a lot of hallucinations in general), and finally alignment. Timestamps are extremely important in real time if you want to realign with the original streamed audio. Whisper tends to be hard to handle in all of these areas.
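On the alignment point: the openai-whisper package does expose word-level timestamps; whether they're accurate enough to realign streamed audio is another question, but here is a minimal sketch (the file name is a placeholder).

    import whisper

    model = whisper.load_model("large-v2")

    # word_timestamps=True aligns each word to a start/end time,
    # which is what you need to re-sync text with the original audio.
    result = model.transcribe("call_segment.wav", word_timestamps=True)
    for segment in result["segments"]:
        for word in segment["words"]:
            print(f'{word["start"]:7.2f} -> {word["end"]:7.2f}  {word["word"]}')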
Interesting teaser. I thought there must be some way to better optimize the model for real time, but haven't dug in because it's decently fast as is and there's so much other stuff to work on. So many models, so little time!
As another commenter pointed out, you can give context for the decoder. So you can feed previous chunks into the model as the context. This is how we do it for streaming, at least.
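A rough sketch of that idea with the openai-whisper package: transcribe the stream in fixed chunks and feed a rolling tail of the previous output back in as initial_prompt. The chunk length, tail length, and file name below are arbitrary placeholders, not anyone's production settings.

    import whisper
    from whisper.audio import SAMPLE_RATE  # 16 kHz

    model = whisper.load_model("base")

    def transcribe_with_rolling_context(path, chunk_seconds=10):
        audio = whisper.load_audio(path)
        step = chunk_seconds * SAMPLE_RATE
        context, pieces = "", []
        for start in range(0, len(audio), step):
            chunk = audio[start:start + step]
            result = model.transcribe(chunk, initial_prompt=context or None, fp16=False)
            text = result["text"].strip()
            pieces.append(text)
            context = (context + " " + text)[-500:]  # keep only a recent tail as the prompt
        return " ".join(pieces)

    print(transcribe_with_rolling_context("stream_capture.wav"))  # placeholder file name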
I've built a Google Docs-like doc editor that has real-time Whisper transcription built in. It's not released yet, but message me if you'd like to try it out!
This is great, but I hope that in the future there will be a speech-to-text model with a focus on low-resource languages, probably by balancing the dataset similarly to No Language Left Behind (NLLB) released by Meta. That's a translation model that works really well even with low-resource languages; something similar for speech transcription would be really cool.
That's a very confusing direction, having your own Whisper versioning. It may be better to call the model something different to differentiate the versions.
There are 2 issues here. The first is that speech-to-text is a lot more useful if it can run in realtime, which puts upper limits on model size. The bigger reason, however, is probably that there's a lot less data for text-to-speech.
https://github.com/openai/whisper/blob/main/language-breakdo...