From the WER numbers alone it looks like a very small difference for English itself, but I've found WER to be a misleading assessment mechanism.
Having extensively tested Whisper v2 large against other 'lower WER' models and found them wanting (because of differences in their methodology for generating output), I'm super curious to get a feel for how v3 holistically behaves.
Czech pronunciation is extremely regular and straightforward (sounds close to Latin or even Italian) with no weird "which vowel was that" or "half the word is silent" features and just a few exceptions. Usually if you write a letter, you pronounce the sound, and if you hear a sound, you write the letter.
A great example is that — for most words from any language that uses a subset of the Czech alphabet — a Czech speaker can just pronounce the word instead of spelling it and another Czech speaker will be able to write it down.
e.g. "messerschmitt", "nešamas", "cadeira", "philosophy", "tastaturi", "nicchia", "kaupunki", "abordagem", "povjerilac", "primauté" are all foreign words with very unambiguous pronunciation in Czech.
I don't know Czech, but Italian is extremely consistent in the way it's written, so it's at the top of the list with about one or two orders of magnitude less data.
Czech pronunciation is actually very close to Italian (and both close to Latin). We don't do the "ce" and "ci" and "gn" things (we do a "di, ti, ni" thing instead), and we use diacritics to soften certain sounds (ž,š,č,ď,ň), but even ignoring all that and plowing right through, an Italian speaker pronouncing Czech text should be easily intelligible and even spot on for some words.
I'm more impressed by Korean! I didn't even realize it was that good in V2. But I've just seen a lot of systems perform really poorly (judged by my Korean gf, not me), and Korea is only a country of 52M (between Spain and Italy).
A funny note, if Siri is set in Korean mode and reads your texts that come in as English, they sound like a racist imitation of a Korean accent. It is absolutely hilarious.
I also find it funny that Portuguese is also better than English (a Brazilian talking here). I guess it's probably the nature of the languages, the phonetics...
It does work amazingly in PT-BR with Whisper V2; I can't even imagine it being better, and it turns out V3 promises to be better still...
It looks like it's basically whisper-2 with extra training against datasets for specific languages that brought incidental improvements to the rest. Support for some of the languages is still really bad (from real-world experience).
Does anyone know of a nice UI wrapper for something like whisper.cpp?
I need to write a lot of long texts for work and some good dictation software would be great. I know there's Dragon, but somehow I have not been able to find something that fits my need and is free.
Still doesn't look like it can do real-time unfortunately.
Edit: I understand that you can use small samples and approximate something like streaming, but the limitation here is you wind up without context for the samples, increasing WER. It would be nice if there was some streaming option.
It's inefficient, but even older gaming GPUs are fast enough for real time performance, and accuracy is good. If you were going to train a model from scratch for real time you could do something more efficient, but it works as is.
Edit: I'm not sure what you mean by "you wind up without context for the samples". You can supply context to Whisper.
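For anyone wondering what "supply context" looks like in practice, here is a minimal sketch using the openai-whisper Python package's initial_prompt parameter; the file name and prompt text are just placeholders, and "large-v2" is one of the published checkpoint names.

    import whisper

    model = whisper.load_model("large-v2")

    # initial_prompt seeds the decoder with prior text, and
    # condition_on_previous_text carries each segment's output into the next one.
    result = model.transcribe(
        "meeting_part2.wav",  # placeholder file name
        initial_prompt="Earlier we discussed the Q3 roadmap and the data migration.",
        condition_on_previous_text=True,
    )
    print(result["text"])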
Unfortunately Microsoft's store doesn't allow requesting more than 6GB of VRAM in a store listing, so that's why it says 6 and 12 in the same listing. The lowest I've tested so far is a 3060 12GB. I wouldn't expect 6GB to work, but if you're willing to give it a try I'd be interested to know what happens.
I'd like to support less VRAM. Maybe a future version will offload some of the processing to the cloud.
whisper-cpp can do about 6x-10x realtime with the older Whisper model on my 2021 M1 Macbook. I use this to transcribe multiple-hour-long podcasts. The tiny model can easily do 30x realtime on this hardware.
A few providers have done a variant of live transcription that's similar to how old-school providers do it, where they transcribe a short window (i.e. XXXms at a time) and this is definitely the easiest path. One such provider is Gladia: https://www.gladia.io/
There are other ways too, with different trade-offs; you can e-mail me at the link in my profile if you'd like to talk about how.
I've been looking for this. Thanks for the recommendation.
I like startups where you can sign up, use it in seconds, integrate it in minutes.
(I am definitely inspired as we don't currently provide such a straightforward experience in our own signup flow)
I tried it in English, French, and my broken Spanish, and all 3 came out great. One surprising thing is that if you switch languages mid-transcription with the "single language" model, it will transcribe the second language and translate it at the same time, so the entire transcription is in a single language, but the meaning is preserved.
So... you have to operate in chunks, generally 30s at a time, although the time per chunk drops if part of the chunk is just zeroes, and there are variants of this approach.
The zeroes + faster model is how Gladia (mentioned in my other comment here) achieves live transcription by simply transcribing really short chunks one after the other, I believe.
For more advanced stuff you kinda have to get your hands dirty, which I've done for my own product (not linked).
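To make the 30-second-window point concrete, this is roughly what a single-chunk pass looks like with the openai-whisper package's lower-level API; pad_or_trim is what zero-pads (or cuts) the audio to exactly 30s. The file name is a placeholder.

    import whisper

    model = whisper.load_model("base")

    audio = whisper.load_audio("short_clip.wav")   # placeholder file name
    audio = whisper.pad_or_trim(audio)             # zero-pad (or cut) to exactly 30 s
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
    print(result.text)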
Thanks for mentioning Gladia. This is not exactly how it works, however: our version of Whisper is modified from the original one to avoid hallucinations, and we are releasing a new model in a few days that is even better in this regard. Also worth mentioning are the 3 main problems that occur when it comes to real-time: endpointing, context reinjection (while avoiding hallucinations, which is a main issue with Whisper, as prompt injection generates a lot of hallucinations in general), and finally alignment. Timestamps are extremely important in real time if you want to realign with the original streamed audio. Whisper tends to be hard to handle in all of these areas.
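On the alignment point: the openai-whisper package does expose word-level timestamps; whether they're accurate enough to realign streamed audio is another question, but here is a minimal sketch (the file name is a placeholder).

    import whisper

    model = whisper.load_model("large-v2")

    # word_timestamps=True aligns each word to a start/end time,
    # which is what you need to re-sync text with the original audio.
    result = model.transcribe("call_segment.wav", word_timestamps=True)
    for segment in result["segments"]:
        for word in segment["words"]:
            print(f'{word["start"]:7.2f} -> {word["end"]:7.2f}  {word["word"]}')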
Interesting teaser. I thought there must be some way to better optimize the model for real time, but haven't dug in because it's decently fast as is and there's so much other stuff to work on. So many models, so little time!
As another commenter pointed out, you can give context for the decoder. So you can feed previous chunks into the model as the context. This is how we do it for streaming, at least.
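A rough sketch of that idea with the openai-whisper package: transcribe the stream in fixed chunks and feed a rolling tail of the previous output back in as initial_prompt. The chunk length, tail length, and file name below are arbitrary placeholders, not anyone's production settings.

    import whisper
    from whisper.audio import SAMPLE_RATE  # 16 kHz

    model = whisper.load_model("base")

    def transcribe_with_rolling_context(path, chunk_seconds=10):
        audio = whisper.load_audio(path)
        step = chunk_seconds * SAMPLE_RATE
        context, pieces = "", []
        for start in range(0, len(audio), step):
            chunk = audio[start:start + step]
            result = model.transcribe(chunk, initial_prompt=context or None, fp16=False)
            text = result["text"].strip()
            pieces.append(text)
            context = (context + " " + text)[-500:]  # keep only a recent tail as the prompt
        return " ".join(pieces)

    print(transcribe_with_rolling_context("stream_capture.wav"))  # placeholder file name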
I've built a Google Docs-like doc editor that has real-time Whisper transcription built in. It's not released yet, but message me if you'd like to try it out!
This is great, but I hope that in the future there will be a speech-to-text model with a focus on low-resource languages, probably by balancing the dataset similarly to No Language Left Behind (NLLB) released by Meta. That's a translation model that works really well even with low-resource languages; something similar for speech transcription would be really cool.
That's a very confusing direction, having your own Whisper versioning. It may be better to call the model something different to differentiate the versions.
There are 2 issues here. The first is that speech-to-text is a lot more useful if it can run in realtime, which puts upper limits on model size. The bigger reason, however, is probably that there's a lot less data for text-to-speech.
https://github.com/openai/whisper/blob/main/language-breakdo...