Right on time.

Google's Speech-to-Text costs $0.024 per minute ($0.016 per minute with data logging enabled), with 60 free minutes per month. Files shorter than 1 minute can be posted directly to the server; anything longer has to be uploaded to a bucket first, which complicates things, but at least they're GDPR compliant.

Whisper is $0.006 per minute, with the following data usage policies:

- OpenAI will not use data submitted by customers via our API to train or improve our models, unless you explicitly decide to share your data with us for this purpose. You can opt-in to share data.

- Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).
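For a rough idea of what those prices mean per month, here is a minimal sketch using only the per-minute figures quoted above. It assumes billing is strictly linear per minute and that the 60 free Google minutes are simply subtracted from the total; it ignores any rounding to billing increments. The function names are made up for illustration.

```python
# Prices as quoted in this comment (USD per minute of audio).
GOOGLE_PER_MIN = 0.024
GOOGLE_LOGGING_PER_MIN = 0.016
GOOGLE_FREE_MIN = 60
WHISPER_PER_MIN = 0.006


def google_cost(minutes: float, logging: bool = False) -> float:
    """Monthly Google cost, assuming the free minutes are deducted first."""
    rate = GOOGLE_LOGGING_PER_MIN if logging else GOOGLE_PER_MIN
    billable = max(0.0, minutes - GOOGLE_FREE_MIN)
    return billable * rate


def whisper_cost(minutes: float) -> float:
    """Monthly Whisper API cost, assuming no free tier."""
    return minutes * WHISPER_PER_MIN


if __name__ == "__main__":
    for minutes in (30, 80, 1000):
        print(
            f"{minutes:>5} min: Google ${google_cost(minutes):.2f}, "
            f"Whisper ${whisper_cost(minutes):.2f}"
        )
```

Under these assumptions the free tier keeps Google cheaper up to about 80 minutes a month; beyond that Whisper wins on price.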

I've been using Whisper on a server (CPU only) to transcribe recordings made during a bike ride with a lavalier microphone. The audio is pretty noisy because of the wind and the tires, and Whisper did better than Google on it.

Plus, when used with `response_format="verbose_json"`, Whisper outputs the fields `temperature`, `avg_logprob`, `compression_ratio` and `no_speech_prob` for each segment, which can be used very effectively to filter out most of the hallucinations.
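A minimal sketch of such a filter, assuming the `segments` list from the `verbose_json` response has already been parsed into plain dicts. The thresholds are assumptions borrowed from the defaults of the open-source `whisper` package (`logprob_threshold`, `compression_ratio_threshold`, `no_speech_threshold`), and the function names are made up for illustration; noisy recordings may well need different values.

```python
from typing import Iterable

# Assumed cut-offs (defaults of the open-source whisper package, not tuned values).
MIN_AVG_LOGPROB = -1.0        # below this, the decoder was not confident
MAX_COMPRESSION_RATIO = 2.4   # above this, the text is suspiciously repetitive
MAX_NO_SPEECH_PROB = 0.6      # above this, the segment is probably not speech


def looks_hallucinated(segment: dict) -> bool:
    """Return True if any per-segment score crosses its assumed threshold."""
    return (
        segment["avg_logprob"] < MIN_AVG_LOGPROB
        or segment["compression_ratio"] > MAX_COMPRESSION_RATIO
        or segment["no_speech_prob"] > MAX_NO_SPEECH_PROB
    )


def keep_plausible(segments: Iterable[dict]) -> list[dict]:
    """Keep only the segments that pass all three checks."""
    return [s for s in segments if not looks_hallucinated(s)]
```

Dropping a segment as soon as any one score looks bad is the strictest variant; requiring two of the three to fail would discard less real speech at the cost of letting more hallucinations through.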

A one-minute file that takes 26 seconds to transcribe on a CPU is done in 6 seconds via this service. Another one-minute file with a lot of "silence" needs around 56 seconds on a CPU and was ready in 4.3 seconds via the service. "Silence" means that maybe 5 seconds of the file contain speech while the rest is wind and other environmental noises. Another relatively silent one went from 90 seconds down to 5.4. On the CPU I was using the medium model, while the service is using large-v2.
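The speedups implied by those timings can be worked out in a couple of lines. The numbers are the ones quoted above; since the CPU runs used the medium model and the service uses large-v2, this is not a like-for-like benchmark.

```python
# (cpu_seconds, service_seconds) for the three files mentioned above
TIMINGS = [(26.0, 6.0), (56.0, 4.3), (90.0, 5.4)]


def speedup(cpu_seconds: float, service_seconds: float) -> float:
    """How many times faster the service was than the CPU run."""
    return cpu_seconds / service_seconds


if __name__ == "__main__":
    for cpu, svc in TIMINGS:
        print(f"{cpu:.0f}s -> {svc:.1f}s: {speedup(cpu, svc):.1f}x faster")
```

That works out to roughly 4x, 13x and 17x, with the mostly silent files gaining the most.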

A couple of days ago I posted an example to a thread [0], where I was getting the following with Whisper:

---

00:00.000 --> 00:05.000 Also temperaturmäßig ist es recht gut. [So temperature wise, it's pretty good.]

00:05.000 --> 00:09.000 Der eine hat 12 Grad, der andere 10. [One has 12 degrees, the other 10. (I have two temperature sensors mounted on the bike, ESP32 streaming the data to the phone via BLE)]

00:09.000 --> 00:12.000 Also sagen wir mal, 10 Grad. [So let's say 10 degrees.]

00:14.000 --> 00:19.000 Es ist bewölkt und windig. [It's cloudy and windy.]

00:20.000 --> 00:24.000 Aber irgendwie vom Wetter her gut. [But somehow from the weather it's good.]

00:24.000 --> 00:31.000 Ich habe heute überhaupt nichts gegessen und sehr wenig getrunken. [I ate nothing at all today and drank very little.]

00:54.000 --> 00:59.000 Vielen Dank für's Zuschauen! [Thanks for watching!] <-- hallucinated

---

Google, meanwhile, was outputting:

"Also temperaturmäßig es ist recht gut, der eine hat 12° andere 10. Es ist angemalte 10 Grad. Es ist bewölkt und windig, aber er hat sie vom Wetter her gut, ich wollte überhaupt nichts gegessen und sehr wenig getrunken."

["So temperature-wise it's pretty good, one has 12° other 10. It's painted 10 degrees. It's cloudy and windy, but he has it good from the weather, I did not want to eat anything at all and drank very little."]

---

Apart from the hallucinated line, Whisper got everything correct, and the hallucinated line could be discarded thanks to fields like `avg_logprob`.

[0] https://news.ycombinator.com/item?id=34877020#34880531