
> Does anyone want AI in anything?

I want AI in text-to-speech (TTS) engines, in transliteration/translation, and... routing tickets to the correct teams/people would also be awesome :) (classification where mistakes can easily be corrected)
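For the ticket-routing case, a minimal sketch of that kind of easily-corrected classifier with scikit-learn (the tickets and team labels below are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical historical tickets and the teams that resolved them.
    tickets = ["VPN drops every hour", "Invoice total is wrong", "Laptop won't boot"]
    teams = ["network", "billing", "hardware"]

    router = make_pipeline(TfidfVectorizer(), LogisticRegression())
    router.fit(tickets, teams)

    # Route a new ticket; a human can cheaply fix any misrouting.
    print(router.predict(["Payment page shows the wrong amount"])[0])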

Anyway, we used a TTS engine before OpenAI - it was AI based. It HAD to be AI based, as even for a niche language some people couldn't tell it was a computer. Well, from some phrases you can tell, but it is very high quality and correctly knows which parts of a word to put the emphasis on.

https://play.ht/ if anyone is wondering.



Automatic captions have been transformative in terms of accessibility, and seem to be something people universally want. Most people don't think of it as AI though, even when it is LLM software creating the captions. There are many more ways that AI tools could be embedded "invisibly" into our day-to-day lives, and I expect they will be.


To be clear, it's not LLMs creating the captions. Whisper[0], one of the best of its kind currently, is a speech recognition model, not a large language model. It's trained on audio, not text, and it can run on your mobile phone.

It's still AI, of course. But there is a distinction between it and an LLM.

[0] https://github.com/openai/whisper/blob/main/model-card.md


It's an encoder-decoder transformer trained on audio (language?) and transcriptions.

Seems kinda weird for it not to meet the definition in an almost tautological way, even if it's not the typical sense and it doesn't tend to be used for autoregressive token generation?


Is it Transformer-based? If not, then it's a different beast architecturally.

Audio models tend to be based more on convolutional layers than Transformers in my experience.


The openai/whisper repo and the paper referenced by the model card seem to say it's transformer-based.


Whisper is an encoder-decoder transformer. The input is audio spectrograms, the output is text tokens. It is an improvement over old-school transcription methods because it's trained on audio transcripts, so it makes contextually plausible predictions.

Idk what the definition of an LLM is, but it's indisputable that the technology behind Whisper is a close cousin of text decoders like GPT. Imo the more important question is how these things are used in the UX. Decoders don't have to be annoying; that is a product choice.
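For anyone curious, a minimal transcription sketch with the openai-whisper Python package (the filename is a placeholder):

    import whisper  # pip install openai-whisper

    model = whisper.load_model("base")      # weights download on first use
    result = model.transcribe("audio.mp3")  # log-mel spectrogram in, text out
    print(result["text"])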


Whisper is a great random word generator when you use it on Italian!


Do you have an example of a good implementation of AI captions? I've only experienced them on YouTube, and there they are really bad. The automatic dubbing is even worse, but still.

On second thought this probably depends on the caption language.


I'm not going to defend the youtube captions as good, but even still, I find them incredibly helpful. My hearing is fine, but my processing is rubbish, and having a visual aid to help contextualize the sound is a big help, even when they're a bit wrong.

Your point about the caption language is probably right, though. It's worse with jargon or proper names, and worse with non-American English speakers. If they don't even get all the common accents of English right, I have little hope for other languages.


Automatic translation famously fails catastrophically with Japanese, because it's a language that heavily depends on implied rather than explicit context.

The minimal grammatically correct sentence is simply a verb (or a predicate adjective), and it's an exercise for the reader to know what the subject and object are expected to be. (Essentially, the more formal/polite you get, the more things are added. You could say "kore wa atsui desu" to mean "this is hot." But you could also just say "atsui," which could also be interpreted as a question instead of a statement.)

Chinese seems to have similar issues, but I know less about how it's structured.

Anyway, it's really nice when Japanese music on YouTube includes a human-provided translation as captions. Automated ones are useless, when they don't give up entirely.


I assume people are talking about transcription, not translation. Translation on YouTube, in my experience, is indeed horrible in all the languages I have tried, but transcription in English is good enough to be useful. However, the more technical jargon a video uses, the worse the transcription gets (and translation is totally useless for anything technical there).


Automatic transcription in English depends heavily on accent, sound quality, and how well the speaker articulates. It will often mistake words that sound alike, producing nonsensical sentences, randomly skip words, or just insert random words for no clear reason.

It does seem to do a few clever things. For lyrics, it seems to look for existing transcribed lyrics first before making its own guesses (timing, however, can be quite bad when it does this). Outside of that, an AI-transcribed video is like an alien who has read a book on a dead language and is transcribing based on what the book says the words should sound like phonetically. At times that can be good enough.

(A note on sound quality: it's not the perceived quality. Many low-res videos have perfectly acceptable, if somewhat lossy, sound quality, but the transcriber goes insane. It seems to prefer 1080p videos, with what I assume is a much higher bit rate for the sound.)


In the cases where I have noticed the transcription being bad, my own speech comprehension is even worse, so I still find it useful. It is not a substitute for human-created (or at least human-curated) subtitles by any means, but it's better than nothing.


Do you have an example? YT captions being useless is a common trope I keep seeing on reddit that is not reflected in my experience at all. Feels like another "omg so bad" hyperbole that people just dogpile on, but I'd love to be proven wrong.


Captions seem to have been updated sometime between 7 and 15 months ago. Here's a reddit post from 7 months ago noticing the update: https://www.reddit.com/r/youtube/comments/1kd9210/autocaptio...

and here's Jeff Geerling 15 months ago showing how to use Whisper to make dramatically better captions: https://www.youtube.com/watch?v=S1M9NOtusM8

I assume Google has finally put some of their multimodal LLM work to good use. Before that, they were embarrassingly bad.


Interesting. I wonder if the people saying they are useless are basing that on experiences from before that update and have had them turned off since.


There are projects that will run Whisper or another transcription model locally on your computer, with great quality. For whatever reason, Google chooses not to use its highest-quality transcription models on YouTube, maybe due to cost.
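For example, with the openai-whisper command-line tool (model size and filename are placeholders):

    pip install openai-whisper
    # transcribe locally and write an .srt caption file
    whisper episode.mp3 --model medium --language en --output_format srt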


I use Whisper running locally for automated transcription of many hours of audio on a daily basis.

For the most part, Whisper does much better than stuff I've tried in the past, like Vosk. That said, it makes a somewhat annoying kind of error that I never really experienced with the others.

When the audio is low quality for a moment, it might misinterpret a word. That's fine, any speech recognition system will do that. The problem with Whisper is that the misinterpreted word can affect the next word, or several words. It's trying to align the next bits of audio syntactically with the mistaken word.

With older systems, you'd get a nonsense word where the noise was, but the rest of the transcription would be unaffected. With Whisper, you may get a series of words that completely diverges from the audio. I can look at the start of the divergence and recognize the phonetic similarity that created the initial error. The following words may not be phonetically close to the audio at all.
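One knob that may help here: openai-whisper's condition_on_previous_text option stops the decoder from being prompted with the previous window's output, which can keep one misheard word from steering later windows off course (within a window the decoder still conditions on its own tokens). A sketch, with a placeholder filename:

    import whisper

    model = whisper.load_model("base")
    # Don't feed the prior window's text back in as a prompt, trading
    # some cross-window consistency for less error propagation.
    result = model.transcribe("noisy.mp3", condition_on_previous_text=False)
    print(result["text"])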


Try Parakeet; it's more state of the art these days. There are others too, like Meta's omnilingual one.
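If you want to try it, the usual route is NVIDIA's NeMo toolkit; a sketch, assuming the nvidia/parakeet-tdt-0.6b-v2 checkpoint and noting that the exact output format varies across NeMo versions:

    import nemo.collections.asr as nemo_asr  # pip install "nemo_toolkit[asr]"

    # Downloads the checkpoint from Hugging Face on first use.
    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2")
    output = model.transcribe(["audio.wav"])  # placeholder filename
    print(output[0].text)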


Ah yes, one of the standard replies whenever anyone mentions a way that an AI thing fails: "You're still using [X]? Well of course, that's not state of the art, you should be using [Y]."

You don't actually state whether you believe Parakeet is susceptible to the same class of mistakes...


¯\_(ツ)_/¯

I haven't seen those issues myself in my usage, it's just a suggestion, no need to be sarcastic about it.


It's an extremely common goalpost-moving pattern on HN, and it adds little to the conversation without actually addressing how or whether the outcome would be better.


Try it, or don't. Due to the nature of generative AI, what might be an issue for me might not be an issue for you, especially if we have differing use cases, so no one can give you the answer you seek except for yourself.


I doubt that people prefer automatic captions over human-made ones, any more than people prefer AI subtitles. The big AI-subtitle controversy going on right now in anime demonstrates well that quite a lot is lost in translation when an AI is guessing which words are most likely in a situation, compared to a human making a translation.

What people want is something that is better than nothing, and in that sense I can see how automatic captions are transformative in terms of accessibility.


For a few days now I'm getting a super cringe robot voice force-dubbing every YouTube video into Dutch. I use it without being logged in and hate it a lot.

Subtitles are good, though.


ML has been around for ages. Email spam filters are one of the oldest examples.
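A minimal sketch of that classic kind of filter with scikit-learn's naive Bayes (the example messages are made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Made-up training data: emails labeled spam (1) or ham (0).
    emails = ["win a free prize now", "meeting moved to 3pm",
              "cheap pills limited offer", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]

    spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
    spam_filter.fit(emails, labels)

    print(spam_filter.predict(["free offer, claim your prize"]))  # [1] = spam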

These days, when the term "AI" is thrown around, the person is usually talking about large language models, or generative adversarial networks for things like image generation, etc.

Classification is a wonderful application of ML that long predates LLMs. And LLMs have their purpose and niche too, don't get me wrong; I use them all the time. But AI right now is a complete hype train, with companies trying to shove LLMs into absolutely anything and everything. Although I use LLMs, I have zero interest in an "AI PC" or an "AI web browser," any more than I have a need for an AI toaster oven.

Thank god companies have finally gotten the message about "smart appliances." I wish "dumb televisions" were more common, but for a while it was looking like you couldn't buy a freakin' dishwasher that didn't have WiFi, an app, and a bunch of other complexity-adding "features" that are neither required nor desired by most customers.



