I’ve never understood why voice recognition has always attempted complete understanding of arbitrary input rather than following a simple command language, e.g. <subject> <parameters> <action>. It could be made completely reliable with current tech (even a decade ago, really) just by minimizing the possibility space… and I’m pretty sure consumers could trivially learn it, as long as the designers don’t go full pseudo-programming-language mode
And “Computer, execute program alpha beta seven” would be the power user version of it
We should already be at “computer, Earl Grey, hot” today
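A toy sketch of the kind of fixed grammar I mean, with commas standing in for spoken pauses. The device names and vocabulary are made up, it’s just to show the shape:

    # Toy parser for a fixed "<subject>, <parameters>, <action>" command
    # language. All of the vocabulary below is invented for illustration.
    KNOWN_SUBJECTS = {"lights", "thermostat", "replicator"}
    KNOWN_ACTIONS = {"on", "off", "set", "brew"}

    def parse(utterance: str):
        parts = [p.strip().lower() for p in utterance.split(",")]
        if len(parts) != 3:
            return None
        subject, parameters, action = parts
        # Reject out-of-grammar input instead of guessing at intent.
        if subject not in KNOWN_SUBJECTS or action not in KNOWN_ACTIONS:
            return None
        return subject, parameters, action

    print(parse("replicator, earl grey hot, brew"))   # ('replicator', 'earl grey hot', 'brew')
    print(parse("please make me some tea"))           # None: rejected, not misheard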
Years ago I used a program with that approach for a space sim. Basically it would only recognize voice commands you defined beforehand, which made it very reliable at picking the right one: it just had to find the closest match within a limited set of options, and would then simulate the associated key inputs.
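Roughly how I’d expect that to work, as a minimal sketch (not the actual tool): only a closed set of predefined commands is ever matched, using plain string similarity, with the key simulation swapped for a print:

    import difflib

    # Predefined voice commands mapped to actions; the prints stand in for
    # the simulated key presses the real tool would send to the game.
    COMMANDS = {
        "deploy landing gear":  lambda: print("pressing G"),
        "request docking":      lambda: print("pressing D"),
        "toggle flight assist": lambda: print("pressing Z"),
    }

    def handle_transcript(transcript: str) -> None:
        # Only accept input close enough to a known command; the small,
        # closed vocabulary is what makes recognition reliable.
        match = difflib.get_close_matches(transcript.lower(), list(COMMANDS), n=1, cutoff=0.6)
        if match:
            COMMANDS[match[0]]()
        else:
            print(f"no matching command for: {transcript!r}")

    handle_transcript("deploy the landing gear")   # close enough -> pressing G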
Meanwhile when I tried Android's voice-based text input it was a catastrophe, as my accent completely threw it off. Felt like it was trained exclusively on native English speakers. Not to mention the difficulty such systems have when you mix languages, as tends to happen.
This is an annoyance that Linus from LTT constantly brings up. The voice assistants split recognition from mapping to commands, which results in lots of mistakes that should never happen. If you say "call XYZ", the result would be so much better if the phone first tried to figure out whether any of the existing contacts sounds like XYZ.
Limiting the options rather than making the system super generic would help in so many cases.
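Something like this sketch, where a (made-up) contact list is the only search space the recognizer is allowed to resolve a name into:

    import difflib

    # Hypothetical contact list; a real assistant would pull this from the phone.
    CONTACTS = ["Alice Johnson", "Bob Smith", "Karol Nowak"]

    def handle_utterance(utterance: str) -> str:
        # Only one command shape is handled here: "call <name>".
        if not utterance.lower().startswith("call "):
            return "not a call command"
        spoken_name = utterance[5:]
        # Resolve the name against contacts that actually exist, instead of
        # transcribing free text and then searching for a literal string.
        match = difflib.get_close_matches(spoken_name, CONTACTS, n=1, cutoff=0.5)
        return f"calling {match[0]}" if match else f"no contact sounds like {spoken_name!r}"

    print(handle_utterance("call Carol Novak"))   # -> calling Karol Nowak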
> I’ve never understood why voice recognition has always attempted to be complete understanding of arbitrary input, rather than follow a simple command language
Because the UI affordances (in this case the control language) wouldn’t be discoverable or memorable across a large range of devices or apps. Moreover, speaking is an activity that allows for an arbitrary range of symbol patterns, and a feedback loop between two parties in dialog can resolve complex matters even though they start from different positions.
I mean, the current state effectively is an undiscoverable control language: somewhat flexible, but generally unreliable unless you restrict yourself to very specific language, language that differs based on the task being executed, often with similar-but-different formats required to do similar actions
I’d argue that if the current state is at all acceptable, then a consistent, teachable, specific language format would be an improvement in every way. And you can have an “actual” feedback loop, because with a much more limited set of valid inputs your errors can be much more precise (and made human-friendly, not, I think, merely programmer-friendly).
As it stands, I’ve never managed a dialogue with Siri/Alexa; it either ingests my input correctly, rejects it as an invalid action, does something completely wrong, or produces a “could not understand… did you mean <gibberish>?”.
Having the smart-AI dialogue would be great if I could have it, but for the last decade that simply hasn’t been a thing that occurs. Perhaps with GPT and its peers, but afaik GPT doesn’t have a response->object model that could be actioned on, so the conversation would sound smoother but be just as incompetent at actually understanding whatever you’re looking to do. I think this is basically the “sufficiently smart compiler” problem, which never comes to fruition in practice
Close your eyes and imagine that CLI system is instead voice / dialog based. The tedium. For bonus points, imagine you’re in a space shared with others. Doesn’t work that well…
What? No, I think it'd be great! I'd love to be able to say out loud "kube get pods pipe grep service" and have the output printed on the terminal. I _don't_ want to say "Hey Google, list the pods in kubernetes and look for customer service".
The mapping between what I say and what I'd type is nearly one-to-one. It starts becoming more complex once you need countless flags, but again, a structured approach can fix that.
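Even a fixed spoken-token table would get a lot of the way there; the word list below is invented just to show the idea, not taken from any real dictation tool:

    # Toy mapping from spoken words to shell tokens.
    SPOKEN_TOKENS = {
        "pipe": "|",
        "kube": "kubectl",
        "dash dash namespace": "--namespace",
    }

    def spoken_to_shell(utterance: str) -> str:
        command = utterance.lower()
        for spoken, token in SPOKEN_TOKENS.items():
            command = command.replace(spoken, token)
        return command

    print(spoken_to_shell("kube get pods pipe grep service"))
    # -> kubectl get pods | grep service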
The problem is that’s not the only format they work on, and because input format is largely unconstrained, when they misunderstand, they catastrophically misunderstand.
It’s just like the image recognition ML issue, where a model correctly recognizes a cat, but change three specific pixels and it has 99% confidence it’s an ostrich.
Or JavaScript equality. If you do it right, it’s right, but otherwise anything goes.
Probably the divide between technical users and non-technical. You and I find that structure completely logical. But less structured natural language with a million ways to ask a certain thing puts it practically in reach of the remainder of the population.