I’ve never understood why voice recognition has always attempted complete understanding of arbitrary input rather than following a simple command language, e.g. <subject> <parameters> <action>. It could be made completely reliable with current tech (even a decade ago, really) just by minimizing the possibility space… and I’m pretty sure consumers could trivially learn it, as long as the designers don’t go full pseudo-programming-language mode
And “Computer, execute program alpha beta seven” would be the power user version of it
We should already be at “computer, Earl Grey, hot” today
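A toy sketch of the kind of fixed grammar I mean, with commas standing in for spoken pauses. The device names and vocabulary are made up, it’s just to show the shape:

    # Toy parser for a fixed "<subject>, <parameters>, <action>" command
    # language. All of the vocabulary below is invented for illustration.
    KNOWN_SUBJECTS = {"lights", "thermostat", "replicator"}
    KNOWN_ACTIONS = {"on", "off", "set", "brew"}

    def parse(utterance: str):
        parts = [p.strip().lower() for p in utterance.split(",")]
        if len(parts) != 3:
            return None
        subject, parameters, action = parts
        # Reject out-of-grammar input instead of guessing at intent.
        if subject not in KNOWN_SUBJECTS or action not in KNOWN_ACTIONS:
            return None
        return subject, parameters, action

    print(parse("replicator, earl grey hot, brew"))   # ('replicator', 'earl grey hot', 'brew')
    print(parse("please make me some tea"))           # None: rejected, not misheard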
Years ago I used a program with that approach for a space sim. Basically it would only recognize voice commands you defined beforehand, which made it very reliable at picking the right one: it just had to find the closest match within a limited set of options, and would then simulate the associated key inputs.
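Roughly how I’d expect that to work, as a minimal sketch (not the actual tool): only a closed set of predefined commands is ever matched, using plain string similarity, with the key simulation swapped for a print:

    import difflib

    # Predefined voice commands mapped to actions; the prints stand in for
    # the simulated key presses the real tool would send to the game.
    COMMANDS = {
        "deploy landing gear":  lambda: print("pressing G"),
        "request docking":      lambda: print("pressing D"),
        "toggle flight assist": lambda: print("pressing Z"),
    }

    def handle_transcript(transcript: str) -> None:
        # Only accept input close enough to a known command; the small,
        # closed vocabulary is what makes recognition reliable.
        match = difflib.get_close_matches(transcript.lower(), list(COMMANDS), n=1, cutoff=0.6)
        if match:
            COMMANDS[match[0]]()
        else:
            print(f"no matching command for: {transcript!r}")

    handle_transcript("deploy the landing gear")   # close enough -> pressing G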
Meanwhile when I tried Android's voice-based text input it was a catastrophe, as my accent completely threw it off. Felt like it was trained exclusively on native English speakers. Not to mention the difficulty such systems have when you mix languages, as tends to happen.
This is an annoyance that Linus from LTT constantly brings up. The voice assistants split recognition from mapping to commands, which results in lots of mistakes that should never happen. If you say "call XYZ", the result would be so much better if the phone first tried to figure out whether any of the existing contacts sounds like XYZ.
Limiting the options rather than making the system super generic would help in so many cases.
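Something like this sketch, where a (made-up) contact list is the only search space the recognizer is allowed to resolve a name into:

    import difflib

    # Hypothetical contact list; a real assistant would pull this from the phone.
    CONTACTS = ["Alice Johnson", "Bob Smith", "Karol Nowak"]

    def handle_utterance(utterance: str) -> str:
        # Only one command shape is handled here: "call <name>".
        if not utterance.lower().startswith("call "):
            return "not a call command"
        spoken_name = utterance[5:]
        # Resolve the name against contacts that actually exist, instead of
        # transcribing free text and then searching for a literal string.
        match = difflib.get_close_matches(spoken_name, CONTACTS, n=1, cutoff=0.5)
        return f"calling {match[0]}" if match else f"no contact sounds like {spoken_name!r}"

    print(handle_utterance("call Carol Novak"))   # -> calling Karol Nowak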
> I’ve never understood why voice recognition has always attempted to be complete understanding of arbitrary input, rather than follow a simple command language
Because the UI affordances (in this case the control language) wouldn’t be discoverable or memorable across a large range of devices or apps. Moreover, speaking is an activity that allows for an arbitrary range of symbol patterns, and a feedback loop between two parties in dialog can resolve complex matters even though they start from different positions.
I mean, the current state effectively is an undiscoverable control language: somewhat flexible, but generally unreliable unless you restrict yourself to very specific language, language that differs based on the task being executed, often with similar-but-different formats required to do similar actions
I’d argue that if the current state is at all acceptable, then a consistent, teachable, specific language format would be an improvement in every way. And you can have an “actual” feedback loop, because with a much more limited set of valid inputs your errors can be much more precise (and made human-friendly, not, I think, merely programmer-friendly).
As it stands, I’ve never managed a dialogue with Siri/Alexa; it either ingests my input correctly, rejects it as an invalid action, does something completely wrong, or produces a “could not understand… did you mean <gibberish>?”.
Having the smart-AI dialogue would be great if I could have it, but for the last decade that simply hasn’t been a thing that occurs. Perhaps with GPT and its peers, but afaik GPT doesn’t have a response->object model that could be actioned on, so the conversation would sound smoother but be just as incompetent at actually understanding whatever you’re looking to do. I think this is basically the “sufficiently smart compiler” problem, which never comes to fruition in practice
Close your eyes and imagine that CLI system is instead voice / dialog based. The tedium. For bonus points, imagine you’re in a space shared with others. Doesn’t work that well…
What? No, I think it'd be great! I'd love to be able to say out loud "kube get pods pipe grep service" and have the output printed on the terminal. I _don't_ want to say "Hey Google, list the pods in kubernetes and look for customer service".
The mapping between what I say and what I'd type is nearly one-to-one. It starts becoming more complex once you need countless flags, but again, a structured approach can fix that.
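Even a fixed spoken-token table would get a lot of the way there; the word list below is invented just to show the idea, not taken from any real dictation tool:

    # Toy mapping from spoken words to shell tokens.
    SPOKEN_TOKENS = {
        "pipe": "|",
        "kube": "kubectl",
        "dash dash namespace": "--namespace",
    }

    def spoken_to_shell(utterance: str) -> str:
        command = utterance.lower()
        for spoken, token in SPOKEN_TOKENS.items():
            command = command.replace(spoken, token)
        return command

    print(spoken_to_shell("kube get pods pipe grep service"))
    # -> kubectl get pods | grep service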
The problem is that’s not the only format they work on, and because input format is largely unconstrained, when they misunderstand, they catastrophically misunderstand.
It’s just like the image recognition ML issue, where a model correctly recognizes a cat, but change three specific pixels and it has 99% confidence it’s an ostrich.
Or JavaScript equality. If you do it right, it’s right, but otherwise anything goes.
Probably the divide between technical users and non-technical. You and I find that structure completely logical. But less structured natural language with a million ways to ask a certain thing puts it practically in reach of the remainder of the population.