The 8s latency would be absolutely intolerable to me. When experimenting, even getting the speech recognition latency low enough not to be a nuisance is already a problem.
I'd be inclined to put a bunch of simple grammar-based rules in front of the LLM to handle simple/obvious cases without passing them to the LLM at all, to at least reduce the number of cases where the latency is high...
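Roughly something like this, as a sketch - the patterns, the intent dicts and the ask_llm fallback are just placeholders for whatever your setup actually uses:

    import re

    # A handful of hand-written patterns covering the obvious cases.
    # Anything that doesn't match falls through to the (slow) LLM call.
    RULES = [
        (re.compile(r"^turn (on|off) (the )?(?P<target>.+)$", re.I),
         lambda m: {"intent": "switch", "state": m.group(1).lower(),
                    "target": m.group("target")}),
        (re.compile(r"^what('s| is) the time\??$", re.I),
         lambda m: {"intent": "time"}),
    ]

    def handle_utterance(text, ask_llm):
        for pattern, build in RULES:
            m = pattern.match(text.strip())
            if m:
                return build(m)      # fast path, no LLM round-trip
        return ask_llm(text)         # slow path for everything else

Even a couple of dozen rules like that would likely cover a big chunk of the day-to-day traffic.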
Perhaps. Certainly worth trying, but a query like that is also ripe for short-circuiting with templates. For more complex queries it might well be very helpful, though - every little bit helps.
Another thing worth considering in that respect is that ChatGPT, at least, understands grammars perfectly well. You can give it a BNF grammar and ask it to follow it, and while it won't do so perfectly, tools like LangChain (or you can roll this yourself) let you force the LLM to follow the grammar precisely. Combine the two and you can give it requests like "translate the following sentence into this grammar: ...".
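If you roll it yourself, the simplest version is just prompt-plus-validation: put the grammar in the prompt, then check the model's output actually parses before trusting it. A rough sketch, with Lark standing in as the parser, a toy grammar, and llm() as a placeholder for whatever client you call the model with (a proper constrained-decoding tool would replace the validation step):

    from lark import Lark

    # Tiny example grammar; the real one would mirror whatever
    # commands your backend actually accepts.
    GRAMMAR = r"""
        command: "turn" STATE DEVICE | "dim" DEVICE "to" NUMBER
        STATE: "on" | "off"
        DEVICE: /[a-z_]+/
        NUMBER: /[0-9]+/
        %import common.WS
        %ignore WS
    """
    parser = Lark(GRAMMAR, start="command")

    def to_command(sentence, llm):
        prompt = (
            "Translate the following request into a single line matching "
            f"this grammar, and output nothing else:\n{GRAMMAR}\n"
            f"Request: {sentence}"
        )
        candidate = llm(prompt).strip()
        try:
            parser.parse(candidate)   # check it really conforms
            return candidate
        except Exception:             # parse failure: reject/retry/fall back
            return None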
I'd also simply cache every input/output pair, at least outside of longer conversations, as I suspect people will get into the habit of saying certain things and using certain words. E.g. even with the constraints of Alexa, there are many things for which I use a much more constrained set of phrases than it can handle - sometimes just out of habit, sometimes because the voice recognition is more likely to pick certain words up correctly. E.g. I say "turn off downstairs" to turn off everything downstairs before going to bed, and I'm not likely to vary that much. A guest might, but a very large proportion of my requests to Alexa uses maybe 10% of even its constrained vocabulary - a delay is much more tolerable if it's only for a steadily diminishing set of outliers as you cache more and more...
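Something as dumb as an exact-match cache keyed on the normalised transcript would probably already catch a lot of that. Only sensible for single-turn requests where the answer doesn't depend on earlier context, and the file name here is arbitrary:

    import json, os

    CACHE_PATH = "llm_cache.json"

    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            _cache = json.load(f)
    else:
        _cache = {}

    def cached_ask(text, ask_llm):
        key = " ".join(text.lower().split())   # normalise case/whitespace
        if key in _cache:
            return _cache[key]                 # instant answer for habitual phrases
        answer = ask_llm(text)
        _cache[key] = answer
        with open(CACHE_PATH, "w") as f:       # persist across restarts
            json.dump(_cache, f)
        return answer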
(A log like that would also potentially be great for seeing whether you could produce new rules - even have the LLM try to produce them - or fine-tune a smaller/faster model as a 'first pass'. You might even be able to start both in parallel and return early if the first one returns something coherent, assuming you can manage to train it to say "don't know" for queries that are too complex.)
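The parallel-start part is cheap to wire up. E.g., all the names here being placeholders - small_model the fine-tuned first pass, big_model the full LLM, and is_coherent() whatever check you trust, including a literal "don't know" sentinel:

    from concurrent.futures import ThreadPoolExecutor

    def ask_with_fallback(text, small_model, big_model, is_coherent):
        # Fire both off at once; return the small model's answer early
        # if it looks usable, otherwise wait for the big one.
        with ThreadPoolExecutor(max_workers=2) as pool:
            small_f = pool.submit(small_model, text)
            big_f = pool.submit(big_model, text)
            small = small_f.result()
            if is_coherent(small):
                big_f.cancel()    # best-effort; it may already be running
                return small
            return big_f.result()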