The 8s latency would be absolutely intolerable to me. Queen experimenting, even ...

alright2565 · on Jan 14, 2024

Maybe it could be improved by not including all the details in the original prompt, but dynamically generating them. For example,

>user: turn my living room lights off

>llm: {action: "lights.turn_off", entity: "living room"}

Search available actions and entities using the parameters

> user: available actions: [...], available entities: [...]. Which action and target?

> llm: {service: "light.turn_off", entity: "light.living_ceiling"}

I've never used a local LLM, so I don't know what the fixed startup latency is, but this would dramatically reduce the number of tokens required.

vidarh · on Jan 14, 2024

Perhaps. Certainly worth trying, but a query like that is also ripe for short-circuiting with templates. For more complex queries it might well be very helpful, though - every little bit helps.

Another thing worth considering in that respect is that ChatGPT at least understands grammars perfectly well. You can give it a BNF grammar and ask it to follow it, and while it won't do so perfectly, tools like LangChain (or you can roll this yourself), lets you force the LLM to follow the grammar precisely. Combine the two and you can give it requests like "translate the following sentence into this grammar: ...".

I'd also simply cache every input/output pairs, at least outside of longer conversations, as I suspect people will get into the habit of saying certain things, and using certain words - e.g. even with the constraint of Alexa, there are many things I use a much more constrained set of phrases than it can handle for, sometimes just out of habit, sometimes because the voice recognition is more likely to correctly pick up certain words. E.g. I say "turn off downstairs" to turn off everything downstairs before going to bed, and I'm not likely to vary that much. A guest might, but a very large proportion of my requests for Alexa uses maybe 10% of even its constrained vocabulary - a delay is much more tolerable if it's for a steadily diminishing set of outliers as you cache more and more...

(A log like that would also potentially be great to see if you could maybe either produce new rules - even have the LLM try to produce rules - or to fine-tune a smaller/faster model as a 'first pass' - you might even be able to start both in parallel and return early if the first one returns something coherent, assuming you can manage to train it to go "don't know" for queries that are too complex)