Do you seriously think a typical contemporary LLM would screw up 33% of vending machine orders?

I don't know which benchmark you're looking at, but I'm sure its questions were more complicated than the logic inside a vending machine.

Why don't you just try it out? It's easy to simulate: tell the bot about the task, explain which action to perform in each situation, then feed it some user input and see whether it works.
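Something like the rough Python sketch below would do for a quick check. Everything in it is made up for illustration: the slot names and prices, the reply format, and the `ask_model` callable, which you'd replace with a call to whatever LLM you want to test. The stand-in at the bottom is not a model, it only lets the harness run on its own.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical price list (slot -> price in cents), invented for this sketch.
PRICES = {"A1": 150, "A2": 200, "B1": 125}

# The task description and the rules, as they would be given to the bot.
SYSTEM_PROMPT = (
    "You control a vending machine. Prices in cents: "
    + ", ".join(f"{slot}={price}" for slot, price in PRICES.items())
    + ". For input 'slot=<slot> inserted=<cents>' reply with exactly one line: "
    "'DISPENSE <slot> CHANGE <cents>' if the slot exists and enough money was "
    "inserted, otherwise 'REFUND <cents>'."
)


def expected_action(slot: str, inserted: int) -> str:
    """Reference logic that the bot's reply is compared against."""
    price = PRICES.get(slot)
    if price is None or inserted < price:
        return f"REFUND {inserted}"
    return f"DISPENSE {slot} CHANGE {inserted - price}"


def error_rate(
    ask_model: Callable[[str, str], str],
    cases: Iterable[Tuple[str, int]],
) -> float:
    """Fraction of cases where the reply differs from the reference logic."""
    cases = list(cases)
    wrong = 0
    for slot, inserted in cases:
        reply = ask_model(SYSTEM_PROMPT, f"slot={slot} inserted={inserted}")
        if reply.strip().upper() != expected_action(slot, inserted):
            wrong += 1
    return wrong / len(cases)


# A few hand-picked orders: overpaid, exact, underpaid, unknown slot.
CASES = [("A1", 200), ("A2", 200), ("B1", 100), ("Z9", 300)]


def rule_based_stand_in(system_prompt: str, user_input: str) -> str:
    """Placeholder for a real LLM call; it just parses the input and applies the rules."""
    fields = dict(part.split("=") for part in user_input.split())
    return expected_action(fields["slot"], int(fields["inserted"]))


if __name__ == "__main__":
    print(f"error rate: {error_rate(rule_based_stand_in, CASES):.0%}")
```

Swap `rule_based_stand_in` for a function that sends the two strings to an actual model, add more cases, and you have a measured error rate instead of a guess.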