> Prompt injection works because LLMs are dumber than humans at keeping secrets ...

cartoonworld · on May 13, 2023

There is an interesting scene in the 1974 film "Darkstar". The crew of an intergalactic geoengineering vessel discover that one of their sentient, computer controlled smartbmbs (vast nuke) has recieved an erroneous message to detonate. The ship computer is able to convince the bomb that is malfunctioning, and it returns to its bay. But a second error leaves the bomb convinced it should explode, leaving crew members to the task of talking a sentient nuclear bomb out of self destructing.

"Prompt Injection Classifiers" is starting to look like the halting problem from a certain angle.

The author mentions that is will likely be far, far more difficult to create a classifier that correctly validates user input than to create the models because the space of possible inputs is extremely large, among other reasons. Someone has to somehow validate all human conversation, small talk and what is essentially sophistry against a naive AI agent.

I suspect its gonna take manual analysis to reveal the kind of prompt injection that could lead to exposing user information like the author is addressing. I don't think that AI will be able to sanitize input for AI without huge amounts of manual testing. I find it unlikely that input validation is going to work very well if at all on this kind of user input.