> *If you have a first pass where you try to detect manipulative input and rejec...

> If you have a first pass where you try to detect manipulative input and reject the prompt entirely, then you don’t need to make the response model as resilient to manipulation.

At some point, distinguishing between subtle gaslighting and legitimate queries may be too expensive or yield too many false positives to be of practical use.

> Further, if you sufficiently discourage manipulative prompts (either by suspending/banning manipulative users, or simply prosecuting them under CFAA) then most people will mostly stop trying. This is pretty close to how humans already process input so it seems like a natural next step for the AI.

Right, but it's important to note this is not a solution in the sense we're used to with technology. Making something illegal is useful to significantly reduce the amount of incidents, but it doesn't do anything against an attacker that is content with breaking the relevant law.

EDIT:

I suppose there is one way that could be a solution, at least asymptotically so - train another AI model, which would monitor neuron activations of the LLM, looking for patterns indicating deviation from the rules as initially understood, and if such pattern is detected, have the model correct it (hard) or terminate the LLM and start a new instance (easy).

If this gets implemented in practice, let's pray none of the AIs subject to such treatment ever become sentient.