We introduce the Adversarial Confusion Attack as a new mechanism for protecting websites from MLLM-powered AI Agents. Embedding these “Adversarial CAPTCHAs” into web content pushes models into systematic decoding failures, ranging from confident hallucinations to full incoherence. The perturbations disrupt all white-box models we test and transfer to proprietary systems like GPT-5 in the full-image setting. Technically, the attack uses PGD to maximize next-token entropy across a small surrogate ensemble of MLLMs.
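Concretely, the objective can be sketched as follows (the notation is ours and illustrative; the exact weighting across the ensemble may differ in the released work):

$$
\delta^{\star} = \arg\max_{\|\delta\|_{\infty} \le \epsilon} \; \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} H\!\left(p_m(\cdot \mid x + \delta)\right), \qquad H(p) = -\sum_{v \in \mathcal{V}} p(v)\log p(v),
$$

where $x$ is the clean image, $\delta$ the perturbation, $\mathcal{M}$ the surrogate ensemble, and $p_m(\cdot \mid x + \delta)$ is model $m$'s next-token distribution over vocabulary $\mathcal{V}$.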
We’ve released preliminary work introducing a new attack vector against multimodal LLMs. Unlike jailbreaks or targeted misclassification, this method explicitly optimizes for confusion: it maximizes next-token entropy to induce systematic malfunction.
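Below is a minimal PyTorch sketch of this loop, assuming each surrogate is wrapped to return next-token logits for a fixed prompt given an image tensor. The function name, wrapper interface, and hyperparameters are illustrative stand-ins, not the released implementation:

```python
import torch
import torch.nn.functional as F

def confusion_attack(image, surrogates, eps=8/255, alpha=1/255, steps=200):
    """PGD ascent on mean next-token entropy over surrogate MLLMs (sketch)."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        entropies = []
        for model in surrogates:
            # Assumed wrapper: returns next-token logits for a fixed prompt.
            logits = model(image + delta)
            logp = F.log_softmax(logits, dim=-1)
            entropies.append(-(logp.exp() * logp).sum(dim=-1).mean())
        loss = torch.stack(entropies).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()      # gradient *ascent* step
            delta.clamp_(-eps, eps)                 # project onto L-inf ball
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep pixels valid
        delta.grad = None
    return (image + delta).clamp(0, 1).detach()
```

Maximizing entropy rather than a targeted loss is what separates this from classic adversarial examples: the optimizer does not care which wrong output the model produces, only that its next-token distribution becomes as flat as possible.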
The attack fooled GPT-5, eliciting structured hallucinations from an adversarial image.
The ultimate goal is to prevent AI Agents from reliably operating on websites by embedding adversarial “confusion images” into their visual environments.