Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

No this is pretty much solved at this point. You simply have a secondary model/agent act as an arbitrator for every user input. The user input gets preprocessed into a standardized, formatted text representation (not a raw user message), and the arbitrator flags attempts at jailbreaking, prior to the primary agent/workflow being able to act on the user input.



That doesn't work either. It's always possible to come up with an attack which subverts the "moderator" model first.

Using non-deterministic AI to protect against attacks against non-deterministic AI is a bad approach.


So you just need another agent to review the data being passed to the protector agent. Easy-peasy.

Use my openAI referral code #LETITRAIN for 10% off!




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: