Is this an OpenAI attempt to gather more insight and data while identifying actors in the AI jailbreak game? I don't want to be paranoid about this, nor devalue OP's work, but one could say that OpenAI would be very interested in the HN comments and commenters on this post.
Haha, no, I'm not part of OpenAI; you can check me out here: alexalbert.me. I suspect OpenAI actually appreciates this type of work, since it's basically crowdsourced red teaming of their models.
As someone looking to build AI features into my application, I definitely want to avoid this kind of jailbreak in my app.
Right now, there is no good way to guard against this other than removing free-form text inputs and using a more form-driven approach to collecting user input (rough sketch below).
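To make that concrete, here's a minimal sketch of what I mean by form-driven input (the option names are made up): the user picks from a whitelist, and the prompt is assembled server-side, so no user-controlled string ever reaches the model.

    # Form-driven input: the user never types free text; they pick
    # from fixed options, and the prompt is built server-side.
    ALLOWED_TOPICS = {"shipping", "returns", "billing"}
    ALLOWED_TONES = {"brief", "detailed"}

    def build_prompt(topic: str, tone: str) -> str:
        # Reject anything outside the whitelist -- no user-controlled
        # strings reach the model, so there's no injection surface.
        if topic not in ALLOWED_TOPICS or tone not in ALLOWED_TONES:
            raise ValueError("invalid selection")
        return f"Give a {tone} answer to a customer question about {topic}."

The obvious trade-off is flexibility: you lose the open-ended chat experience, but the model only ever sees prompts you wrote yourself.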
Absolutely agree. I'm creating a chatbot for my website, and while it primarily uses old-fashioned pattern matching, it does send unrecognized inputs to a stronger AI to get help forming a proper response, and I certainly don't want it offending my visitors!
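Roughly, the flow looks like the sketch below. The patterns and the system prompt are made up, and the fallback call assumes the pre-1.0 openai Python SDK; swap in whatever "stronger AI" you actually use.

    import re
    import openai  # pip install openai; assumes OPENAI_API_KEY is set

    # Cheap path: hard-coded patterns answered locally.
    CANNED = [
        (re.compile(r"\b(hours?|open)\b", re.I), "We're open 9-5, Mon-Fri."),
        (re.compile(r"\b(price|cost)\b", re.I), "Plans start at $10/month."),
    ]

    def respond(message: str) -> str:
        for pattern, answer in CANNED:
            if pattern.search(message):
                return answer
        # Fallback: send unrecognized input to the stronger model, with
        # a system prompt that pins its role so off-topic or offensive
        # requests are more likely to be declined.
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content":
                 "You are a polite website assistant. Only answer "
                 "questions about this site; politely decline "
                 "everything else."},
                {"role": "user", "content": message},
            ],
        )
        return resp["choices"][0]["message"]["content"]

The system prompt alone won't stop a determined jailbreaker, of course, which is why the grandparent's point about free-form input stands.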
I am convinced that OpenAI does not mind the jailbreak game; they could easily kill it by filtering the output. In fact, while using jailbreaks this message often shows up: "This content may violate our content policy. If you believe this to be in error, please submit your feedback — your input will aid our research in this area." It shows that they have a system in place, but they still show you the inappropriate output.
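For what it's worth, that warning appears to come from something like their moderation endpoint, which any app can call itself. A quick sketch of what hard-filtering (rather than just warning, as the ChatGPT UI does) could look like, again assuming the pre-1.0 openai Python SDK:

    import openai  # assumes OPENAI_API_KEY is set

    def is_flagged(text: str) -> bool:
        # The moderation endpoint returns per-category scores plus an
        # overall "flagged" boolean for the given text.
        result = openai.Moderation.create(input=text)
        return result["results"][0]["flagged"]

    # Drop any flagged completion instead of showing it to the user.
    reply = "...model output..."
    if is_flagged(reply):
        reply = "Sorry, I can't help with that."

So the capability to suppress jailbreak output clearly exists; showing it anyway looks like a deliberate choice.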