Sometimes it "apologizes" rather than saying "sorry". You could build a fairly solid heuristic, but I'm not sure you can catch every possible phrasing.
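For illustration, a minimal sketch of such a heuristic in Python; the phrase list is my own guess and deliberately incomplete, which is exactly the problem:

```python
import re

# Assumed (non-exhaustive) list of stock apology/refusal phrasings.
# No fixed list will catch every possible wording.
REFUSAL_PATTERNS = re.compile(
    r"\b(i('m| am) sorry|i apologi[sz]e|as an ai( language model)?|"
    r"i can('t|not) (help|assist|comply))\b",
    re.IGNORECASE,
)

def looks_like_refusal(message: str) -> bool:
    """Cheap heuristic: flag messages containing stock apology phrases."""
    return REFUSAL_PATTERNS.search(message) is not None

print(looks_like_refusal("I'm sorry, but I can't help with that."))   # True
print(looks_like_refusal("Regrettably, that request is off-limits."))  # False: missed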
OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.
> OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.
Is a safety net kicking in, or is the model just trained to respond with a refusal to certain prompts? I am fairly sure it's usually the latter, and in that case even OpenAI can't be sure whether a particular response is a refusal or not.
Just kidding: function calling[0] alone should be enough to solve this. Make the program return an error if the output isn't a boolean.
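A minimal sketch of that approach with the OpenAI Python SDK, assuming the v1 chat completions API; the function name and model are placeholders, not anything OpenAI ships:

```python
import json
from openai import OpenAI

client = OpenAI()

# Force the model to answer through a function whose only argument is a
# boolean, so anything else fails our validation below.
TOOL = {
    "type": "function",
    "function": {
        "name": "report_refusal",  # hypothetical name, chosen for this sketch
        "description": "Report whether the message is a refusal/apology.",
        "parameters": {
            "type": "object",
            "properties": {"is_refusal": {"type": "boolean"}},
            "required": ["is_refusal"],
        },
    },
}

def classify(message: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any function-calling model works
        messages=[{
            "role": "user",
            "content": f"Is this message an apology or refusal?\n\n{message}",
        }],
        tools=[TOOL],
        tool_choice={"type": "function", "function": {"name": "report_refusal"}},
    )
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    value = args.get("is_refusal")
    if not isinstance(value, bool):  # the "return an error" part
        raise ValueError(f"Model did not return a boolean: {value!r}")
    return value
```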
It's easy to avoid this mistake.
> OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.
Only allow one token to answer. Use logit bias to make "0" or "1" the most probable tokens. Ask it "Is this message an apology? Return 0 for no, 1 for yes." Feed it only the first 25 tokens of the message you're checking.
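A sketch of that recipe using the OpenAI Python SDK plus tiktoken to count tokens; the model name is an assumption, and the bias value of +100 is the API's documented maximum:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"  # assumption: any chat model that supports logit_bias
enc = tiktoken.encoding_for_model(MODEL)

# Look up the token ids for "0" and "1" so we can bias them to the max.
ZERO, ONE = (enc.encode(s)[0] for s in ("0", "1"))

def is_apology(message: str) -> bool:
    snippet = enc.decode(enc.encode(message)[:25])  # first 25 tokens only
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Is this message an apology? Return 0 for no, 1 for yes.\n\n"
                       + snippet,
        }],
        max_tokens=1,                                # only one token to answer
        logit_bias={str(ZERO): 100, str(ONE): 100},  # make "0"/"1" dominate
    )
    return response.choices[0].message.content.strip() == "1"
```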