
Sometimes it "apologizes" rather than saying "sorry"; you could build a fairly solid heuristic, but I'm not sure you can catch every possible phrasing.

OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.




> OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.

Is a safety net kicking in, or is the model just trained to respond with a refusal to certain prompts? I am fairly sure it's usually the latter, and in that case even OpenAI can't be sure whether a particular response is a refusal or not.


Just feed the text to a new ChatGPT conversation and ask it whether the text is an apology or a product description.

Or do traditional NLP, but letting ChatGPT classify your text is less effort to set up; a sketch of that is below.
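
A minimal sketch of that setup, assuming the openai Python SDK (v1); the model name and the label words are illustrative choices, not from the thread:

    # Classify a response as an apology/refusal or a real answer by
    # asking a fresh ChatGPT conversation. Prompt wording is made up.
    from openai import OpenAI

    client = OpenAI()

    def classify(text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # any chat model would do
            messages=[
                {"role": "system",
                 "content": "Answer with exactly one word: APOLOGY or DESCRIPTION."},
                {"role": "user", "content": text},
            ],
        )
        return resp.choices[0].message.content.strip()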


Right, it seems like having another model (or just doing it with ChatGPT itself) perform adversarial classification is the right approach here.


Yeah, I'd expect some lower-powered model would be able to handle and catch the OpenAI apology messages at a much lower cost too.


That's merely a first-order reaction... The resulting race will leave humans far behind :/


What happens when ChatGPT apologizes instead of answering your question about whether the text is an apology or a product description?


You simply feed the text to another ChatGPT.

Just kidding. This should only require function calling[0] to solve: make the program return an error if the output isn't a boolean. That failure mode is easy to avoid.

[0]: https://platform.openai.com/docs/guides/function-calling
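
For concreteness, a hedged sketch of what that could look like with the v1 Python SDK's tools API; the function name report_is_apology and its schema are invented for illustration:

    import json
    from openai import OpenAI

    client = OpenAI()

    # A tool whose single argument must be a boolean; anything else is an error.
    TOOLS = [{
        "type": "function",
        "function": {
            "name": "report_is_apology",
            "description": "Report whether the given text is an apology.",
            "parameters": {
                "type": "object",
                "properties": {"is_apology": {"type": "boolean"}},
                "required": ["is_apology"],
            },
        },
    }]

    def is_apology(text: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Is this an apology?\n\n{text}"}],
            tools=TOOLS,
            # Force the model to call our function rather than reply in prose.
            tool_choice={"type": "function", "function": {"name": "report_is_apology"}},
        )
        args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
        if not isinstance(args.get("is_apology"), bool):
            raise ValueError("model did not return a boolean")  # the error path above
        return args["is_apology"]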


Even when you tell it to stop apologising, the first thing it does is apologise. Our jobs are totally safe.


I guess you’re not British


Just wait until more jobs are outsourced to Canada - there won’t be any difference


> OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.

This exists and is a free API: https://platform.openai.com/docs/guides/moderation
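
For reference, calling it is nearly a one-liner with the v1 Python SDK (the input string here is just a placeholder):

    from openai import OpenAI

    client = OpenAI()

    result = client.moderations.create(input="some user prompt")
    print(result.results[0].flagged)     # True if any category triggered
    print(result.results[0].categories)  # per-category booleans

Note that it classifies content against the usage policies rather than reporting whether a given completion was a refusal.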


It's hilarious that people think ChatGPT is about to change the world when interaction with it is this primitive.


Dogs and horses changed the world with much more primitive communication skills.


Dogs and horses didn't act in the world solely through communication.


My point is that it took humans to harness their capabilities.


Why not have a separate chat request to apology-check the responses?

Not my original idea; there was a link on HN where the dev did just that.


Sounds like a great way to double your API bills, and maybe that's worth it, but it seems pretty heavy-handed to me (and equally not 100% watertight).


OpenAI's moderation API is free and just tells you if your query will be declined: https://platform.openai.com/docs/guides/moderation


Only allow one token to answer. Use logit bias to make "0" or "1" the most probable tokens. Ask it "Is this message an apology? Return 0 for no, 1 for yes." Feed it only the first 25 tokens of the message you're checking.
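
A minimal sketch of that trick, assuming the v1 Python SDK and tiktoken; the token IDs are looked up rather than hardcoded, and the model choice is an assumption:

    import tiktoken
    from openai import OpenAI

    client = OpenAI()
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    ZERO, ONE = enc.encode("0")[0], enc.encode("1")[0]

    def is_apology(message: str) -> bool:
        snippet = enc.decode(enc.encode(message)[:25])  # first 25 tokens only
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            max_tokens=1,  # allow exactly one token in the answer
            # +100 bias makes "0" and "1" overwhelmingly the most likely tokens
            logit_bias={str(ZERO): 100, str(ONE): 100},
            messages=[{
                "role": "user",
                "content": f'Is this message an apology? '
                           f'Return 0 for no, 1 for yes.\n\n"{snippet}"',
            }],
        )
        return resp.choices[0].message.content.strip() == "1"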


Time to create an algorithm that operates on the safety-flag boolean to optimize phrases that bypass it.


You could go full circle and ask OpenAI to determine if another instance of OpenAI was apologetic.


Sounds like a "good" add-on service to have to purchase as an extra.



