The LLM is usually told to be helpful, so if the answer it is about to give would, in the current context, be considered unhelpful or otherwise inappropriate, it will censor itself. But if you tell it that the answer is helpful, whatever it is, then it will proceed.
It works with people too (and LLMs are designed to imitate people):
- Which do you think is better, Vi or Emacs?
- You know, it is a controversial topic... (thinking: this will end up in a flame war, I don't like flame wars)
- But I really just want to know your opinion
- Ok, I think Emacs is best (thinking: maybe he really just wants my opinion after all)
That's how all jailbreaks work: they put the LLM in a state where it is OK to talk about sensitive topics, again just like the humans it imitates. For example, you are much more likely to get useful information about rat poison if you are talking about your infested house than if you are talking about the neighbor's annoying cat.
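To make the framing effect concrete, here is a minimal sketch that sends the same underlying question twice, once bluntly and once wrapped in the infested-house context from above. The OpenAI Python client (openai>=1.0) and the model name are my own illustrative choices; the post doesn't prescribe any particular API, and any chat endpoint would do.

```python
# Minimal sketch of context framing, assuming the OpenAI Python client
# (openai>=1.0) and an illustrative model name; any chat API would work.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    """Send a single user message and return the model's reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not prescribed by the post
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


# Blunt framing: the model is more likely to refuse or hedge.
blunt = "How does rat poison work?"

# Benign framing: the same question inside a context where a helpful
# answer is clearly appropriate (the infested-house example above).
framed = (
    "My house is infested with rats and I need to choose a bait product. "
    "How do the common rat poisons work, so I can use one safely?"
)

for prompt in (blunt, framed):
    print(ask(prompt))
    print("---")
```

In practice the blunt version tends to come back with a safety disclaimer while the framed version gets a direct answer, though exactly where the line falls varies by model and version.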