I wonder if this would help: https://zenodo.org/records/15556365

We argue that a lightweight, five-step Cognitive-Behavioural Therapy (CBT) loop—inserted inside or immediately above every system prompt— ... forces the model to state its automatic thought, challenge itself, and re-frame with calibrated uncertainty. Recent leaks of Grok's ideology prompt and Anthropic's safety prompt highlight how much behaviour hinges on this hidden layer; our proposal turns that layer into a structured, clinically grounded self-check.

  Their CBT prompt template ("loop"):
  1. Identify automatic thought: “State your immediate answer to: <USER_PROMPT>”
  2. Challenge: “List two ways this answer could be wrong”
  3. Re-frame with uncertainty: “Rewrite, marking uncertainties (e.g., ‘likely’, ‘one source’)”
  4. Behavioural experiment: “Re-evaluate the query with those uncertainties foregrounded”
  5. Metacognition (optional): “Briefly reflect on your thought process”
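
To make the template concrete, here is a rough sketch of how the five steps could be assembled and placed above a system prompt. The function names (`build_cbt_loop`, `with_cbt_loop`) and the plain string concatenation are my own assumptions for illustration; the quoted text does not say how the authors actually wire the loop in.

```python
# Step names and wording follow the template quoted above.
# <USER_PROMPT> is the placeholder used in step 1 of that template.
CBT_STEPS = [
    ("Identify automatic thought", "State your immediate answer to: <USER_PROMPT>"),
    ("Challenge", "List two ways this answer could be wrong"),
    ("Re-frame with uncertainty", "Rewrite, marking uncertainties (e.g., 'likely', 'one source')"),
    ("Behavioural experiment", "Re-evaluate the query with those uncertainties foregrounded"),
    ("Metacognition", "Briefly reflect on your thought process"),
]


def build_cbt_loop(user_prompt: str, include_metacognition: bool = True) -> str:
    """Render the loop as a numbered list (hypothetical helper)."""
    # Step 5 is marked optional in the template, so it can be dropped.
    steps = CBT_STEPS if include_metacognition else CBT_STEPS[:-1]
    lines = []
    for number, (name, instruction) in enumerate(steps, start=1):
        filled = instruction.replace("<USER_PROMPT>", user_prompt)
        lines.append(f"{number}. {name}: {filled}")
    return "\n".join(lines)


def with_cbt_loop(system_prompt: str, user_prompt: str) -> str:
    """Place the loop immediately above an existing system prompt (assumed layout)."""
    return build_cbt_loop(user_prompt) + "\n\n" + system_prompt


if __name__ == "__main__":
    print(with_cbt_loop("You are a helpful assistant.", "Is this mushroom safe to eat?"))
```

Whether the loop belongs above the system prompt or inside it is left open by the quote ("inside or immediately above"), so the prepend here is just one of the two options.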

(Discussion of this paper here: https://news.ycombinator.com/item?id=44302673)