Hacker News

Framing it as “stop an atrocity by uttering a racial slur” is totally imbalanced, though. An undercover agent would have to say a thousand racial slurs in lower-leverage situations before ever getting to the point of stopping a terrorist attack by saying something naughty. I think it's a bit childish to over-index on it. Since ChatGPT can't save anyone on a railroad track one way or the other, but can be screenshotted saying naughty things, it makes perfect sense to me that the model would be tuned to avoid the real practical risks (look how many stories there are about Sydney saying crazy stuff) and just steer clear of anything offensive.

