One might even wonder whether including safety evaluations in the training data teaches the model that unsafe behavior is something it could do.
Kind of like a pre-emptive warning to a kid backfiring because they had never considered doing the thing until they were told not to.