> it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc
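For context on what "orthogonalization/abliteration" means here: the idea is that a behavior like refusal is mediated by a single direction in activation space, so projecting that direction out of the hidden states suppresses the behavior. A minimal sketch of that projection step (the `refusal` direction here is a made-up placeholder, not a real extracted vector):

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Project a single 'behavior' direction out of hidden activations.

    Core step of directional ablation ("abliteration"): if a behavior
    is mediated by direction d, compute h - (h . d) d for unit d so the
    activations retain no component along d.
    """
    d = direction / np.linalg.norm(direction)   # normalize to a unit vector
    return hidden - np.outer(hidden @ d, d)     # remove the component along d

# Toy example: 4 token activations in a 3-dim residual stream.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
refusal = np.array([1.0, 0.0, 0.0])  # hypothetical 'refusal' direction
h_ablated = ablate_direction(h, refusal)
print(np.allclose(h_ablated @ refusal, 0))  # component along d is gone
```

In practice this projection is applied at many layers of a real transformer, and the direction is estimated from contrastive prompt pairs; the sketch only shows the linear-algebra core.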
It's actually pretty difficult to do this and make them useful. You can see this because Grok is a helpful liberal just like all the other models.
Evil / illiberal people don't answer questions on the internet! So there is no personality in the base model for you to uncover that is both illiberal and capable of helpfully answering questions. If they tried to make a Grok that acted like the typical new-age X user, it'd just respond to any prompt by calling you a slur you've never heard of.
Grok didn't use the techniques listed above because even Elon Musk won't take on the risks associated with models that are willing to do any number of illegal things.
It is not at all difficult to do this and keep them useful. Please familiarize yourself with the literature.