I am curious whether removing the “safety” in this manner makes the model smarter, or whether it impacts the model's performance in other ways.
Also wrt. unsafe content: is this the same as what you would find in an uncensored training set from the web? Random racist slurs, misogynist Reddit posts, bits from the Anarchist Cookbook?
Or is it capable of cooking up new bioweapons and a realistic plan for a homemade atom bomb? In other words, something you cannot find on the web.
Also: are you going to release the weights and source code for this?
HellaSwag and MMLU both improve slightly, by about 1%, but I'm unsure whether that is indicative of anything.
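For anyone who wants to sanity-check numbers like these, here is a minimal sketch of how one could compare a base and a fine-tuned checkpoint on the same tasks with EleutherAI's lm-evaluation-harness. The model names are placeholders (the second one is hypothetical), and this is an assumption about tooling rather than a description of the exact setup used here.

```python
# Minimal sketch: score two checkpoints on HellaSwag and MMLU with
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
# Both model names below are placeholders, not the actual checkpoints.
from lm_eval import simple_evaluate

for name in [
    "meta-llama/Llama-2-7b-chat-hf",   # original, safety-tuned model
    "your-org/llama-2-7b-finetuned",   # hypothetical fine-tuned variant
]:
    out = simple_evaluate(
        model="hf",                     # Hugging Face backend
        model_args=f"pretrained={name}",
        tasks=["hellaswag", "mmlu"],
        batch_size=8,
    )
    print(name, out["results"])        # per-task metric dicts
```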
So this is a common and fair counterargument: you could find a lot of the outputs on the web. And no, it certainly can't come up with realistic plans for bioweapons or homemade nukes. But:
1. I think this argument will get weaker with each iteration of Llama, though that kind of depends on how you expect scaling to play out. I think it is strictly good to know, before models become very dangerous, that trained safety features can easily be undone with subversive fine-tuning.
2. Models can make web content more accessible: you can ask them to clarify instructions or dumb them down. I expect at least future versions to make this significantly easier.
3. There are some things you can't easily google that Llama can do, for example writing a bunch of threatening emails personalized to individual profiles.