Sounds a bit unexpected from an information-theoretic point of view: you’ve seemingly managed to remove this knowledge from the full 32-bit representation of the model, but when you compress it down to 4 bits the knowledge reappears. Makes you wonder what information was actually lost in the compression / quantization step…
The ELI5 of the paper is that most "unlearning" methods can be regarded as adding some delta `w` to the parameters of the network, but most of `w` just gets "rounded away" during quantization (i.e. `quantize(X+w) ~= quantize(X)`). Pretty clever idea, since a lot of the cited methods explicitly optimize/regularize to keep `w` small to avoid degrading evaluation accuracy.
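A toy sketch of that rounding effect (my own illustration with a simple round-to-nearest quantizer, not necessarily the scheme the paper evaluates):

```python
import numpy as np

def quantize(x, bits=4):
    # Toy symmetric round-to-nearest quantizer: map floats to integer codes.
    # Real 4-bit schemes are more sophisticated (per-group scales, calibration,
    # etc.); this only illustrates the rounding.
    levels = 2 ** (bits - 1) - 1               # 7 levels on each side for 4 bits
    scale = np.abs(x).max() / levels
    return np.round(x / scale).astype(np.int8)

rng = np.random.default_rng(0)
X = rng.normal(0, 0.05, size=(64, 64))   # stand-in for the original fp32 weights
w = rng.normal(0, 1e-4, size=X.shape)    # small "unlearning" delta

unchanged = (quantize(X) == quantize(X + w)).mean()
print(f"4-bit codes unchanged by the delta: {unchanged:.1%}")  # ~100%
```

As long as each element of `w` is much smaller than the quantization step, the delta simply disappears in the rounding.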
To your point, it does call into question whether these methods can really be considered "unlearning" from an information-theoretic perspective (or whether it's the equivalent of, e.g., just putting `if (false)` around the still-latent knowledge).
I imagine that it's the expression of the knowledge that got removed from the 32-bit version, and some storage space was dedicated to knowing not to talk about certain things. For example, people know various racial slurs and know not to access or use that knowledge.
But say you, or your AI model, take a blow to the head (or a quantization pass): maybe you keep the knowledge of X but not the knowledge that told you not to talk about X. In that framing I think it's pretty straightforward.
Floating point always struck me as a strange representation for language. If we zoomed in on just one variable, does it have some set of meanings that sit on a gradient, more or less, but end up with special meanings associated with particular ranges? I can picture carefully designed neural circuits that could decode such a variable, and how you'd build a network specifically designed to do so, but it's not intuitive that neural networks would learn a structure like that on their own. (E.g. I can believe a scale from "good" to "bad", but not a large number of specific meanings at different values.)
If you think about it that way, you'd expect some kind of binary network to be highly effective. That doesn't seem to be the case, but it does seem that neural networks don't really use more than about 4 bits' worth of precision internally.
These "unlearning" systems aren't really removing the "engram" of the memory in the network but they are rather learning a new behavior to suppress certain outputs. (It's not too different from the problem of incrementally adding new knowledge to the network, except that what it is learning in phase 2 is quite different from general learning) If you didn't want to really screw a network up you can imagine adding a new behavior by adding another bit of precision. The network keeps its old behavior at low precision but at higher precision the network makes distinctions that are important to the "(un)learned" behavior.
> Sounds a bit unexpected from an information-theoretic point of view
It's very common in machine learning to use 'dropout layers' [1] during training, where different, randomly chosen values are temporarily turned off at each training step.
The intention is to ensure the network learns not to rely overmuch on any single value. Why should your cat-recognition neural network have a single whisker detector, when it could have ten whisker detectors and combine their outputs?
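(For anyone unfamiliar: a minimal inverted-dropout sketch in NumPy; frameworks ship this out of the box, e.g. `torch.nn.Dropout`.)

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    # Inverted dropout: during training, zero each unit with probability p and
    # rescale the survivors so the expected activation matches inference time.
    if not training:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones(10)           # pretend these are ten "whisker detectors"
print(dropout(h, p=0.5))  # roughly half zeroed, the rest scaled to 2.0
```

Because any individual unit can vanish on any training step, the network is pushed to spread "whisker-ness" across many units.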
I could well believe that, after intentionally ensuring knowledge of whiskers was redundant, removing that knowledge would be complicated.
Could it be that the unlearning is actually teaching the AI not to respond with certain information, and that that sort of learning is more nuanced, and thus easier to lose, than the original information, leading to the information being 'relearned' when the model is compressed?
It does raise the concern that anything the AI model might be doing could still be using the 'bad' information, even if it has learned not to show it directly.