Sounds a bit unexpected from an information-theoretic point of view: you’ve seemingly managed to remove this knowledge from the full 32-bit representation of the model, but when you compress it down to 4 bits the knowledge reappears. Makes you wonder what information was actually lost in the compression / quantization step…
The ELI5 of the paper is that most "unlearning" methods can be regarded as adding some delta `w` to the parameters of the network, but most of `w` just gets "rounded away" during quantization (i.e. `quantize(X+w) ~= quantize(X)`). Pretty clever idea, since a lot of the cited methods explicitly optimize/regularize to keep `w` small to avoid degrading evaluation accuracy.
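A toy sketch of that rounding effect (my own illustration with a simple round-to-nearest quantizer, not necessarily the scheme the paper evaluates):

```python
import numpy as np

def quantize(x, bits=4):
    # Toy symmetric round-to-nearest quantizer: map floats to integer codes.
    # Real 4-bit schemes are more sophisticated (per-group scales, calibration,
    # etc.); this only illustrates the rounding.
    levels = 2 ** (bits - 1) - 1               # 7 levels on each side for 4 bits
    scale = np.abs(x).max() / levels
    return np.round(x / scale).astype(np.int8)

rng = np.random.default_rng(0)
X = rng.normal(0, 0.05, size=(64, 64))   # stand-in for the original fp32 weights
w = rng.normal(0, 1e-4, size=X.shape)    # small "unlearning" delta

unchanged = (quantize(X) == quantize(X + w)).mean()
print(f"4-bit codes unchanged by the delta: {unchanged:.1%}")  # ~100%
```

As long as each element of `w` is much smaller than the quantization step, the delta simply disappears in the rounding.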
To your point, it does call into question whether these methods can really be considered "unlearning" from an information-theoretic perspective (or whether it's the equivalent of, e.g., just putting `if (false)` around the still-latent knowledge).
I imagine that it's the expression of the knowledge that got removed from the 32-bit version, and some storage space was dedicated to knowing not to talk about certain things. For example, people know various racial slurs and know not to access or use that knowledge.
But say you, or your AI model, take a blow to the head (or a quantization pass): maybe you keep the knowledge of X but not the knowledge that told you not to talk about X. In that framing I think it's pretty straightforward.
Floating point always struck me as a strange representation for language. If we zoomed in on just one variable, does it have some set of meanings that sit on a gradient, more or less, but end up with special meanings associated with particular ranges? I can picture carefully designed neural circuits that could decode such a variable, and how you'd build a network specifically designed to do so, but it's not intuitive that neural networks would learn a structure like that on their own. (E.g. I can believe a scale from "good" to "bad", but not a large number of specific meanings at different values.)
If you think about it that way, you'd expect some kind of binary network to be highly effective. That doesn't seem to be the case, but it does seem that neural networks don't really use more than about 4 bits' worth of precision internally.
These "unlearning" systems aren't really removing the "engram" of the memory in the network but they are rather learning a new behavior to suppress certain outputs. (It's not too different from the problem of incrementally adding new knowledge to the network, except that what it is learning in phase 2 is quite different from general learning) If you didn't want to really screw a network up you can imagine adding a new behavior by adding another bit of precision. The network keeps its old behavior at low precision but at higher precision the network makes distinctions that are important to the "(un)learned" behavior.
> Sounds a bit unexpected from an information-theoretic point of view
It's very common in machine learning to use 'dropout layers' [1] during training, where different, randomly chosen values are temporarily turned off at each training step.
The intention is to ensure the network learns not to rely overmuch on any single value. Why should your cat-recognition neural network have a single whisker detector, when it could have ten whisker detectors and combine their outputs?
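(For anyone unfamiliar: a minimal inverted-dropout sketch in NumPy; frameworks ship this out of the box, e.g. `torch.nn.Dropout`.)

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    # Inverted dropout: during training, zero each unit with probability p and
    # rescale the survivors so the expected activation matches inference time.
    if not training:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones(10)           # pretend these are ten "whisker detectors"
print(dropout(h, p=0.5))  # roughly half zeroed, the rest scaled to 2.0
```

Because any individual unit can vanish on any training step, the network is pushed to spread "whisker-ness" across many units.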
I could well believe that, after intentionally ensuring knowledge of whiskers was redundant, removing that knowledge would be complicated.
Could it be that the unlearning is actually teaching the AI not to respond with certain information, and that that sort of learning is more nuanced, and thus easier to lose, than the original information, leading to the information being 'relearned' when the model is compressed?
It does raise the concern that anything the AI model might be doing could still be using the 'bad' information, even if it has learned not to show it directly.