The experiments we conducted emphasize that the effective capacity of several successful neural network architectures is large enough to shatter the training data. Consequently, these models are in principle rich enough to memorize the training data.
So they're fitting elephants[1].
I've been trying to use DeepSpeech[2] lately for a project, would be interesting to see the results for that.
I guess it could also be a decent test for your model? Retrain it with random labels, and if it succeeds, the model is just memorizing, so either reduce model complexity or add more training data?
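Something like this is what I have in mind (sketch only; the sklearn digits data and MLPClassifier are just stand-ins for whatever model and data you're actually using):

```python
# Sanity-check sketch: retrain the same architecture on shuffled labels.
# High training accuracy on shuffled labels shows the model has the capacity to
# memorize; chance-level test accuracy confirms there was nothing to generalize.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y_tr)   # break any relationship between inputs and labels

for name, labels in [("true labels", y_tr), ("shuffled labels", y_shuffled)]:
    clf = MLPClassifier(hidden_layer_sizes=(512,), max_iter=2000, random_state=0)
    clf.fit(X_tr, labels)
    print(f"{name}: train acc {clf.score(X_tr, labels):.2f}, "
          f"test acc {clf.score(X_te, y_te):.2f}")
# Expect decent test accuracy with true labels and ~10% (chance) with shuffled ones;
# how close the shuffled-label train accuracy gets to 1.0 depends on capacity and budget.
```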
Model capacity large enough to perfectly memorize/interpolate the data may not be a bad thing. A phenomenon known as "deep double descent" says that increasing model capacity relative to the dataset size can reduce generalization error, even after the model achieves perfect training performance (see work by Mikhail Belkin [0] and empirical demonstrations on large deep learning tasks by researchers from Harvard/OpenAI [1]). Other work argues that memorization is critical to good performance on real-world tasks where the data distribution is often long-tailed [2]: to perform well at tasks where the training data set only has 1 or 2 examples, it's best to memorize those labels rather than extrapolate from other data (which a lower-capacity model may prefer to do).
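If you want to see the flavour of double descent without any deep learning machinery, a rough sketch of the usual random-features demonstration looks something like this (the sine target, noise level and feature counts are made up for illustration, not taken from [0] or [1]):

```python
# Rough double-descent sketch: random ReLU features + minimum-norm least squares.
# Test error often spikes near n_features ~ n_train and can come back down as the
# model becomes heavily overparameterized.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise = 40, 200, 0.3

def target(x):
    return np.sin(2 * np.pi * x)

x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + noise * rng.standard_normal(n_train)
x_te = rng.uniform(-1, 1, n_test)
y_te = target(x_te)

def relu_features(x, w, b):
    return np.maximum(0.0, np.outer(x, w) + b)

for n_feat in [5, 20, 40, 80, 200, 1000]:
    w = rng.standard_normal(n_feat)
    b = rng.standard_normal(n_feat)
    Phi_tr = relu_features(x_tr, w, b)
    Phi_te = relu_features(x_te, w, b)
    # lstsq returns the minimum-norm solution when the system is underdetermined,
    # which is what makes the second descent possible.
    coef, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    test_mse = np.mean((Phi_te @ coef - y_te) ** 2)
    print(f"{n_feat:5d} features  test MSE {test_mse:.3f}")
```

Whether the second descent is clearly visible depends on the noise level and on using the minimum-norm solution, but the spike near n_features ~ n_train is usually easy to reproduce.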
I think there's something critical about the implicit data universe being considered in the test and training data, and in these randomized datasets. Memorizing elephants isn't necessarily a bad thing if you can be assured there are actually elephants in the data, or if your job is to reproduce some data that has some highly non-random, low entropy (in an abstract sense) features.
I think where the phenomenon in this paper, and deep double descent, starts to clash with intuition is the more realistic case where the adversarial data universe is structured, not random, but not conforming to the observed training target label alphabet (to borrow a term loosely from the IT literature). That is, it's interesting to know that these models can perfectly reproduce random data, but generalizing from training to test data isn't interesting in a real-world sense if both are constrained by some implicit features of the data universe involved in the modeling process (e.g., that the non-elephant data only differs randomly from the elephant data, or doesn't contain any non-elephants that aren't represented in the training data). So then you end up with this: https://www.theverge.com/2020/9/20/21447998/twitter-photo-pr...
I guess it seems to me there are a lot of implicit assumptions about the data space and what's actually being inferred in a lot of these DL models. The insight about SGD is useful, but maybe only underscores certain things, and seems to get lost in some of the discussion about DDD. Even Rademacher complexity isn't taken with respect to the entire data space, just over a uniformly random sampling of it -- so it will underrepresent certain corners of the data space that are highly nonuniform, low entropy, which is exactly where the trouble lies.
There's lots of fascinating stuff in this area of research, lots to say, glad to see it here on HN again.
Thanks for the paper in your [2] link, it's an interesting one.
If you think about it, the high-level idea makes some intuitive sense. Since NNs are known to be "equivalent" to kernel methods - which are, to oversimplify, essentially nearest-neighbour + a similarity function - the ability of NNs to memorise specific training examples can be analogised to adding another "neighbour" to interpolate from. So maybe it's not too surprising that NNs which can do this can have better generalisation performance. (Although it is still surprising, since I certainly wouldn't predict it a priori!)
Really, what is the difference between a "feature" and "memorising a data point"? It seems only a matter of scale - how much of the input space is "relevant" to the learned representation.
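To make the caricature concrete, here's roughly what I mean by "nearest-neighbour + a similarity function" (toy sketch only - the actual NN/kernel correspondence in the literature is much more precise than this):

```python
# Caricature of a kernel method: predictions are similarity-weighted votes over every
# stored training example, so a memorised training point is literally just one more
# neighbour to interpolate from.
import numpy as np

def rbf(a, b, gamma=1.0):
    # the "similarity function" between a query point and a stored example
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_predict(x, X_train, y_train, gamma=1.0):
    weights = np.array([rbf(x, xi, gamma) for xi in X_train])
    return int(np.sign(weights @ y_train))   # similarity-weighted vote, labels in {-1, +1}

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, 1, 1])

print(kernel_predict(np.array([0.95, 1.0]), X_train, y_train))   # near the +1 cluster -> 1
print(kernel_predict(np.array([0.1, -0.1]), X_train, y_train))   # near the -1 point -> -1
```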
> Since NNs are known to be "equivalent" to kernel methods
I remember reading this somewhere, but I can't find the exact paper. Do you think you could link me to it?
Also, I seem to recall the paper didn't provide a method to actually construct the equivalent kernel. Do you know of any work since then that actually constructs an equivalent kernel method?
There is also this which seems to have been produced in parallel, closely related to the first but not exactly the same: https://arxiv.org/abs/1804.11271
That's not my reading. I think they are saying that models which *can* over-fit the data in both theory and practice appear not to do so when there are in fact generalizations in the data.
Ah hmm yes, good point. I forgot this piece from the article:
> We observe a steady deterioration of the generalization error as we increase the noise level. This shows that neural networks are able to capture the remaining signal in the data while at the same time fit the noisy part using brute-force.
Since they score well on the test data, they must have generalized to some degree. But since they're just as good at training on random input they also have the capacity to just memorize the training data.
Irrespective of the subject, DeepSpeech is a very old architecture with suboptimal results. You'd be better off trying any of the recent conformer implementations (flashlight, nemo, wenet, etc.) or wav2vec.
I ended up with DeepSpeech since it was very easy to get started with, and it has support for fairly low-latency inferencing which is very important for my project.
I will take a look at the ones you suggested though, thanks for the heads-up!
Would that necessarily imply it is memorizing on the non-random labels though? I know analogies to human learning are overdone, but I definitely have seen humans fall back on memorization when they are struggling to "actually learn" some topic.
So genuinely asking from a technical/ML perspective - is it possible a network could optimize training loss without memorization when possible, but as that fails end up just memorizing?
There's a spectrum of generalization-memorization. The extreme case of memorization is that it would only correctly classify images with precisely identical pixel values to the ones in the training set. When we say that people have "memorized" something, we are still far more general than that. We might key off a few words, say.
The behavior of highly overparameterized networks is basically what you suggest: they will memorize if needed (which you can test by randomizing labels) but will usually generalize better when possible.
> is it possible a network could optimize training loss without memorization when possible
What does that even mean? ML is simply a set of tools to minimize a loss like Σ_i (f(x_i) - y_i)^2 (if you don't like squared loss, pick whatever you prefer). In particular, f() here is our neural network. There is no loss without previous data. The only thing the network is trying to do is "memorize" old data.
It depends of course how you are defining memorization, but the network doesn't necessarily need to use the entirety of every input to do what you are describing. I would think what people mean when they say "it isn't learning, just memorizing" is that the vast majority of information about all previous inputs is being directly encoded in the network.
The person I was responding to mentioned training on random labels, and if training still goes well the network must be a memorizer. But I don't see why it couldn't be the case that a network is able to act as a memorizer, but doesn't if there are certain patterns it can generalize on in the training data.
Also, there is no human learning without previous data either, but I wouldn't characterize all of human learning as memorization.
I now understand what you were trying to say. I think it is still wrong to think of it as "memorization vs. generalization". In my opinion, the question is "why does it happen to generalize when it is in the process of memorization?" As alluded to in another comment, my belief is that the dimensionality of neural networks, combined with certain optimization schemes, favors networks whose outputs aren't overly sensitive to changes in the data. That is basically the definition of generalization. It also explains why double descent occurs. First, it memorizes as best it can, since this is "easy"; then the optimization scheme starts to push it towards parameters that yield better generalizability.
I don't understand the random label training part. Presumably you train on randomised labels which have no relationship with the input, but surely it won't generalise well at all, given the small probability of predicting the labels correctly by chance (the setup for a permutation test, am I wrong)?
If you look at Table 1, you see that the models manage to train almost 100% correctly on the randomized labels, but crucially the control test score is down in the 10% region. This is in stark contrast to the roughly 80-90% test score for the properly labeled data.
So it seems to me that when faced with structured data they manage to generalize the structure somehow, while when faced with random training data they're powerful enough to simply memorize the training data.
edit: so just to point out, obviously it's to be expected that the test score will be bad for random input; after all, how can you properly test classification of random data?
So the point, as I understand it, isn't that the randomized input leads to poor test results, but rather that the non-randomized ones manage to generalize despite the model being capable of simply memorizing the input.
AFAIK that's right, it would be very unlikely to generalize on random labels, which is why I read the comment as suggesting the network shouldn't have low training loss in that situation.
I tend to think this is a result of classification's information density being too low. You can 'learn' a classification problem in the same way with a hash function: take a hash of each image, and memorize the hash and label. Then you only need to 'learn' a very tiny amount of data relative to the size of the dataset to get zero loss. A series of random projections can also function as a crude hash function, and this is likely how NN memorization works.
Generative models, on the other hand, don't allow this kind of data reduction trick. If you need to predict (say) every sample of an audio stream conditioned on the previous audio samples, you really do need to memorize the whole dataset (not just a hash of each item) to get to zero loss, because the information density of the output is still very high.
And then you've got BERT, which is basically a generative model for language, used for downstream tasks. The information-dense task probably helps with the memorization 'problem' so good features are learned that adapt nicely to other tasks. (and as others have said, memorization may not be a problem in practice. Sometimes it's actually the right thing to do.)
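To spell out the hash trick I mean (toy code, purely to make the point about information density; real NN memorization is of course soft, more like a crude learned hash from random projections than an exact one):

```python
# Toy version of the hash trick: "learn" a classifier by memorizing hash(image) -> label.
# Zero training loss, no generalization, and the stored table is tiny compared with the
# raw dataset. (Hypothetical stand-in data.)
import hashlib

def image_hash(pixels: bytes) -> str:
    return hashlib.sha256(pixels).hexdigest()

def fit(images, labels):
    return {image_hash(img): lab for img, lab in zip(images, labels)}

def predict(table, img, default_label=0):
    # exact match or nothing: the extreme end of the memorization spectrum
    return table.get(image_hash(img), default_label)

train_images = [b"\x00" * 784, b"\xff" * 784]   # stand-ins for two 28x28 images
train_labels = [3, 7]
table = fit(train_images, train_labels)

assert all(predict(table, img) == lab for img, lab in zip(train_images, train_labels))
print(predict(table, b"\x01" + b"\x00" * 783))  # unseen image -> falls back to the default
```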
The thing is, despite the fact that deep neural networks can effectively learn the hash function you describe (i.e. they have the capacity to do so, and they will do so if "forced" to with random labels), they "prefer" not to. They learn a generalizable pattern in the data. That's the mystery this paper points to.
Yeah, I've also done a bunch of work on hard classification problems where generalization pretty much fails. I think they're /probably/ doing the right thing in the infinite training data regime. And they're definitely doing interesting things in the generative case.
But ultimately they look for a convenient 'nearby' solution: that's the nature of stochastic gradient descent. If you have two solutions and one of them is the hash function, it would be helpful to /know/ that the hash function is harder to get to than the 'correct' solution.
Yeah, not a bad idea; for any given problem, just scramble the labels and compare training time...
For my pet problem, there's quite different performance across classes. It would be tougher to do the comparison for a subset of classes, but could still probably design some kind of experiment.
The paper says that: "Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data"
What is a generalization of a random labeling? Does the network become psychic? Is this saying "garbage works fine by these measures"?
The network learns how to map data into an intermediate representation, then learns a hash output for each of these intermediate values. So it learns two things compared to when the intermediate values correspond to the task.
I'm still honestly confused here. Yes, the success of neural networks generally shows they more than memorize. Some of the experiments here show they can just memorize too. I guess some describe something in-between.
But all this seems obvious. Is there something that's being quantified here?
You can measure the generalization gap: the difference between training and test performance. With good generalization that gap will be small. When fitting random labels the gap will be large.
Classical statistical learning theory (from Vapnik and Chervonenkis, among others) predicts that low-capacity models will generalize (have small generalization gaps). It makes no promises for large-capacity models, i.e. those which can fit random labels. Yet here we are.
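For reference, one common statement of the classical bound (constants vary by textbook): with probability at least 1 - δ, for every h in a hypothesis class H of VC dimension d,

```latex
R(h) \;\le\; \hat{R}_n(h) \;+\; \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}}
```

so the guaranteed gap shrinks like sqrt(d/n). For models whose effective capacity is large enough to shatter the training set (as the paper argues), d is at least n, so the bound is vacuous even though the observed gap is small. That is exactly the tension here.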
It's possible to have a model be both a classifier and a generative model, and the performance is comparable to (although lower than) the SOTA of single-purpose models.
Why exactly is it a problem that powerful models can memorise training data with random labels, if they generalise well on non-memorisation problems?
I also think that a lot of progress is being made seemingly in parallel on this "understanding NN [generalisation]" topic, and the different threads don't necessarily seem to be aware of each other.
This article seems to have failed to mention most of this recent progress. Possibly this is an artifact of the time it takes to produce and publish a completed article, but it paints a picture in which we are more clueless than we actually are.
The most promising, IMO, is outlined in these blog posts:
which summarise 3 research papers. The basic idea is that because NN outputs are usually thresholded to produce a prediction/classification, there is not always a change in the output if a parameter changes. This gives space for there to be a non-uniform distribution over output functions, if weights are initialised randomly. The authors investigate this empirically (with some theoretical justification) and find that this bias is toward "simple" functions, which often generalise well, due to "simple" properties mostly being what we care about in real data. Then generalisation can be explained by SGD simply being more likely to find parameters which cause the NN to express a function that is simpler, out of all functions that perfectly match the training data.
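You can get a feel for this bias with a very small experiment (a toy version in the spirit of those papers, not a reproduction of them): sample random weights for a tiny thresholded network on all 3-bit inputs and count how often each boolean function shows up.

```python
# Empirical sketch of the "bias toward simple functions" claim: the distribution over
# the boolean functions induced by random weights is typically far from uniform, with
# constant / low-complexity functions appearing most often. (Illustrative only.)
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
inputs = np.array(list(itertools.product([0.0, 1.0], repeat=3)))   # all 8 binary inputs

def random_net_outputs(hidden=8):
    W1 = rng.standard_normal((3, hidden))
    b1 = rng.standard_normal(hidden)
    W2 = rng.standard_normal(hidden)
    h = np.tanh(inputs @ W1 + b1)
    # threshold the scalar output to get a boolean function over the 8 inputs
    return tuple(int(v) for v in (h @ W2 > 0))

counts = Counter(random_net_outputs() for _ in range(20000))
print("distinct functions seen:", len(counts), "out of", 2 ** 8, "possible")
for fn, c in counts.most_common(5):
    print(fn, c)
```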
There are also some interesting investigations into infinite-width limits - the most well known being the result that NNs are equivalent to Gaussian Processes at initialisation, then Neural Tangent Kernels [1] showing a way to view them through the lens of kernel methods throughout training as well.
But then there is Feature Learning in Infinite-Width Neural Networks [2] which seems to be at odds with the kernel approach (due to kernels not learning features) - this paper provides a different method for parametrisation to admit feature learning.
So there is progress both in understanding generalisation and understanding when feature learning does & does not occur. (And this is hardly a comprehensive sampling.)
I've recently been doing some work on a model that mixes a big convolutional model with a second piece that we can actually analyze with the signal processing toolkit. What I see on the signal-processing side is a mix of really interesting good behavior (including one cool trick I hadn't seen before), right alongside a bunch of obviously-dumb pathological behavior. (Basically, glaring discontinuities which induce the model to introduce /other/ discontinuities to course-correct... with really bad impacts on output quality.) I found some nice tweaks that get rid of /most/ of the pathological behavior on the sig-proc side, and produce some really nice results.
But for the main black-box network? No such visibility. It is undoubtedly doing all kinds of stupid shit that I have no insight into. And I would /love/ to be able to sort it out and get rid of the dumb parts, because I a) want tiny models that run hella fast on slow-ass phones, and b) would love to eliminate sources of spurious correlations to provide better predictions. Getting rid of the dumb parts means more compute to do the smart things.
The high-level theory answers maaaybe tell me about convergence but not whether the NN converges to anything reasonable. Giant models provide a huge amount of flexibility, which help the model find answers, but the answers may not be any good, and we really don't have robust ways to know. (eg, aggregates over big eval datasets may tell me something could be better, but not what or why.)
This is only tangentially my field, so pure speculation.
I suppose it's possible that generalized minima are numerically more common than overfitted minima in an over-parameterized model, so probabilistically SGD will find a more general minimum than not, regardless of regularization.
I think the general consensus (from my interactions) is that a local minimum requires the gradient to vanish. When you have many dimensions, it's unlikely that they are all 0. Coupled with modern optimization methods (primarily momentum), this encourages the result to be in a shallow valley as opposed to a spiky minimum. The leap of faith is equating shallow=general and spiky=overfitted.
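A purely illustrative toy picture of that intuition (this isn't evidence for the shallow=general leap of faith, it just shows what "sensitivity to perturbation" means here):

```python
# Toy "shallow valley vs spiky minimum" illustration: perturb the parameter at a wide
# minimum and at a narrow minimum of a made-up 1-D loss and compare how much the loss
# moves. (Pure caricature; real loss landscapes are high-dimensional.)
import numpy as np

def loss(w):
    wide = 0.1 * (w + 2.0) ** 2      # broad basin around w = -2
    narrow = 10.0 * (w - 2.0) ** 2   # sharp basin around w = +2, same depth (zero)
    return np.minimum(wide, narrow)

rng = np.random.default_rng(0)
for name, w_star in [("wide minimum", -2.0), ("narrow minimum", 2.0)]:
    perturbed = w_star + 0.3 * rng.standard_normal(10000)
    print(name, "mean loss under perturbation:", round(float(loss(perturbed).mean()), 4))
# The sharp minimum is far more sensitive to parameter noise, which is the usual
# hand-wavy reason why noisy SGD / momentum is thought to prefer wide basins.
```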
I would be interested to know if there is any effect on the time to convergence between the two setups (i.e. training with real labels vs. random labels). Is it in any way easier/harder to memorize everything vs. extracting a generalizable representation? Edit: Ah, I see they mention that they not only investigated whether training converges but also whether it slows down. So randomising/corrupting the labels does impact the convergence rate. This means it is more work/effort to memorize things, which I guess is interesting.
> Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training.
I'd say it's more about the simplicity of the task and quality of the data.
[1]: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elep...
[2]: https://github.com/mozilla/DeepSpeech