It seems they are trying to answer "WHAT do NNs learn?" and "How do NNs WORK?" as much as their title question of "How do NNs learn?".
Here's an excerpt from the article:
"The researchers found that a formula used in statistical analysis provides a streamlined mathematical description of how neural networks, such as GPT-2, a precursor to ChatGPT, learn relevant patterns in data, known as features. This formula also explains how neural networks use these relevant patterns to make predictions."
The trite answer to "HOW do NNs learn?" is obviously gradient descent - error minimization, with the features learnt being those that best support error minimization by the higher layers; effectively the network learns some basis set of features that can be composed into more complex, higher-level patterns.
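To make that generic answer concrete, here is a minimal sketch (mine, not the article's formula; everything in it is illustrative): a tiny two-layer network fitted to XOR by gradient descent in plain numpy, where the hidden layer ends up learning a small basis of features that the output layer composes.

    # Minimal sketch of the generic "error minimization" answer (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

    W1 = rng.normal(size=(2, 8))   # input -> hidden "features"
    W2 = rng.normal(size=(8, 1))   # hidden features -> prediction
    lr = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(10000):
        h = sigmoid(X @ W1)        # hidden features
        out = sigmoid(h @ W2)      # prediction
        err = out - y              # the error being minimized
        # Backpropagate the error and take one gradient descent step.
        d_out = err * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out
        W1 -= lr * X.T @ d_h

    print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))   # approx [0, 1, 1, 0]

The hidden units are the "basis set" here; nothing in the training rule says what a given architecture will choose to put in them, which is the WHAT question below.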
The more interesting question perhaps is WHAT (not HOW) do NNs learn, and there doesn't seem to be any single answer to that - it depends on the network architecture. What a CNN learns is not the same as what an LLM such as GPT-2 (which they claim to address) learns.
What an LLM learns is tied to the question of how a trained LLM actually works, and this is very much a research question - the field of mechanistic interpretability (induction head circuits, and so forth). I guess you could combine this with the question of HOW an LLM learns if you are looking for a higher level, transformer-specific answer, and not just the generic error minimization answer: how does a transformer learn those circuits?
Other types of NN may be better understood, but anyone claiming to fully know how an LLM works is deluding themselves. Companies like Anthropic don't themselves fully know, and in fact have mechanistic interpretability as a potential roadblock to further scaling since they have committed to scaling safely, and want to understand the inner workings of the model in order both to control it and provide guarantees that a larger model has not learnt to do anything dangerous.
In short, we do know how NNs learn and work, but not what NNs learn. The corollary being that we don't understand where the emergent properties come from.
It depends on the type of NN, and also on what level of explanation you are looking for. At the basic level we do of course know how NNs learn, and what any architecture is doing (what each piece is doing), since we designed them!
In the case of LLMs like ChatGPT, while we understand the architecture, and how it works at that level (attention via key matching, etc), what is missing is how the architecture is actually being utilized by the trained model. For example, it turns out that consecutive pairs of attention heads sometimes learn to coordinate and can look words (tokens) up in the context and copy them to the output - this isn't something you could really have predicted just by looking at the architecture. Companies like Anthropic that are developing these models have discovered a few such insights into how they actually work, but not too many!
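As a toy illustration of that "look it up in the context and copy it" behaviour (hand-wired pattern matching, not a real trained transformer or anyone's actual analysis):

    # Sketch of the induction-head behaviour described above. A real circuit does
    # this with two coordinating attention heads; here it is spelled out as
    # explicit matching, purely to illustrate the learned behaviour.
    def induction_copy(tokens):
        current = tokens[-1]
        # "previous-token head": what token precedes each position
        prev = {i: tokens[i - 1] for i in range(1, len(tokens))}
        # "induction head": find a position whose previous token matches the
        # current token, and copy the token found there
        for i in range(len(tokens) - 2, 0, -1):   # most recent match wins
            if prev[i] == current:
                return tokens[i]
        return None

    print(induction_copy(["Mr", "Dursley", "was", "proud", "of", "Mr"]))
    # -> "Dursley": continue "Mr" with whatever followed "Mr" earlier in the context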
Yes, we don't really understand where emergent capabilities are coming from, at least not to the extent of being able to predict them ahead of time ("if we feed it this amount of data, of this type, it'll learn to do X"). New emergent capabilities arise, from time to time, as models are scaled up, but no one can predict exactly what their next-gen model is going to be capable of.
>Yes, we don't really understand where emergent capabilities are coming from, at least not to the extent of being able to predict them ahead of time ("if we feed it this amount of data, of this type, it'll learn to do X"). New emergent capabilities arise, from time to time, as models are scaled up, but no one can predict exactly what their next-gen model is going to be capable of.
While finite-precision, finite-width transformers aren't Turing complete, I don't see why the same property as the Game of Life, where one cannot predict the end state from the starting state without actually running it, wouldn't hold (a minimal simulation sketch follows below).
As we know, transformers are at least as powerful as TC^0, which contains AC^0, which in turn is as powerful as first-order logic; first-order logic is undecidable and thus may be similar to HALT, where we will never be able to accurately predict when emergence happens, so approximation may be the best we can do unless there are constraints, through something like the parallelism tradeoff, that allow for it.
If you consider that PCP[O(log n), O(1)] = NP, i.e. that only O(log n) random bits (plus a constant number of queries) are needed to verify NP, the results of this paper seem more plausible.
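To make the Game of Life point concrete, here is a minimal step function (my sketch, just the standard rules), checked against the classic glider; in general the only way to find out what a configuration ends up doing is to run it.

    # Minimal Conway's Game of Life step over a set of live (x, y) cells.
    from collections import Counter

    def step(live):
        # Count how many live neighbours every cell has.
        counts = Counter((x + dx, y + dy)
                         for (x, y) in live
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                         if (dx, dy) != (0, 0))
        # Standard rules: birth on 3 neighbours, survival on 2 or 3.
        return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

    glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
    state = glider
    for _ in range(4):                    # four steps move the glider one cell diagonally
        state = step(state)
    print(state == {(x + 1, y + 1) for (x, y) in glider})   # True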
I don't see that the difficulty of predicting/anticipating emergent capabilities is really related to undecidability, although there is perhaps a useful computer analogy... We could think of the trained LLM as a computer, and the prompt as the program, and certainly it would be difficult/impossible to predict the output without just running the program.
The problem with trying to anticipate the capabilities of a new model/training-set is that we don't even know what the new computer itself will be capable of, or how it will now interpret the program.
The way I'd tend to view it is that an existing trained model has some set of capabilities which reflect what can be done by combining the set of data-patterns/data-manipulations ("thought patterns" ?) that it has learnt. If we scale up the model and add more training data (perhaps some of a different type than has been used before), then there are two unknowns:
1) What new data-patterns/data-manipulations will it be able to learn ?
2) What new capabilities will become possible by using these new patterns/manipulations in combination with what it had before ?
Maybe it's a bit like having a construction set of various parts, and considering what new types of things could be built with it if we added some new parts (e.g. a beam, or gear, or wheel), except we are trying to predict this without even knowing what those new parts will be.
No - emergent properties are primarily a function of scaling up NN size and training data. I don't think they are much dependent on the training process.
Of course they are? If you train in a different order, start with different weights, or change the gradient step size, different things will emerge out of an otherwise identical NN.
You can see this in videos where people train a NN to do something multiple times: each time, the NN picks up on something slightly different. Slight variations in what is fed as input during training can cause surprisingly high variation in what is picked up on.
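As a toy illustration (my sketch, the same tiny XOR setup as the sketch further up the thread): train the same network on the same data twice, changing only the seed and the presentation order, and the learned weights come out clearly different even when both fits are good.

    # Same data, same architecture; only the seed and example order change.
    import numpy as np

    def train(seed):
        rng = np.random.default_rng(seed)
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        y = np.array([[0], [1], [1], [0]], dtype=float)
        W1, W2 = rng.normal(size=(2, 8)), rng.normal(size=(8, 1))
        sig = lambda z: 1.0 / (1.0 + np.exp(-z))
        order = rng.permutation(4)          # different presentation order per seed
        for _ in range(5000):
            for i in order:                 # plain SGD, one example at a time
                h = sig(X[i:i+1] @ W1)
                out = sig(h @ W2)
                d_out = (out - y[i:i+1]) * out * (1 - out)
                d_h = (d_out @ W2.T) * h * (1 - h)
                W2 -= 0.5 * h.T @ d_out
                W1 -= 0.5 * X[i:i+1].T @ d_h
        loss = float(np.mean((sig(sig(X @ W1) @ W2) - y) ** 2))
        return W1, loss

    W1_a, loss_a = train(seed=0)
    W1_b, loss_b = train(seed=1)
    print(round(loss_a, 3), round(loss_b, 3))   # both losses should end up small...
    print(np.allclose(W1_a, W1_b, atol=0.1))    # ...but the learned weights differ: False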
I’m getting decently annoyed with HN's constant pretending that this is all just "magic".
You're talking about something a bit different - dependence on how the NN is initialized, etc. When people talk about "emergent properties" of LLMs, this is not what they are talking about - they are talking about specific capabilities that the net has that were not anticipated. For example, LLMs can translate between different languages despite not being trained to do this - that would be considered an emergent property.
Nobody is saying this is magic - it's just something that is (with our current level of knowledge) impossible to predict will happen. If you scale a model up, and/or give it more training data, then it'll usually get better at what it could already do, but it may also develop some new (emergent) capabilities that no-one had anticipated.
Finding unexpected connections is something we’ve known LLMs are good at for ages. Connecting things you didn’t even know are connected is like “selling LLM business to business 101”. It’s the first line of a sales pitch dude.
And that’s still beside the point that the properties that emerge can greatly differ just by changing the ordering of your training.
Again, we see this with NNs trained to play games. The strategies that emerge are completely unexpected and, when you train a NN multiple times, often differ - sometimes greatly, sometimes only slightly.