It seems they are trying to answer "WHAT do NNs learn?" and "How do NNs WORK?" as much as their title question of "How do NNs learn?".
Here's an excerpt from the article:
"The researchers found that a formula used in statistical analysis provides a streamlined mathematical description of how neural networks, such as GPT-2, a precursor to ChatGPT, learn relevant patterns in data, known as features. This formula also explains how neural networks use these relevant patterns to make predictions."
The trite answer to "HOW do NNs learn?" is obviously gradient descent - error minimization, with the features learnt being those that best support error minimization by the higher layers; effectively the network learns some basis set of features that can be composed into more complex, higher-level patterns.
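To make that generic answer concrete, here is a minimal sketch (mine, not the article's formula; everything in it is illustrative): a tiny two-layer network fitted to XOR by gradient descent in plain numpy, where the hidden layer ends up learning a small basis of features that the output layer composes.

    # Minimal sketch of the generic "error minimization" answer (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

    W1 = rng.normal(size=(2, 8))   # input -> hidden "features"
    W2 = rng.normal(size=(8, 1))   # hidden features -> prediction
    lr = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(10000):
        h = sigmoid(X @ W1)        # hidden features
        out = sigmoid(h @ W2)      # prediction
        err = out - y              # the error being minimized
        # Backpropagate the error and take one gradient descent step.
        d_out = err * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out
        W1 -= lr * X.T @ d_h

    print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))   # approx [0, 1, 1, 0]

The hidden units are the "basis set" here; nothing in the training rule says what a given architecture will choose to put in them, which is the WHAT question below.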
The more interesting question perhaps is WHAT (not HOW) do NNs learn, and there doesn't seem to be any single answer to that - it depends on the network architecture. What a CNN learns is not the same as what an LLM such as GPT-2 (which they claim to address) learns.
What an LLM learns is tied to the question of how a trained LLM actually works, and this is very much a research question - the field of mechanistic interpretability (induction head circuits, and so forth). I guess you could combine this with the question of HOW an LLM learns if you are looking for a higher level, transformer-specific answer, and not just the generic error minimization answer: how does a transformer learn those circuits?
Other types of NN may be better understood, but anyone claiming to fully know how an LLM works is deluding themselves. Companies like Anthropic don't themselves fully know, and in fact have mechanistic interpretability as a potential roadblock to further scaling since they have committed to scaling safely, and want to understand the inner workings of the model in order both to control it and provide guarantees that a larger model has not learnt to do anything dangerous.
In short, we do know how NNs learn and work, but not what NNs learn. The corollary being that we don't understand where the emergent properties come from.
It depends on the type of NN, and also on what level of explanation you are looking for. At the basic level we do of course know how NNs learn, and what any architecture is doing (what each piece is doing), since we designed them!
In the case of LLMs like ChatGPT, while we understand the architecture, and how it works at that level (attention via key matching, etc), what is missing is how the architecture is actually being utilized by the trained model. For example, it turns out that consecutive pairs of attention heads sometimes learn to coordinate and can look words (tokens) up in the context and copy them to the output - this isn't something you could really have predicted just by looking at the architecture. Companies like Anthropic that are developing these models have discovered a few such insights into how they actually work, but not too many!
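As a toy illustration of that "look it up in the context and copy it" behaviour (hand-wired pattern matching, not a real trained transformer or anyone's actual analysis):

    # Sketch of the induction-head behaviour described above. A real circuit does
    # this with two coordinating attention heads; here it is spelled out as
    # explicit matching, purely to illustrate the learned behaviour.
    def induction_copy(tokens):
        current = tokens[-1]
        # "previous-token head": what token precedes each position
        prev = {i: tokens[i - 1] for i in range(1, len(tokens))}
        # "induction head": find a position whose previous token matches the
        # current token, and copy the token found there
        for i in range(len(tokens) - 2, 0, -1):   # most recent match wins
            if prev[i] == current:
                return tokens[i]
        return None

    print(induction_copy(["Mr", "Dursley", "was", "proud", "of", "Mr"]))
    # -> "Dursley": continue "Mr" with whatever followed "Mr" earlier in the context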
Yes, we don't really understand where emergent capabilities are coming from, at least not to the extent of being able to predict them ahead of time ("if we feed it this amount of data, of this type, it'll learn to do X"). New emergent capabilities arise, from time to time, as models are scaled up, but no one can predict exactly what their next-gen model is going to be capable of.
>Yes, we don't really understand where emergent capabilities are coming from, at least not to the extent of being able to predict them ahead of time ("if we feed it this amount of data, of this type, it'll learn to do X"). New emergent capabilities arise, from time to time, as models are scaled up, but no one can predict exactly what their next-gen model is going to be capable of.
While finite-precision, finite-width transformers aren't Turing complete, I don't see why the same property as the Game of Life, where one cannot predict the end state from the starting state without actually running it, wouldn't hold (a minimal simulation sketch follows below).
As we know, transformers are at least as powerful as TC^0, which contains AC^0, which in turn is as powerful as first-order logic; first-order logic is undecidable and thus may be similar to HALT, where we will never be able to accurately predict when emergence happens, so approximation may be the best we can do unless there are constraints, through something like the parallelism tradeoff, that allow for it.
If you consider that PCP[O(log n), O(1)] = NP, i.e. that only O(log n) random bits (plus a constant number of queries) are needed to verify NP, the results of this paper seem more plausible.
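To make the Game of Life point concrete, here is a minimal step function (my sketch, just the standard rules), checked against the classic glider; in general the only way to find out what a configuration ends up doing is to run it.

    # Minimal Conway's Game of Life step over a set of live (x, y) cells.
    from collections import Counter

    def step(live):
        # Count how many live neighbours every cell has.
        counts = Counter((x + dx, y + dy)
                         for (x, y) in live
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                         if (dx, dy) != (0, 0))
        # Standard rules: birth on 3 neighbours, survival on 2 or 3.
        return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

    glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
    state = glider
    for _ in range(4):                    # four steps move the glider one cell diagonally
        state = step(state)
    print(state == {(x + 1, y + 1) for (x, y) in glider})   # True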
I don't see that the difficulty of predicting/anticipating emergent capabilities is really related to undecidability, although there is perhaps a useful computer analogy... We could think of the trained LLM as a computer, and the prompt as the program, and certainly it would be difficult/impossible to predict the output without just running the program.
The problem with trying to anticipate the capabilities of a new model/training-set is that we don't even know what the new computer itself will be capable of, or how it will now interpret the program.
The way I'd tend to view it is that an existing trained model has some set of capabilities which reflect what can be done by combining the set of data-patterns/data-manipulations ("thought patterns" ?) that it has learnt. If we scale up the model and add more training data (perhaps some of a different type than has been used before), then there are two unknowns:
1) What new data-patterns/data-manipulations will it be able to learn ?
2) What new capabilities will become possible by using these new patterns/manipulations in combination with what it had before ?
Maybe it's a bit like having a construction set of various parts, and considering what new types of things could be built with it if we added some new parts (e.g. a beam, or gear, or wheel), except we are trying to predict this without even knowing what those new parts will be.
No - emergent properties are primarily a function of scaling up NN size and training data. I don't think they are much dependent on the training process.
Of course they are? If you train in a different order, start with different weights, or change the gradient step size, different things will emerge out of an otherwise identical NN.
You can see this in videos where people train a NN to do something multiple times: each time, the NN picks up on something slightly different. Slight variations in what is fed as input during training can cause surprisingly high variation in what is picked up on.
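As a toy illustration (my sketch, the same tiny XOR setup as the sketch further up the thread): train the same network on the same data twice, changing only the seed and the presentation order, and the learned weights come out clearly different even when both fits are good.

    # Same data, same architecture; only the seed and example order change.
    import numpy as np

    def train(seed):
        rng = np.random.default_rng(seed)
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        y = np.array([[0], [1], [1], [0]], dtype=float)
        W1, W2 = rng.normal(size=(2, 8)), rng.normal(size=(8, 1))
        sig = lambda z: 1.0 / (1.0 + np.exp(-z))
        order = rng.permutation(4)          # different presentation order per seed
        for _ in range(5000):
            for i in order:                 # plain SGD, one example at a time
                h = sig(X[i:i+1] @ W1)
                out = sig(h @ W2)
                d_out = (out - y[i:i+1]) * out * (1 - out)
                d_h = (d_out @ W2.T) * h * (1 - h)
                W2 -= 0.5 * h.T @ d_out
                W1 -= 0.5 * X[i:i+1].T @ d_h
        loss = float(np.mean((sig(sig(X @ W1) @ W2) - y) ** 2))
        return W1, loss

    W1_a, loss_a = train(seed=0)
    W1_b, loss_b = train(seed=1)
    print(round(loss_a, 3), round(loss_b, 3))   # both losses should end up small...
    print(np.allclose(W1_a, W1_b, atol=0.1))    # ...but the learned weights differ: False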
I’m getting decently annoyed with HN's constant pretending that this is all just "magic".
You're talking about something a bit different - dependence on how the NN is initialized, etc. When people talk about "emergent properties" of LLMs, this is not what they are talking about - they are talking about specific capabilities that the net has that were not anticipated. For example, LLMs can translate between different languages despite not being trained to do this - that would be considered an emergent property.
Nobody is saying this is magic - it's just something that is (with our current level of knowledge) impossible to predict will happen. If you scale a model up, and/or give it more training data, then it'll usually get better at what it could already do, but it may also develop some new (emergent) capabilities that no-one had anticipated.
Finding unexpected connections is something we’ve known LLMs are good at for ages. Connecting things you didn’t even know are connected is like “selling LLM business to business 101”. It’s the first line of a sales pitch dude.
And that’s still beside the point that the properties that emerge can greatly differ just by changing the ordering of your training.
Again, we see this with NNs trained to play games. The strategies that emerge are completely unexpected and, when you train a NN multiple times, often differ - sometimes greatly, sometimes only slightly.