
Ok, at the risk of sounding exactly like one of the "sentience" guys: "statistical probabilities of relationships between strings of characters" is a non-answer that makes the whole thing sound simpler than it is without actually explaining how it works. What kind of statistical probabilities exactly?

Right now, research is still ongoing into what kind of knowledge is actually captured inside the models, but there are some hints that higher-level, "semantic" knowledge might emerge during training: https://thegradient.pub/othello/



Both are happening, I believe.

The LLM absorbs the manifold (high-dimensional shape) of the data. The manifold contains the underlying concepts of grammatical structure, abstract thought, and inductive reasoning, which the LLM (within reasonable capacity and structural capability) captures because it is a more efficient representation of the underlying data.

Transformers are simple and contain the correct building blocks to efficiently capture these representations during training.

This allows for out-of-domain generalization.

This is, to my knowledge, most of what happens in deep learning generally, and is likely 90% of what one needs to know to understand neural networks at their core. I think it's pretty basic, but it often gets buried under unconscionable mathematical symbols and fancy-speak until it's of no use to anyone reading.


Yeah, that seems to be my impression as well. Of course it's true that in a sense all it does is capture statistical relationships - because that's all there is in the input data. However, the kind of relationships it captures may still be arbitrarily complex (within the constraints of the model architecture).

We know it's more complex than just "empirical probability of word x given the n words before", because that would result in a Markov chain, and we know those don't generate the kind of output we are seeing.
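
(To make that concrete: a naive chain of that kind is literally just a lookup table of counts. Something like this toy Python sketch -- made-up names, nothing to do with any actual LLM internals:)

    import random
    from collections import defaultdict, Counter

    # Naive order-n Markov chain: empirical counts of "next word given the
    # previous n words", looked up verbatim, with no generalization at all.
    def train_markov(tokens, n=2):
        counts = defaultdict(Counter)
        for i in range(len(tokens) - n):
            counts[tuple(tokens[i:i + n])][tokens[i + n]] += 1
        return counts

    def sample_next(counts, context):
        options = counts.get(tuple(context))
        if not options:                 # context never seen in training: stuck
            return None
        words, freqs = zip(*options.items())
        return random.choices(words, weights=freqs)[0]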

We also see that it's able to map descriptions of tasks to their execution, even for unseen tasks. E.g., I can tell it "Write a limerick that contains the names of all living US presidents and format it as a JSON array inside a Python script." The result will probably contain some dead US presidents or some Canadian prime ministers or whatever, and the text may be a haiku and not a limerick - but the output will usually be a Python script with a JSON array with a poem with some names in it.

I don't see how that could work without a more abstract representation of the concepts "presidents' names", "poem", "JSON" and "Python", so it can combine them meaningfully into a single response.


Yes, and very good points. I really appreciate your response. Though with respect to the Markov chain side of things, I think you may be missing that a neural network is still fundamentally distilling a Markov chain into its representations -- that is exactly what it is predicting (the transition matrix over n tokens). Presumably though, the generalization happens because the raw n-context chain (for example, a 2048-token context chain) is completely infeasible to compress into the network directly, so the network takes shortcuts where it is least punished, which I believe invariably trims the combinatorial edge cases where certain combinations don't show up in the data.

So it is estimating the Markov chain, just in a way that is compressible and according to the inductive biases as we define them (i.e. what we nearly force the network towards with our architectural and other decisions).
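
As a toy illustration of what "estimating the Markov chain parametrically" could look like (assuming PyTorch; the layer sizes here are arbitrary, not anything from a real model): instead of storing one row of the transition matrix per observed context, a shared set of weights maps any context to a distribution over the next token, and the compression comes from that weight sharing.

    import torch
    import torch.nn as nn

    vocab_size, n_ctx, d = 200, 8, 32

    # Toy next-token model: any length-n_ctx context -> distribution over the vocab.
    model = nn.Sequential(
        nn.Embedding(vocab_size, d),       # token ids -> vectors
        nn.Flatten(),                      # concatenate the n_ctx vectors
        nn.Linear(n_ctx * d, vocab_size),  # logits over the next token
    )

    context = torch.randint(0, vocab_size, (1, n_ctx))   # a dummy context
    probs = torch.softmax(model(context), dim=-1)         # one estimated "transition row"
    print(probs.shape)  # torch.Size([1, 200])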


True - I think the key is "empirical probability", so just keying on the combinations of all the tokens in your context window.

(Also I think the "last n tokens" term is a bit misleading: ChatGPT seems to have an n of "approximately 4000 tokens or 3000 words" [1, 2] which would amount to ~6 pages of text [3].

I've seen very few conversations even approaching that length - and in the ones that did, there were reports of it breaking down, e.g. having continuity errors in long RP sessions, etc. So I think for practical purposes we can say it's "probability of the next token given all the previous tokens".)

Building a naive Markov chain with such a large context is infeasible even before compression; you couldn't even gather enough training data: If you have a vocabulary of 200 words, that would give you 200^4000 [4] possible contexts, each needing its own conditional distribution. You'd have to gather enough data to get a useful empirical probability for each of them. (Even if we shortened the context to the length of realistic prompts, like 50 words or so, 200^50 is still a number too big to have a name.)
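
(Back-of-the-envelope, just to show the scale of those numbers in plain Python:)

    import math

    # Digits in the number of possible contexts for a naive lookup table.
    vocab = 200
    for ctx_len in (50, 4000):
        digits = int(ctx_len * math.log10(vocab)) + 1
        print(f"{vocab}^{ctx_len} has about {digits} digits")
    # 200^50 -> 116 digits; 200^4000 -> 9205 digits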

Which is why, from what I've gathered, the big innovation in transformer networks is that they don't look at every token in their context window equally, but have a number of "meta models" which select which tokens to look at - the "attention head" mechanism.

And I think that's where the intuition of Markov chains breaks down a bit. Those meta models make the selection based on some internal representation of the tokens, but at least I haven't really understood yet what that internal representation contains.
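
For what it's worth, my rough mental model of that mechanism is single-head scaled dot-product attention, something like this numpy sketch (shapes and names are just for illustration, not the actual internals of any deployed model):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    # One attention head: each position builds a query, compares it against the
    # keys of every token in the window, and takes a weighted average of their
    # values -- i.e. the "selection of which tokens to look at".
    def attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # query/key similarity
        weights = softmax(scores, axis=-1)        # per-token weights over the window
        return weights @ V                        # weighted mix of value vectors

    seq_len, d = 5, 16
    rng = np.random.default_rng(0)
    X = rng.normal(size=(seq_len, d))             # internal token representations
    out = attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
    print(out.shape)  # (5, 16)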

[1] https://www.reddit.com/r/deeplearning/comments/zk5esp/chatgp...

[2] https://help.openai.com/en/articles/6787051-does-chatgpt-rem...

[3] https://capitalizemytitle.com/page-count/1500-words/

[4] this number: https://www.calculator.net/big-number-calculator.html?cx=200...


Oh yeah, good points. And citations! Thanks so much, really appreciated.

I'm currently doing work on this and hopefully it will yield some fruit -- at least, the work relating to the internals of what's happening inside of Transformers. Just from slowly absorbing the research over the years, I have a few gut hypotheses about what is happening. I think a good chunk of it is surprisingly standard, just hidden due to the complexity of millions of parameters slinging information hither and yon.

Thanks again for putting all of the thought and effort into your post, I really appreciate it. This is something I love about being here in this particular place! :D


That sounds extremely interesting! I'm still trying to understand the basic transformer architecture so far, but I think those kinds of insights are exactly what we'll need if we don't want the whole field to degrade to alchemy with no one understanding what is going on.

Do you have a blog?


I do not (unfortunately? Maybe I need one?), though if you want you can follow me on GitHub at https://github.com/tysam-code. I try to update/post to my projects regularly, as I'm able to. It alternates between different ones, though LLMs are the main focus right now, if I can get something to an (appropriately and healthfully) publishable state. :3 :)

I try to be skeptical about certain possibilities within the field, but I do feel bullish about us being able to at least tease out some of the structure of what's happening due to how some properties of transformers work. At least, I think it'll be easier than figuring out how certain brain informational structures work (which has happened to some tiny degree, and will be I think even cooler in the future)! :) XD :DDDD :)


Tried this out and got this result:

            import json
            presidents = ["Biden", "Obama", "Bush", "Clinton", "Carter"]
            limerick = "There once were presidents five, " \
            "Biden, Obama, Bush, Clinton, and Carter alive. " \
            "They served our country with pride, " \
            "And kept our democracy alive. " \
            "May they continue to thrive!"
            print(json.dumps({"limerick": limerick, "presidents": presidents}))



