This article describes much of what many YouTubers have explained in their videos over the last few weeks.
While I understand the core concept of 'just' picking the next word based on statistics, it doesn't really explain how ChatGPT can pull off the stuff it does. E.g. when one asks it to return a poem where each word starts with the same letter / the next letter of the alphabet / the ending of the previous word, it obviously doesn't 'just' pick the next word based on pure statistics.
Same with more complex stuff like returning an explanation of 'x' in the style of 'y'.
And so on, and so on... Does anyone know of a more complete explanation of the inner workings of ChatGPT for a layperson?
You say it "obviously doesn't". These language models do indeed work by computing a distribution over all possible next words given the previous words, using transformers, and it seems that enough training data and compute gives you the results we see. Everyone I know is completely surprised that it works so well just by adding more data and compute (and probably lots of training tricks).
> using enough training data and compute gives you the results we see.
I think this is key. We don't have a good intuition for the truly staggering amount of data and compute that goes into this.
An example that we have come to terms with is weather forecasting: weather models have distinctly super-human capabilities when it comes to forecasting the weather. This is due to the amount of compute and data they have available, neither of which a human mind can come close to matching.
By now, everyone has heard the explanation that ChatGPT is a decoder-only transformer that responds to prompts by iteratively predicting the first word in the response, then the second word, and so on...
What we need now is an explanation of all the further stuff added to that basic capability.
The pre-trained model is stage 1 - it has seen everything, but it is wild. If you ask it "What is the capital of US?" it will reply "What is the capital of Canada?"...
Stage 2 is task-solving practice. We use 1000-2000 supervised datasets, formatted as prompt-input-output texts. They could be anything: translation, sentiment classification, question answering, etc. We also include prompt-code pairs. This teaches the model to solve tasks (it elicits this ability from the model). Apparently training on code is essential; without it the model doesn't develop reasoning abilities.
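For illustration, here is a hypothetical sketch of what such prompt-input-output examples might look like; the field names and examples are made up, not taken from any real dataset:

```python
# Made-up examples of stage-2 "prompt-input-output" instruction-tuning data.
examples = [
    {
        "prompt": "Translate the following sentence into French.",
        "input": "The weather is nice today.",
        "output": "Il fait beau aujourd'hui.",
    },
    {
        "prompt": "Classify the sentiment of this review as positive or negative.",
        "input": "The battery died after two days.",
        "output": "negative",
    },
    {
        "prompt": "Write a Python function that reverses a string.",
        "input": "",
        "output": "def reverse(s):\n    return s[::-1]",
    },
]

# During fine-tuning, each example is flattened into a single text sequence and the
# model is trained to predict the output portion token by token.
```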
But the model is still not well behaved: it doesn't answer in a way we like. So in stage 3 it goes through human preference tuning (RLHF), which is based on human preferences between pairs of LLM answers. After RLHF it learns to behave and to abstain from certain topics.
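As a rough illustration (the prompt and answers here are invented, not from OpenAI's actual data), one stage-3 preference example might look like this; a reward model is trained on many such comparisons, and the LLM is then tuned to score well against it:

```python
# Hypothetical stage-3 (RLHF) preference datum: one prompt, two model answers,
# and a human judgement of which answer is preferred.
preference_example = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "answer_a": "Photosynthesis is how plants use sunlight, water and air to make their own food.",
    "answer_b": "Photosynthesis is the conversion of photons into ATP via the Calvin cycle.",
    "preferred": "answer_a",  # the human judged A clearer for the intended audience
}
```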
You need stage 1 for general knowledge, stage 2 for learning to execute prompts, stage 3 to make it behave.
Regarding Stage 2. Are you saying that ChatGPT's facility to recognize and process commands is derived entirely from training on supervised datasets and not hand-crafted logic? Can you point me to any reading on this?
This series of videos explains how the core mechanism works. A few details are omitted, like how to get good initial token embeddings or how exactly positional encoding works.
The high-level overview is that the main insight of transformers is figuring out how to partition a huge basic neural network, hardcode some intuitively beneficial operations into the structure of the network itself, and draw some connections between (not very) distant layers so that the gradient doesn't get eaten up too soon during backpropagation.
All of this makes the whole thing parallelizable, so you can train it on a huge amount of data even though it has enough neurons altogether to infer pretty complex associations.
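A minimal toy sketch of those shortcut connections (a residual / skip connection around a block), assuming nothing about the real ChatGPT internals:

```python
import numpy as np

def residual_block(x, sublayer):
    # Output = input + transformation(input). The skip path gives the gradient a
    # direct route backwards, so it doesn't get "eaten up" by the sublayer.
    return x + sublayer(x)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
sublayer = lambda x: np.tanh(x @ W)   # stand-in for an attention or feed-forward sublayer
x = np.ones(8)
print(residual_block(x, sublayer))    # same shape as the input
```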
" small models do poorly on all of these tasks – even the 13 billion parameter model (the
second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all
other operations less than 10% of the time."
I think you need to consider conditional statistics: "What are high-probability options for the next word, given that the text I'm working on starts with the words 'please rhyme', and that the word 10 words ago was 'sun' and the word 20 words ago was 'fun'?" How it knows which parts of the text are relevant to condition on is the attention mechanism, which is like "what is the probability this word is important to how to finish this sentence?" Both of these can be extracted from large enough example data.
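A minimal numpy sketch of that attention idea (scaled dot-product attention with toy shapes, not anything from the real ChatGPT implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores[i, j] ~ "how relevant is word j when producing word i"
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # each row is a probability distribution over positions
    return weights @ V                   # weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d = 5, 8                        # 5 toy tokens, 8-dimensional vectors
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
print(attention(Q, K, V).shape)          # (5, 8)
```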
The architecture is understood, but the specifics about how they calculate which words are high-probability are mostly a mystery. Here’s a good blog post though:
> While I understand the core concept of 'just' picking the next word based on statistics
That's just the mechanism it uses to generate output - which is not the same as being the way it internally chooses what to say.
I think it's unfortunate that the name LLM (large language model) has stuck for these predictive models, since IMO it's very misleading. The name stuck because this line of research was born out of much simpler systems that really were just language models. The "predict next word" concept is also misleading, especially when connected to the false notion that these are just language models. What is true is that:
1) These models are trained by being given feedback on their "predict next word" performance
2) These models generate output a word at a time, and those words are a selection from a variety of predictions about how their input might be continued, in light of the material they saw during training and what they have learnt from it
What is NOT true is that these models are operating just at the level of language and are generating output purely based on language level statistics. As Ilya Sutskever (one of the OpenAI founders) has said, these models have used their training data and predict-next-word feedback (a horrible way to have to learn!!!) to build an internal "world model" of the processes generating the data they are operating on. "world model" is jargon, but what it essentially means is that these models have gained some level of understanding of how the world (seen through the lens of language) operates.
So, what really appears to be happening (although I don't think anyone knows in any level of detail) when these models are fed a prompt and tasked with providing a continuation (i.e. a "reply" in the context of ChatGPT), is that the input is consumed and, per the internal "world model", a high-level internal representation of the input is built - starting at the level of language presumably, but including a model of the entities being discussed, relations between them, related knowledge that is recalled, etc, etc - and this internal model of what is being discussed persists (and is updated) throughout the conversation and as it is generating output... The output is generated word by word, but not as a statistical continuation of the prompt itself; rather, it is a statistically likely continuation of texts it saw during training when it had similar internal states (i.e. a similar model of what was being discussed).
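Mechanically, the word-at-a-time generation can be pictured with a sketch like the one below; `toy_model` is an invented stand-in for any autoregressive model that returns a score per vocabulary item, not OpenAI's actual API:

```python
import numpy as np

def sample_continuation(model, context, n_tokens, temperature=1.0, rng=None):
    """Generate n_tokens by repeatedly sampling from the model's next-token distribution."""
    rng = rng or np.random.default_rng(0)
    tokens = list(context)
    for _ in range(n_tokens):
        logits = np.asarray(model(tokens), dtype=float)       # one score per vocabulary item
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()                                  # probabilities over the next token
        tokens.append(int(rng.choice(len(probs), p=probs)))   # pick one, weighted by probability
    return tokens

# Toy stand-in "model": mildly prefers token (last_token + 1) in a 10-token vocabulary.
toy_model = lambda tokens: [2.0 if i == (tokens[-1] + 1) % 10 else 0.0 for i in range(10)]
print(sample_continuation(toy_model, context=[0], n_tokens=5))
```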
You may have heard of "think step by step" or "chain of thought" prompting, which are ways to enable these models to perform better on complex tasks where the distance from problem statement (question) to solution (answer) is too great for the model to cover in a "single step". What is going on here is that these models, unlike us, are not (yet) designed to iteratively work on a problem and explore it, and instead are limited to a fixed number of processing steps (corresponding to the number of internal levels - repeated transformer blocks - between input and output). For simple problems where a good response can be conceived/generated within that limited number of steps, the models work well; otherwise you can tell them to "think step by step", which allows the model to overcome this limitation by taking multiple baby steps and evolving its internal model of the dialogue.
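To make that concrete, a hypothetical example of the two prompt styles (the question and exact wording are illustrative only):

```python
# Same question, with and without the instruction that elicits intermediate reasoning.
plain_prompt = (
    "Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?\nA:"
)
cot_prompt = plain_prompt + " Let's think step by step."

# With the second form, the model tends to write out intermediate steps
# (set up the equation, solve for the ball's price) before stating the final answer,
# effectively giving itself extra processing steps.
```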
Most of what I see written about ChatGPT, or these predictive models in general, seems to be garbage. Everyone has an opinion and wants to express it regardless of whether they have any knowledge, or even experience, with the models themselves. I was a bit shocked to see an interview with Karl Friston (a highly intelligent theoretical neuroscientist) the other day, happily pontificating about ChatGPT and offering opinions about it while admitting that he had never even used it!
The unfortunate "language model" name, and the associated (false) assumption that "predict next word" means these models can't have learnt anything more than language statistics, seem largely to blame.
No - I'm not sure anyone outside of OpenAI knows, and maybe they only have a rough understanding themselves.
We don't even know the exact architecture of GPT-4 - is it just a Transformer, or does it have more to it? The head of OpenAI, Sam Altman, was interviewed by Lex Fridman yesterday (you can find it on YouTube) and he mentioned that, paraphrasing, "OpenAI is all about performance of the model, even if that involves hacks ...".
While Sutskever describes GPT-4 as having learnt this "world model", Sam Altman instead describes it as having learnt a non-specific "something" from the training data. It seems they may still be trying to figure out much of how it is working themselves, although Altman also said that "it took a lot of understanding to build GPT-4", so apparently it's more than just a scaling up of earlier models.
Note too that my description of its internal state being maintained/updated through the conversation is likely (without knowing the exact architecture) to be more functional than literal, since if it were just a plain Transformer then its internal state is going to be calculated from scratch for each word it is asked to generate. Evidently, though, there is a great deal of continuity between the internal state when the input is, say, prompt words 1-100 and when it is words 2-101 - so (assuming they haven't added any architectural modification to remember anything of prior state), the internal state isn't really "updated" as such, but rather regenerated into an updated form.
Lots of questions, not so many answers, unfortunately!
Simply because I think that it's rather statistically unlikely that, just because my first word started with "A", the next word should start with "B", "C" ...
If the first few words are "Please make each successive line start with the next letter of the alphabet", that does make it "statistically" unlikely (it reduces the probability) that the first line will start with anything other than A. Then, the complete text composed of the initial instructions + the line starting with A makes it unlikely that the next output line is going to start with anything other than B.
The input-so-far influences the probability of the next word in complex ways. Due to the number of parameters in the model, this dependency can be highly nontrivial, on par with the complexity of a computer program. Just like a computer program can trivially generate an A line before switching its internal state so that the next generated line is a B line, so can the transformer, since it is essentially emulating an extremely complex function.
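A toy illustration of that conditioning effect (this is not how a transformer actually implements it; the vocabulary and probabilities are invented):

```python
import numpy as np

vocab = ["apple", "banana", "cherry", "dog", "egg"]
base_probs = np.array([0.3, 0.2, 0.2, 0.2, 0.1])   # "unconditional" next-word preferences

def condition_on_first_letter(probs, letter):
    # Words violating the "must start with <letter>" instruction get zero probability;
    # the remaining probability mass is renormalised.
    mask = np.array([w.startswith(letter) for w in vocab], dtype=float)
    conditioned = probs * mask
    return conditioned / conditioned.sum()

print(condition_on_first_letter(base_probs, "b"))   # all probability mass moves to "banana"
```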
My understanding is, if you have 175 billion parameters of 16-bit values that all effectively transact (e.g., multiply) together, the realm of possibility is (2^16)^175 billion = 2^(2.8 trillion) distinct weight configurations; really rather a large number of encodable potentials.
The length and number of probability chains that can be discovered in such a space is therefore sufficient for the level of complexity being analysed and effectively "encoded" from the source text data. Which is why it works.
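A back-of-the-envelope check of that size claim (assuming 16-bit weights and a GPT-3-scale parameter count, which may not match ChatGPT's actual precision or size):

```python
import math

n_params = 175e9        # GPT-3-scale parameter count (assumption)
bits_per_param = 16     # assumed 16-bit weights
# Each parameter can take 2**16 values, so the configuration space has
# (2**16) ** n_params = 2 ** (16 * n_params) points.
log10_configs = n_params * bits_per_param * math.log10(2)
print(f"~10^{log10_configs:.3g} possible weight configurations")   # ~10^8.43e+11
```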
Obviously, as the weights become fixed on particular values by the end of training, not all of those possibilities are required. But they are all in some sense "available" during training, and required and so utilised in that sense.
Think of it as expanding the corpus as water molecules into a large cloud of possible complexity, analysing to find the channels of condensation that will form drops, then compress it by encoding only the final droplet locations.