This article describes much of what many YouTubers have explained in their videos over the last few weeks.
While I understand the core concept of 'just' picking the next word based on statistics, it doesn't really explain how ChatGPT can pull off the stuff it does. E.g. when one asks it to return a poem where each word starts with the same letter / the next letter of the alphabet / the ending of the previous word, it obviously doesn't 'just' pick the next word based on pure statistics.
Same with more complex stuff like returning an explanation of 'x' in the style of 'y'.
And so on, and so on... Does anyone know of a more complete explanation of the inner workings of ChatGPT for a layperson?
You say it "obviously doesn't". These language models do indeed work by computing a distribution over all possible next words given the previous words, using transformers, and it seems that enough training data and compute gives you the results we see. Everyone I know is completely surprised that it works so well just by adding more data and compute (and probably lots of training tricks).
> using enough training data and compute gives you the results we see.
I think this is key. We don't have a good intuition for the truly staggering amount of data and compute that goes into this.
An example that we have come to terms with is weather forecasting: weather models have distinctly super-human capabilities when it comes to forecasting the weather. This is due to the amount of compute and data they have available, neither of which a human mind can come close to matching.
By now, everyone has heard the explanation that ChatGPT is a decoder-only transformer that responds to prompts by iteratively predicting the first word in the response, then the second word, and so on...
What we need now is an explanation of all the further stuff added to that basic capability.
The pre-trained model is stage 1 - it has seen everything, but it is wild. If you ask it "What is the capital of US?" it will reply "What is the capital of Canada?"...
Stage 2 is task-solving practice. We use 1000-2000 supervised datasets, formatted as prompt-input-output texts. They could be anything: translation, sentiment classification, question answering, etc. We also include prompt-code pairs. This teaches the model to solve tasks (it elicits this ability from the model). Apparently training on code is essential; without it the model doesn't develop reasoning abilities.
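For illustration, here is a hypothetical sketch of what such prompt-input-output examples might look like; the field names and examples are made up, not taken from any real dataset:

```python
# Made-up examples of stage-2 "prompt-input-output" instruction-tuning data.
examples = [
    {
        "prompt": "Translate the following sentence into French.",
        "input": "The weather is nice today.",
        "output": "Il fait beau aujourd'hui.",
    },
    {
        "prompt": "Classify the sentiment of this review as positive or negative.",
        "input": "The battery died after two days.",
        "output": "negative",
    },
    {
        "prompt": "Write a Python function that reverses a string.",
        "input": "",
        "output": "def reverse(s):\n    return s[::-1]",
    },
]

# During fine-tuning, each example is flattened into a single text sequence and the
# model is trained to predict the output portion token by token.
```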
But the model is still not well behaved: it doesn't answer in a way we like. So in stage 3 it goes through human preference tuning (RLHF), which is based on human preferences between pairs of LLM answers. After RLHF it learns to behave and to abstain from certain topics.
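As a rough illustration (the prompt and answers here are invented, not from OpenAI's actual data), one stage-3 preference example might look like this; a reward model is trained on many such comparisons, and the LLM is then tuned to score well against it:

```python
# Hypothetical stage-3 (RLHF) preference datum: one prompt, two model answers,
# and a human judgement of which answer is preferred.
preference_example = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "answer_a": "Photosynthesis is how plants use sunlight, water and air to make their own food.",
    "answer_b": "Photosynthesis is the conversion of photons into ATP via the Calvin cycle.",
    "preferred": "answer_a",  # the human judged A clearer for the intended audience
}
```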
You need stage 1 for general knowledge, stage 2 for learning to execute prompts, stage 3 to make it behave.
Regarding Stage 2. Are you saying that ChatGPT's facility to recognize and process commands is derived entirely from training on supervised datasets and not hand-crafted logic? Can you point me to any reading on this?
This series of videos explains how the core mechanism works. A few details are omitted, like how to get good initial token embeddings or how exactly positional encoding works.
The high-level overview is that the main insight of transformers is figuring out how to partition a huge basic neural network, hardcode some intuitively beneficial operations into the structure of the network itself, and draw some connections between (not very) distant layers so that the gradient doesn't get eaten up too soon during backpropagation.
All of this makes the whole thing parallelizable, so you can train it on a huge amount of data even though it has enough neurons altogether to infer pretty complex associations.
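A minimal toy sketch of those shortcut connections (a residual / skip connection around a block), assuming nothing about the real ChatGPT internals:

```python
import numpy as np

def residual_block(x, sublayer):
    # Output = input + transformation(input). The skip path gives the gradient a
    # direct route backwards, so it doesn't get "eaten up" by the sublayer.
    return x + sublayer(x)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
sublayer = lambda x: np.tanh(x @ W)   # stand-in for an attention or feed-forward sublayer
x = np.ones(8)
print(residual_block(x, sublayer))    # same shape as the input
```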
" small models do poorly on all of these tasks – even the 13 billion parameter model (the
second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all
other operations less than 10% of the time."
I think you need to consider conditional statistics: "What are high-probability options for the next word, given that the text I'm working on starts with the words 'please rhyme', and that the word 10 words ago was 'sun' and the word 20 words ago was 'fun'?" How it knows which parts of the text are relevant to condition on is the attention mechanism, which is like "what is the probability this word is important to how to finish this sentence?" Both of these can be extracted from large enough example data.
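A minimal numpy sketch of that attention idea (scaled dot-product attention with toy shapes, not anything from the real ChatGPT implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores[i, j] ~ "how relevant is word j when producing word i"
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # each row is a probability distribution over positions
    return weights @ V                   # weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d = 5, 8                        # 5 toy tokens, 8-dimensional vectors
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
print(attention(Q, K, V).shape)          # (5, 8)
```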
The architecture is understood, but the specifics about how they calculate which words are high-probability are mostly a mystery. Here’s a good blog post though:
> While I understand the core concept of 'just' picking the next word based on statistics
That's just the mechanism it uses to generate output - which is not the same as being the way it internally chooses what to say.
I think it's unfortunate that the name LLM (large language model) has stuck for these predictive models, since IMO it's very misleading. The name stuck because this line of research was born out of much simpler systems that really were just language models. The "predict next word" concept is also misleading, especially when connected to the false notion that these are just language models. What is true is that:
1) These models are trained by being given feedback on their "predict next word" performance
2) These models generate output a word at a time, and those words are a selection from a variety of predictions about how their input might be continued, in light of the material they saw during training and what they have learnt from it
What is NOT true is that these models are operating just at the level of language and are generating output purely based on language level statistics. As Ilya Sutskever (one of the OpenAI founders) has said, these models have used their training data and predict-next-word feedback (a horrible way to have to learn!!!) to build an internal "world model" of the processes generating the data they are operating on. "world model" is jargon, but what it essentially means is that these models have gained some level of understanding of how the world (seen through the lens of language) operates.
So, what really appears to be happening (although I don't think anyone knows in any level of detail) when these models are fed a prompt and tasked with providing a continuation (i.e. a "reply" in the context of ChatGPT), is that the input is consumed and, per the internal "world model", a high-level internal representation of the input is built - starting at the level of language presumably, but including a model of the entities being discussed, relations between them, related knowledge that is recalled, etc, etc - and this internal model of what is being discussed persists (and is updated) throughout the conversation and as it is generating output... The output is generated word by word, but not as a statistical continuation of the prompt itself; rather, it is a statistically likely continuation of texts it saw during training when it had similar internal states (i.e. a similar model of what was being discussed).
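Mechanically, the word-at-a-time generation can be pictured with a sketch like the one below; `toy_model` is an invented stand-in for any autoregressive model that returns a score per vocabulary item, not OpenAI's actual API:

```python
import numpy as np

def sample_continuation(model, context, n_tokens, temperature=1.0, rng=None):
    """Generate n_tokens by repeatedly sampling from the model's next-token distribution."""
    rng = rng or np.random.default_rng(0)
    tokens = list(context)
    for _ in range(n_tokens):
        logits = np.asarray(model(tokens), dtype=float)       # one score per vocabulary item
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()                                  # probabilities over the next token
        tokens.append(int(rng.choice(len(probs), p=probs)))   # pick one, weighted by probability
    return tokens

# Toy stand-in "model": mildly prefers token (last_token + 1) in a 10-token vocabulary.
toy_model = lambda tokens: [2.0 if i == (tokens[-1] + 1) % 10 else 0.0 for i in range(10)]
print(sample_continuation(toy_model, context=[0], n_tokens=5))
```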
You may have heard of "think step by step" or "chain of thought" prompting, which are ways to enable these models to perform better on complex tasks where the distance from problem statement (question) to solution (answer) is too great for the model to cover in a "single step". What is going on here is that these models, unlike us, are not (yet) designed to iteratively work on a problem and explore it, and instead are limited to a fixed number of processing steps (corresponding to the number of internal levels - repeated transformer blocks - between input and output). For simple problems where a good response can be conceived/generated within that limited number of steps, the models work well; otherwise you can tell them to "think step by step", which allows the model to overcome this limitation by taking multiple baby steps and evolving its internal model of the dialogue.
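To make that concrete, a hypothetical example of the two prompt styles (the question and exact wording are illustrative only):

```python
# Same question, with and without the instruction that elicits intermediate reasoning.
plain_prompt = (
    "Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?\nA:"
)
cot_prompt = plain_prompt + " Let's think step by step."

# With the second form, the model tends to write out intermediate steps
# (set up the equation, solve for the ball's price) before stating the final answer,
# effectively giving itself extra processing steps.
```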
Most of what I see written about ChatGPT, or these predictive models in general, seems to be garbage. Everyone has an opinion and wants to express it regardless of whether they have any knowledge, or even experience, with the models themselves. I was a bit shocked to see an interview with Karl Friston (a highly intelligent theoretical neuroscientist) the other day, happily pontificating about ChatGPT and offering opinions about it while admitting that he had never even used it!
The unfortunate "language model" name, and the associated (false) assumption that "predict next word" means these models can't have learnt anything more than language statistics, seem largely to blame.
No - I'm not sure anyone outside of OpenAI knows, and maybe they only have a rough understanding themselves.
We don't even know the exact architecture of GPT-4 - is it just a Transformer, or does it have more to it? The head of OpenAI, Sam Altman, was interviewed by Lex Fridman yesterday (you can find it on YouTube) and he mentioned that, paraphrasing, "OpenAI is all about performance of the model, even if that involves hacks ...".
While Sutskever describes GPT-4 as having learnt this "world model", Sam Altman instead describes it as having learnt a non-specific "something" from the training data. It seems they may still be trying to figure out much of how it is working themselves, although Altman also said that "it took a lot of understanding to build GPT-4", so apparently it's more than just a scaling up of earlier models.
Note too that my description of its internal state being maintained/updated through the conversation is likely (without knowing the exact architecture) to be more functional than literal, since if it were just a plain Transformer then its internal state is going to be calculated from scratch for each word it is asked to generate. Evidently, though, there is a great deal of continuity between the internal state when the input is, say, prompt words 1-100 and when it is words 2-101 - so (assuming they haven't added any architectural modification to remember anything of prior state), the internal state isn't really "updated" as such, but rather regenerated into an updated form.
Lots of questions, not so many answers, unfortunately!
Simply because I think that it's rather statistically unlikely that, just because my first word started with "A", the next word should start with "B", "C" ...
If the first few words are "Please make each successive line start with the next letter of the alphabet", that does make it "statistically" unlikely (it reduces the probability) that the first line will start with anything other than A. Then, the complete text composed of the initial instructions + the line starting with A makes it unlikely that the next output line is going to start with anything other than B.
The input-so-far influences the probability of the next word in complex ways. Due to the number of parameters in the model, this dependency can be highly nontrivial, on par with the complexity of a computer program. Just like a computer program can trivially generate an A line before switching its internal state so that the next generated line is a B line, so can the transformer, since it is essentially emulating an extremely complex function.
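A toy illustration of that conditioning effect (this is not how a transformer actually implements it; the vocabulary and probabilities are invented):

```python
import numpy as np

vocab = ["apple", "banana", "cherry", "dog", "egg"]
base_probs = np.array([0.3, 0.2, 0.2, 0.2, 0.1])   # "unconditional" next-word preferences

def condition_on_first_letter(probs, letter):
    # Words violating the "must start with <letter>" instruction get zero probability;
    # the remaining probability mass is renormalised.
    mask = np.array([w.startswith(letter) for w in vocab], dtype=float)
    conditioned = probs * mask
    return conditioned / conditioned.sum()

print(condition_on_first_letter(base_probs, "b"))   # all probability mass moves to "banana"
```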
My understanding is, if you have 175 billion parameters of 16-bit values that all effectively transact (e.g., multiply) together, the realm of possibility is (2^16)^175 billion = 2^(2.8 trillion) distinct weight configurations; really rather a large number of encodable potentials.
The length and number of probability chains that can be discovered in such a space is therefore sufficient for the level of complexity being analysed and effectively "encoded" from the source text data. Which is why it works.
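A back-of-the-envelope check of that size claim (assuming 16-bit weights and a GPT-3-scale parameter count, which may not match ChatGPT's actual precision or size):

```python
import math

n_params = 175e9        # GPT-3-scale parameter count (assumption)
bits_per_param = 16     # assumed 16-bit weights
# Each parameter can take 2**16 values, so the configuration space has
# (2**16) ** n_params = 2 ** (16 * n_params) points.
log10_configs = n_params * bits_per_param * math.log10(2)
print(f"~10^{log10_configs:.3g} possible weight configurations")   # ~10^8.43e+11
```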
Obviously, as the weights become fixed on particular values by the end of training, not all of those possibilities are required. But they are all in some sense "available" during training, and required and so utilised in that sense.
Think of it as expanding the corpus as water molecules into a large cloud of possible complexity, analysing to find the channels of condensation that will form drops, then compress it by encoding only the final droplet locations.