I'm one of the authors of this post -- Johno & I found it really interesting looking into this curious issue of rapid memorization from LLMs. I've been working with neural nets for 30 years, and fine-tuning language models since 2017, and this behavior is most surprising to me! Other folks have seen it in LLMs too, although I haven't seen an analysis of this kind before (although we might have missed something).
Let me know if you have any questions or thoughts.
In the PaLM-E paper (https://palm-e.github.io/), when they unfreeze the LLM and train it on new image data only, there is, as expected, a lot of catastrophic forgetting (CF) on NLP tasks. But very interestingly, the effect diminishes greatly with the scale of the LLM prior to training.
From an average -87.3% performance drop on the 12B model, to -61.6% on the 84B model, to just -3.9% on the 562B model. It felt like we were just shy of an insight breakthrough here.
Is avoiding CF potentially just a matter of sheer scale?
I think our experiments actually don't show catastrophic forgetting! The accuracy does not decrease as loss gets worse -- it's simply getting over-confident.
So I'm not even sure we're showing any problem to solve here -- it might be more of an opportunity, in fact!
I have been training a natural intelligence model for 3 years now and she still doesn’t get nuance. Things are either good or bad in her book: nothing in between. My plan is to let her train with binary good/bad labels till the age of 5 and then start smoothing the labels after that. Wonder if that works for your AI.
Related trick: I found that training two Natural Intelligence (NI) models in parallel, and having them train each other for most of the time, leads to significant leaps in capabilities. Notably, when one NI picks up a skill, it often results in spontaneous transfer learning - the other NI picks that skill up very quickly, much faster than it would through direct training.
This scales well, too. There are facilities that provide services of co-hosting and cross-training up to ~two dozen NI models in a shared environment - in my experience, this provides similar training benefits to running multiple NIs on your own, at fraction of the cost.
(The facilities are exploiting some neat economies of scale. Talking to some employees, I learned that the transfer learning and co-activation are embarrassingly scalable: if you get two-three NIs to pick up a thing, all the rest immediately follow.)
This took a couple reads, but it’s funny. The bad news is that I’ve been training mine for 17 years and nuance is still something that needs more training.
In my mind I've built an 'emotional engine' to add nuance to models' understanding: take something like Plutchik's wheel of emotions and create a high-quality multi-modal dataset based on that structure. Given that our current technology takes inspiration from the brain, it would seem like having discrete models specialising in particular aspects of 'intelligence', organised into a mixture of experts, is an interesting area to explore, and perhaps a more accessible one, since smaller models require fewer resources.
I have code stubbed out for this in mitta.us. It has 9 states, based on the Plutchik wheel, with emojis for the states. States drive temp and a few other things and drop the state into prompts.
The accounts aren't wired up by default to the AI and I am refactoring the templating system right now, but you can definitely start storing and searching things.
Cross-entropy loss can start getting worse due to the model becoming less calibrated, even as the classification accuracy continues to improve. I first heard that here: https://arxiv.org/abs/1706.04599
Is this 'overconfidence' the leading explanation as to why LLMs continue to show qualitative improvement even after their test loss levels off?
I assume this means losing all the energy and compute invested for a model to know, perform, and infer on inputs it has already indexed(?) (What is the proper term here?)
But is this the premise: you lose all prior investment of resources to a... (I don't know the term for an AI archetype of knowledge)? (Btw, I love the embedded etymology of "knowledge".)
Suppose we have trained a model to perform a certain set of tasks. Later we would want to teach it a new task. Catastrophic forgetting means that teaching it a new task makes it unlearn some or all of its earlier tasks.
It occurs because training changes the weights of the model. The earlier set of weights was good for the previous tasks. The new set of weights is only good for the new task. Usually special care must be taken to overcome catastrophic forgetting.
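A toy PyTorch sketch of the effect (the tasks, sizes and numbers here are made up purely for illustration): train a small classifier on task A, then fine-tune it only on task B, and accuracy on task A collapses because the weights that were good for A get overwritten.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    def make_task(dim):
        # Task "dim": the label is the sign of that input dimension.
        x = torch.randn(512, 2)
        return x, (x[:, dim] > 0).long()

    def accuracy(x, y):
        return (model(x).argmax(1) == y).float().mean().item()

    def train(x, y, steps=200):
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    xa, ya = make_task(0)   # task A
    xb, yb = make_task(1)   # task B

    train(xa, ya)
    print("task A after training on A:", accuracy(xa, ya))   # close to 1.0
    train(xb, yb)   # fine-tune on B only; no A examples are shown again
    print("task A after training on B:", accuracy(xa, ya))   # drops toward chance
    print("task B after training on B:", accuracy(xb, yb))   # close to 1.0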
Yeah, this is essentially how fine-tuned models work. If you fine-tune Stable Diffusion to produce anime images, it might forget how to produce images in any other style. But it will become much better at anime images than the base model. If anime images are the art style you're after, this is a good trade. Same with fine-tuning LLMs for SQL or whatever.
Can it be taught "contextual matrices" whereby it builds a new layer of construct but preserves the other, then cross-learns between parameters or something? (Sorry for my poor lexicon, I'm wet-learning :-)
But imagine all LLMs in a macro view like a sponge entity
We wouldn't know how to construct those matrices because we don't know where in the layers what knowledge is represented. One thing that helps a little bit is freezing the lower layers, so at least the model won't forget its most fundamental knowledge.
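For example, a minimal PyTorch sketch of that freezing trick (the layer sizes are arbitrary stand-ins for a real pretrained network): turn off gradients for the lower block so fine-tuning can only change the later layers.

    import torch
    import torch.nn as nn

    # Stand-in for a pretrained network: a "lower" feature block and a head.
    lower = nn.Sequential(nn.Linear(128, 128), nn.ReLU(),
                          nn.Linear(128, 128), nn.ReLU())
    head = nn.Linear(128, 10)
    model = nn.Sequential(lower, head)

    # Freeze the lower block so fine-tuning can't overwrite its weights.
    for p in lower.parameters():
        p.requires_grad = False

    # Only the still-trainable parameters are handed to the optimizer.
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)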
Note that the only reason things are catastrophically forgotten is that the original examples are not shown again. If the model learns in a single shot, there might simply be no time to show both the old and the new examples. I don't think it would have a significant effect, or else we'd have noticed it a lot sooner (i.e. the training of these LLMs would get less effective from a certain point).
You could simulate this by selectively locking and unlocking 'banks' of weights from a larger model to keep the influence there during training and to avoid losing them. Sort of a selective write-protect.
<"where in the layers what knowledge is represented."
This seems like a ripe angle for evolving our understanding of AI's use in LLMs... can we throw AIs at AIs (is AI synonymous with LLM?)? Can we throw LLMs at LLMs and have them recursively learn from themselves? Or is it a Rat King. AI recognize AI in the GangPlane.
That's not how LLMs work. LLMs complete documents; they don't make statements about LLMs unless you explain to them how they should do it and give them all the information they need. If you could extract the information from an LLM well enough to supply that to an LLM with an explanation of how to summarize the behaviour of the LLM to a human, we would have already done that with a PhD student instead. A PhD student is a little bit slower than an LLM, but they require a lot less explanation.
In any case, looking at and understanding how a neural network encodes information is like gene editing. Perhaps you could isolate a gene in the human genome that achieves something interesting, like giving a child blue eyes. But even if you did that, there's a chance you break something else if you modify that gene and give the child health risks. Since all neurons in a deep neural network are interconnected, there is a butterfly effect that makes them inherently somewhat of a black box.
What is the base model? I think that was a big oversight to leave that out and attribute this to LLMs in general.
Although I am not a researcher, it is obvious to me that not all LLMs are the same architecture, and I think that even ones with similar architecture can evolve to functionally operate quite differently on the same inputs.
Yet most articles seem to refer to LLMs as if they were just one architecture and model.
Hi Jeremy, always a fan of your work! Just a technical note since it falls under my domain of expertise (astronomy) -- the example about MOND described here should actually have choice (E) as the correct answer!
As it happens I dug into this question in some detail a couple of weeks ago when analysing the dataset, including carefully reading the Wikipedia page which the question comes from. AFAICT both D and E are kinda correct, but E isn't quite right because MOND doesn't entirely "eliminate the observed missing baryonic mass", but rather just reduces it from a factor of 10 to 2.
Is that not correct? (Of course I fully accept your expertise in this matter and this is just my curiosity, not trying to tell you you're wrong!)
Fascinating! I dug into the Wikipedia article, which cites a Scholarpedia article; the LLM answer seems to originate from a reference to this sentence [1]:
> So, MOND reduces the discrepancy in clusters at these radii to only a factor of ∼2−3 (Sanders, 1999; Ettori, et al., 2019)
So I think you're right, and today I learned something! I also checked if Stacy McGaugh had weighed in on this particular subject, and it seemed like there is still an issue for clusters [2], although interestingly the issue isn't mentioned in his latest blog post that summarizes the strengths/weaknesses with MOND [3]. Anyway, thanks for humoring me for a bit.
I believe neither MOND nor Cold Dark Matter are theories exactly, so much as they are schemata for classes of theories. Both are struggling to produce a verified theory that accounts for all observations, and while the latter is much more widely regarded as likely being correct, MOND has not been conclusively falsified to everyone's satisfaction. I would guess that there are, at least in principle, MOND theories which work for galaxy clusters but have residual discrepancies when applied to galaxies.
If this is so, then a multi-choice question which conflates one particular MOND theory for MOND itself, and which depends on the specifics of that particular theory for selecting the 'correct' answer, is problematic: for one thing, it may make selecting the 'correct' answer more difficult for a student who has specific knowledge about the topic. This is just one of several problems with multi-choice questions, though, fortunately, it does not seem to have any bearing on the very interesting phenomenon you have discovered.
In terms of the actual article -- really nice finding. Or I guess, nice set of experiments to decipher what lots of LLM researchers have been finding!
I've noticed somewhat similar behavior while training graph neural networks to model physical systems, except that it takes way longer than a single epoch to get there. Of course, there's no pretraining involved with my GNNs, but the models do have very constrained representations, so once they start to figure out how to represent the physics at hand, the loss plummets dramatically.
Hey Jeremy, it seems like you could calculate exactly how much a model learns in a single step by calculating the loss for a batch a second time (with no_grad) after the loss is calculated the first time and gradients are updated. This seems like it could produce interesting outputs when graphing the difference of first and second losses at the batch or observation/question level.
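Something like this, roughly (a sketch only; the model, optimizer and loss function are placeholders for whatever is being fine-tuned):

    import torch

    def step_and_measure(model, loss_fn, opt, xb, yb):
        """One training step, returning (loss before step, loss after step) on the same batch."""
        model.train()
        opt.zero_grad()
        loss_before = loss_fn(model(xb), yb)
        loss_before.backward()
        opt.step()

        # Re-evaluate the very same batch after the update, without gradients.
        with torch.no_grad():
            loss_after = loss_fn(model(xb), yb)

        # The gap is roughly "how much was learned from this one batch".
        return loss_before.item(), loss_after.item()

    # Example usage inside a training loop:
    # for xb, yb in dataloader:
    #     before, after = step_and_measure(model, loss_fn, opt, xb, yb)
    #     log(before - after)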
Very cool. This came up in a huggingface transformers issue a while ago and we also determined memorization to be the likely reason. It's nice to see someone else reach the same conclusion.
I wonder if you could perform inference, highlight the weights that were most used during that inference, grab the hottest 20%, freeze the rest of the model, and perform backpropagation solely on those to allow for more of this sort of rapid memorization behavior closer to the end user.
Like online learning in a way. But you do it during inference time.
There’s no way the entire model actually needs to be touched for something like “sky color is:” and “blue”.
In fact I bet you could update like one or two neurons for certain concepts, and then transplant those neurons to another LLM to give it some idea of it. Like a literal brain transplant but for concepts.
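A very rough sketch of that idea (note the assumptions: gradient magnitude on a probe batch is used here as a stand-in for "weights most used during inference", and the masking is applied to gradients rather than literally freezing parameters):

    import torch

    def make_top_k_masks(model, xb, yb, loss_fn, frac=0.2):
        """Mask the top `frac` of each parameter tensor by gradient magnitude on a probe batch."""
        model.zero_grad()
        loss_fn(model(xb), yb).backward()
        masks = {}
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            g = p.grad.abs().flatten()
            k = max(1, int(frac * g.numel()))
            thresh = g.topk(k).values.min()
            masks[name] = (p.grad.abs() >= thresh).float()
        model.zero_grad()
        return masks

    def masked_step(model, opt, loss_fn, xb, yb, masks):
        """Backprop as usual, but zero the gradients outside the 'hot' 20%."""
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks and p.grad is not None:
                    p.grad.mul_(masks[name])
        opt.step()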
I am not sure if the origin of polysemanticity is fully understood.
From a physics perspective it’s entropy:
There are just more local minima that have neurons code multiple things.
I suspect that dropout and similar tricks also play a role in this:
Removing connections during training means that pathways need to be redundant somewhat.
That's super interesting, I wonder if dropout applied in a certain pattern of segregation could induce more neuron dependence, instead of less. Or do some anti-dropout where a dedicated portion of the model is used. Not sure why you'd want that, but it might be interesting to explore.
Essentially it does processing by GRUs?
So it should be able to learn quickly. I gave it an Angular 12 chapter and it could fix some syntax bugs. Maybe it could learn by prompting, and by techniques like those used in image generators, where people can add custom specializations over the network (e.g. become excellent in Angular 16 and 18). When is the knowledge cutoff going away for rapidly updating languages, or areas such as Blender 3D, where APIs change rapidly?
Do people really use the phrase “over confident” in this way? It is very misleading.
What is happening is called “over fitting”.
Think of data as dots. A model that generalizes well will create as simple of a function as possible that fits the training data points pretty well.
But keep training and parameters will often get very large, creating huge up and down swings in the function curve, far outside the actual data values, in order to pass through the training data points exactly.
So it’s technically a better fit to the training data, but it is now a crazy function, often producing extreme outputs on new data. Practically a worst case lack of generalization.
Thus, “over fitting”.
And “over fitting” isn’t the same as “memorization”. Large models can memorize small datasets without over fitting. They have so many parameters, it takes few changes to fit the training data. At which time, learning stops at an otherwise random function, and generalization is never achieved.
That case is called “underdetermined”.
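Here's a tiny numpy sketch of the "dots" picture described above (the degrees, data and noise level are made up for illustration): a high-degree polynomial passes through every noisy training point but swings wildly between them, while a low-degree fit stays close to the underlying trend.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)  # noisy "dots"

    # Low-degree fit: smooth, approximates the trend.
    simple = np.polynomial.Polynomial.fit(x, y, deg=3)
    # High-degree fit: passes (nearly) exactly through the training points,
    # but oscillates between them -- the "crazy function" described above.
    wiggly = np.polynomial.Polynomial.fit(x, y, deg=9)

    x_test = np.linspace(0, 1, 200)
    print("max |simple| on test grid:", np.abs(simple(x_test)).max())
    print("max |wiggly| on test grid:", np.abs(wiggly(x_test)).max())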
There are models that produce both outputs and confidences (essentially predict their own error standard deviation per output, based on the input).
So “over confident” can mean a model that predicted high confidence (low error deviation) inaccurately.
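A minimal PyTorch sketch of such a model (names, sizes and data are made up): two heads, one for the prediction and one for the predicted error variance, trained with a Gaussian negative log-likelihood. "Over confident" in this sense means the predicted variance is systematically too small relative to the actual errors.

    import torch
    import torch.nn as nn

    class HeteroscedasticRegressor(nn.Module):
        """Predicts both a mean and a per-input error variance."""
        def __init__(self, n_in, n_hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
            self.mean_head = nn.Linear(n_hidden, 1)
            self.logvar_head = nn.Linear(n_hidden, 1)  # predicted log-variance

        def forward(self, x):
            h = self.body(x)
            return self.mean_head(h), self.logvar_head(h).exp()

    model = HeteroscedasticRegressor(n_in=8)
    loss_fn = nn.GaussianNLLLoss()  # penalises both bad means and bad variances

    x, y = torch.randn(32, 8), torch.randn(32, 1)
    mean, var = model(x)
    loss = loss_fn(mean, y, var)
    loss.backward()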
If we are considering the function to be the neural network with an argmax applied to the output probabilities, it's not overfitting at all. Its classification accuracy over unseen data (validation set) continues to improve.
The issue here is one of calibration: https://en.m.wikipedia.org/wiki/Calibration_(statistics). That is, the output probabilities of the neural network do not reflect the true (observed) probabilities. If it is systematically underestimating the probabilities, it is termed "underconfident", and if overestimating the probabilities, "overconfident".
Note that in these cases, it may still be improving as a classifier on unseen data, while still showing higher validation loss as calibration degrades.
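For concreteness, here's a rough sketch of one common way to quantify that miscalibration (the bins and data are made up): expected calibration error, which compares the model's average confidence to its observed accuracy within confidence bins. This gap, and hence the validation loss, can keep growing even while classification accuracy improves.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """Average |accuracy - confidence| over equal-width confidence bins."""
        confidences = np.asarray(confidences)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(correct[mask].mean() - confidences[mask].mean())
                ece += mask.mean() * gap
        return ece

    # Toy example: the model is right ~80% of the time but claims ~99% confidence.
    conf = np.full(1000, 0.99)
    correct = np.random.default_rng(0).random(1000) < 0.8
    print(expected_calibration_error(conf, correct))  # roughly 0.19: overconfident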
Over fitting literally means what it says, fitting the training data too well to maintain a well formed function.
This is many decades old terminology for a well established effect that occurs for all curve fitting, function approximation, parameter optimizing, and model training algorithms.
You can Google it with no other context: “over fitting”. [0]
“Confidence” isn’t its name and its meaning has nothing to do with the effect.
Nothing wrong with making up terminology for new effects, but this one is an oldie.
Addendum: Most data has a fraction which is noise, or is dependent on information not completely captured by training data.
In both cases, training to a perfect fit of training data makes no sense.
Learning the particular noise in training data guarantees worse results on real data, where the noise will, by definition, be different.
The same is true for exactly reproducing data that depended on some unaccounted for information. Once applied, the unaccounted information in the training data will have no predictive quality.
So perfectly fitting data is usually a terrible idea, even when it can be done.
Training data captures a problem to be solved, as best it can. But it isn’t the same as the actual problem.
——
Best practice:
1. Use training data to optimize a model
2. Use separate validation data to approximate current generalization quality, and stop at (or revert to) the point where validation performance was best.
3. Use separate test data, not used for model design in any way, for a completely independent approximation of generalization performance.
Wide disparities between validation & test performance suggest problems. Ideally they should track each other fairly closely.
If not, more data is probably needed to characterize the problem more reliably.
NEVER just retrain, or tweak training parameters, until training plus validation-based stopping produce good test performance. That means the test data was actually used in the design and is no longer an independent measure of performance!
——
Assuming you are having trouble getting similar validation & test performances:
A good way to ensure test performance is real, is to retrain the model in exactly the same way, several times, with different random divisions of training, validation & test data. If test results are good regardless of data divisions, then they are reliable.
Then you can eliminate the test performance dependency, by randomly selecting any of the models. Don’t choose the one with best test results!
That takes discipline!
In that case, the mean of all test performances is your best estimate of generalization performance, regardless of model.
(Throwing out models with the worst and best test performances is ok, it avoids outlier training failures or successes equally. Note I said “avoids”, not “eliminates”, since the worst & best test performances are estimates of generalization, not a measure of actual generalization.)
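A minimal sketch of that repeated-splits procedure (sklearn and a toy dataset are used here purely for illustration; substitute your own model and early stopping):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    def run_once(seed):
        # Fresh random training / validation / test division for this repetition.
        X_trainval, X_test, y_trainval, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        X_train, X_val, y_train, y_val = train_test_split(
            X_trainval, y_trainval, test_size=0.25, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        # (With an iterative model you'd early-stop on the validation score here.)
        return model.score(X_test, y_test)

    # If test scores are good regardless of the split, they're reliable.
    scores = [run_once(seed) for seed in range(5)]
    print("per-split test accuracy:", np.round(scores, 3))
    print("mean generalization estimate:", np.mean(scores))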
Accuracy is very rarely a useful metric. It's more an engineering metric than something a user would ever care about.
What users want is to have their own credences properly calibrated by engaging with some system. From a physics textbook, they want a systematic presentation of ideas which allows them to build intuitions etc.
It's important to formulate the actual goal of the system, rather than just the engineer's goal (consider eg., "width of pipes" vs., "clean running water").
In the case of statistical AI systems, the goal is often best formulated in terms of the confidences of the system not its output. Since its output accuracy is kinda nonlinear and discontinuous in those confidences.
So from a statistical AI Q&A system we don't want The Answer, we want the system to have expert-like confidences over possible answers.
Of course, as soon as you start formulating these metrics, all the SoA 99%+ accuracy hype evaporates. Since most of these systems have terrible confidence distributions.
Consider, e.g., ChatGPT, whose answers are often plausibly accurate (they count as an answer) but just repeat some Silicon Valley hype in a way an expert wouldn't. ChatGPT rarely has the careful scepticism of an expert, rarely presents ideas in an even-handed way, rarely mentions the opposite.
It makes generating reference materials on areas with expert disagreement quite dangerous. ChatGPT presents the non-expert credence distribution. (And indeed, always does, since it just models (Q,A) frequencies which are not truth-apt)
This is mixing two meanings of confidence, which could lead to confusion. The OP is using confidence to describe how high the per-token probability scores are, while you are talking about the confidence expressed in the tone of voice of the language generated by the model. Really those are orthogonal issues. (E.g., a model could predict with high probability that an output should be "I don't know".)
I'm saying that, as a matter of fact, ChatGPT should have different confidences in propositions. My issue isn't the tone of voice; my issue is that the content of what it's saying is wrong with respect to what we care about, i.e., expert credences (/confidences) in the claims it's generating.
It can "express confidently" scepticism; it does not. That's the issue.
In my language above I was mostly using "credence" to talk about the strength of the mental state of belief, and "confidence" to talk about the model of that used in statistical AI.
I do think it’s a form of overfitting - loss on the training set improved while loss on the validation set got worse. However, it’s not the common form of overfitting, where accuracy on the validation set gets worse. In this case, accuracy on the validation data set continued to improve. But when it was wrong, it gave a higher confidence in its wrong answer than before. e.g. before it may have incorrectly thought the answer was X, with 60% confidence, now it still thinks the answer is X, but with higher confidence, say 70%.
I do think it’s a form of overfitting, but a weird one. Overconfidence seems like a good, more specific term to me.
I’m no expert on LLMs, but I don’t find this super surprising from a general ML point of view:
You have a generative model with billions of parameters that already assigns some probability mass to your (fine-tuning) samples. Now you compute a gradient that increases that probability mass, and take a step in the gradient’s direction. Essentially the OP is surprised that this significantly increases the probability mass of the samples under the model.
I’m not very surprised. The generative model is enormously over-parameterized and already assigns some probability mass to the (fine-tuning) samples. It would be surprising to me if there wasn’t a direction in this billion-dimensional parameter space that rapidly increases the probability of the relatively few samples.
> Was this not sort of the clear implication of the fact that most LLMs are currently only being trained with one epoch?
Slight nit: Many public LLMs are trained for at least slightly over one epoch, and usually several epochs on particular subsets of the data (like wikipedia).
Source? Maybe several epochs on some very small subsets, but my strong impression was that it was 1 epoch in the pre-training run for pretty much all of the top LLMs.
They are not being trained only on 1 epoch. They are trained on multiple epochs for high quality data. Also Meta team with llama show that simply training more, more tokens, continues to reduce loss.
If you divide the number of sentences trained on by the total number of sentences in its corpora, the number for most of the top LLMs will be far closer to ~1 than any other integer.
> Also Meta team with llama show that simply training more, more tokens, continues to reduce loss.
Can you source the specific claim you are talking about? More tokens to me generally will mean new tokens unless you are specifying.
from the paper "We train for one epoch over the training data. In earlier experiments, we found that
training longer can lead to over-fitting"
I could be wrong, but I thought the llama 2 paper explicitly called out 1 epoch and that more than that caused over-fitting in their other experiments.
Probably unrelated, but I tried to get ChatGPT to write me some code to programmatically control the details of a column filter in an Excel spreadsheet in PowerShell.
Nothing it tried worked, it got close, but it didn't work.
Finally I found some C# code that fixed the problem, and I pasted that code into ChatGPT, asked it to read it, and then fix the problem in PowerShell.
It said it understood the solution, updated the script, and it worked perfectly.
For some reason that behavior was pretty eye opening. Providing material in the question that it wasn't trained on made it solve it.
It's understandable how it did it from language training, it just felt very cool that LLMs can do that.
Interesting anecdote. I think there's a common theme with current LLMs, that people focus unreasonably much on "knowledge retrieval" from the models (1) and under-hype and under-appreciate the "language model" part.
These things are really easy to anthropomorphize, partly because they are good at "talking" and "articulating". So good that we tend to just accept that magical, enormous feat of statistical engineering as a trivial building block. But it's a brick made of gold.
Translating (from natural language to code, from text to audio, from image to image, one natural language to another), editing, summarizing, expanding/extrapolating is what these models do.
The inherent "knowledge" is just context.
(1) Vector embedding is in my view a little different - it's a form of semantic cataloging (akin to Dewey decimal) - and certainly enables search.
But "data retrieval" (who was US president in 1984) directly from the models isn't really all that interesting IMNHO.
It's why the "hallucination" concern is IMO not a helpful way for people to conceive of the remaining challenges. These things aren't meant to be search engines, search engines already exist, and I don't understand the utility of using the model itself as a search engine (I do understand having the model search for you and summarise what it finds, like an integrated assistant). The model is better conceived of as the part that does the thinking, and to work on knowledge that you want to be reliable you have to have that knowledge accessible to the model in some other format. We know how to store information, generally. What is interesting and useful about this model is not their ability to off-the-cuff recall facts without access to any resources, that's a party trick in humans and AI. What is interesting about them is their ability to be given a piece of information, understand it, and use that information for logical reasoning. That provides the ability to answer questions about the information, use the information in conjunction with other information, etc. That is new for a natural language interface, and it has really interesting implications for what we can build with it.
Does anyone know if LLMs have been used to augment their own training data?
I wonder what would happen if you trained an LLM on a little input but then had it generate a lot of synthetic input added to the training data. I think of it as "dreaming". This seems like it would just add noise, but LLMs are able to improve their output by augmenting their own context (by "thinking out loud"), maybe they can do the same with their own training data?
That's effectively what RLHF is; a means for LLMs to self train on their own output exclusively by using a small human curated dataset as guidance as to what a "good" and "bad" output is.
It's interesting that this conclusion is the exact opposite of a sibling comment, which proposes that a small, human-curated corpus may be more effective than big, synthetic datasets.
If it's training on the same data that it generates, there's no new information being added into the system. You'd be reinforcing everything that it already gets right and wrong, which would lead to zero improvement.
That said, it's common to use large models to generate synthetic training data for training other smaller models. In this way, we're able to transfer knowledge from one model to another.
You can find the answer by trying the following: generate random data according to a model, fit a linear regression (or any other distribution), sample from the fitted distribution, and add the samples to the training set.
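Here's a toy version of that experiment (all numbers are made up): fit a model, "dream" synthetic data from it, refit on the augmented set, and check whether anything changes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Ground truth: y = 2x + 1 plus noise.
    x = rng.uniform(-1, 1, 200)
    y = 2 * x + 1 + rng.normal(0, 0.3, size=x.shape)

    def fit(x, y):
        slope, intercept = np.polyfit(x, y, deg=1)
        resid_std = np.std(y - (slope * x + intercept))
        return slope, intercept, resid_std

    slope, intercept, resid_std = fit(x, y)

    # Sample synthetic data from the fitted model and add it to the training set.
    x_syn = rng.uniform(-1, 1, 2000)
    y_syn = slope * x_syn + intercept + rng.normal(0, resid_std, size=x_syn.shape)
    slope2, intercept2, _ = fit(np.r_[x, x_syn], np.r_[y, y_syn])

    # The refit parameters barely move: no new information was added,
    # only the existing fit (and its noise) was reinforced.
    print((slope, intercept), "->", (slope2, intercept2))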
Isn't learning from a single example desirable, while memorizing undesirable in the context of training? The former is the goal we're aiming for in order to match how animals learn, while the latter a failure mode that happens often. The article shows a case of unexplained memorizing, not of learning, right?
I see similar loss curves when training ViTs (from scratch), which has always bothered me but I had bigger concerns so never delved too deep into it. The only difference is that I see the training loss go _up_ during each epoch. The cliffs between epochs are large enough that training loss goes down overall and validation loss keeps going down the whole time as well. The model gets close-ish to SoTA so I guess it's "normal".
I haven't trained convnets at this scale so I'm not sure if similar behavior has been seen there, but you'd think someone would have mentioned it at some point. So perhaps these strange loss curves are a feature of Transformer based models in particular?
The original article mentioned LLMs needing powerful abstractions.
This is basically the case with transformer networks, which is apparent when learning from scratch. The model seems to be going basically nowhere and totally useless until suddenly, at some random point after a bunch of learning cycles, the weights find some minimum on the error surface and bam, suddenly the model can do things properly. And it's because the transformer has learned an abstraction that works for all of the input data in an attentional sense (think how you scan a sentence when reading). Not the best explanation, but it's from memory from a post I saw on HN a while back.
Oh wow yeah - I've also seen other people's training loss curves like that, going up during each epoch and then jumping down at the end of the epoch. I've never experienced that myself, and have no idea what's causing it!
After the first epoch, the average time since the present data item was last used during training is small at the beginning of an epoch and grows during the epoch. I'd expect that to positively relate to loss on the present iteration.
Does this mean it is now computationally efficient to have the model learn/memorize information on the fly, say the current chat context, as part of the model weights? One-shot encoding (something the hippocampus is very good at) allows us to build experiences into retrievable memories tied into semantic concepts we've previously learned... in fact it gets better the richer our semantic conceptualization of events becomes from childhood into adulthood.
If memorization of events in LLMs is accelerated because of these deep semantic frameworks, then does this provide a path towards long context windows?
Maybe, but there are a lot of unknowns. Does the "memorization on the fly" come with catastrophic forgetting of other information? How does one control for memorizing recent stuff vs. remembering older stuff?
I like the idea. You would need your own mutable copy of the model, which is usually huge. And you need to backprop so there is a bit more computation. It might be doable for a local model that is smaller than GPT3.5/4.
You also need to decide what is worth memorizing long term vs short term.
Coming back to this. LoRA training is only on the attention layers, and this was sufficient for memorization, per the article. So we wouldn't update all the model's weights in some kind of constant-context one-shot learning scheme.
But if you have say 50bn weights, and you run backprop, you are going to update most of the weights (except the dropout ones, but which ones drop out changes on every token I think). This means you need 50bn deltas. It might compress, but if you do then you need extra compute to do that.
If this holds true, this would support the idea that much smaller, human curated datasets will be of much higher value than synthetic datasets generated by LLMs
Whichever has the most information wins. When the information has structure you can heavily exploit it for generating synthetic data. For this I point you to Apple Sim. It’s a repository of 3D models for interiors. You can generate many layers of information by controlling the renderer and then use it on real photos. That’s done all over images so vectorial spaces are pretty natural for embeddings. You don’t need to add much structure algebraically speaking.
If your domain is heavily algebraic, you might even be able to generate correct examples arbitrarily, which is a situation I recommend anyone to be in.
I assume there is a value metric that balances quality with quantity that may be exploitable in our mid-gains period of understanding the tech's behavior -- meaning potential gains from synthetic data. That said, I also expect no-free-lunch to kick in at some point, and synthetic data doesn't always pay attention to the data generating process for outliers.
You will find active learning interesting. It starts by attributing a value to each point in your domain that it learns to match the expected gain in some performance metric.
This metric can be learned so it’s okay if it’s really hard to specify.
I doubt it. If anything, ULMFiT era AI has finally killed the need for human curated data. ChatGPT 4 is already being used as an oracle model that everyday AI models are trained off of. A truly gargantuan oracle model will obviate all but the smallest of human input.
GPT4 relies heavily on human curated data. Both for specific domains and for instruction following. Any new model that tries to go beyond it will also likely rely on such data.
Yeah it's been known that OpenAI hires domain experts. If anything, they augment that high quality data rather than just starting from bare bones synthetic data.
I often observe similar phenomena in CNN-related research, which indicates that the model indeed can learn from a single example. But sadly, this requires the dataset to be randomly distributed; in real-world applications, new data does not meet this requirement.
I’ve observed the same phenomenon with fine-tuning LLMs and I thought it was pretty strange but so far as I could tell other people were observing the same thing but mostly not commenting on it. The conclusion I’d draw is that you’re not going to benefit greatly from adding more data when your model behaves like this.
Overconfidence bugs me because if you want to turn predictions into decisions and actions you have to be calibrated. I've found that some of these models that look like they are overfitting on loss are actually still improving on AUC (which matters to me more than accuracy), and I can put a calibrator after the model to get the results I want.
(Still, for my current problem which has noisy labels, I find embedding + classical ML performs as well and takes a fraction of the time as fine tuning and clearly shows benefit trained on more examples than FT does. If I was going to do more model engineering on this problem I would probably resort to “stacking”)
Could this be an artifact of just not reshuffling the dataset and how the weight regime is? What if you reversed the dataset in the second epoch, under the memory hypothesis the training loss would not plummet if it has not learnt anything during the epoch after the first 10%. Yes?
The report mentions there is no reshuffling:
> We’re not re-shuffling the dataset at the start of the epoch, so those first batches of the second epoch are when the learning rate was still warming up.
Isn't this what people would do? I'd definitely update my knowledge after a single failed test question, if it was something I'd care about, and I discovered my previous model of reality was wrong.
> I'd definitely update my knowledge after a single failed test question
Maybe you would, maybe you wouldn’t. There are several psychological experiments which show people don’t act the way they say they “definitely” would when confronted with the situation. Quite a few examples in the movie “Experimenter”: https://en.wikipedia.org/wiki/Experimenter_(film)
> if it was something I'd care about, and I discovered my previous model of reality was wrong.
Those two ifs are doing a ton of heavy lifting. LLMs neither “care” nor “discover”. It’s not like you’re giving it a new contradicting piece of information and it’s going “interesting, let me research on that and update my model of reality if after careful consideration I find your assertion to be true”. It’s closer to having someone who’ll just accept everything you say and repeat it.
GPT-4 (I haven't really tested other models) is surprisingly adept at "learning" from examples provided as part of the prompt. This could be due to the same underlying mechanism.
I’ve found the opposite in trying to get it to play Wordle. It’ll repeatedly forget things it’s seemingly learned within the same session, all the while confident in its correctness.
LLMs are trained on 'tokens' derived from 'words' and 'text', and even though there are tokens that are just one letter, the bulk are a rough approximation to syllables, as though you're trying to create a dictionary to be used for data compression.
It might be more effective to try to play 'tokendle' before trying to play 'wordle'.
Do you know whether LLMs grasp the equivalence of a word expressed as one whole-word token and as a series of single character tokens that spell out the same word?
I'm curious if modifying the way some input words are split into tokens could be useful for letter-by-letter reasoning like in Wordle.
Or would an LLM get confused if we were to alter the way the tokenization of the input text is done, since it probably never encountered other token-"spellings" of the same word?
From what I understand it is anything goes, it could be letters or it could be a whole word or even a sentence fragment or a concept ('The United States of America'). Think of it as the dictionary for a compression algorithm and you wouldn't be too far off.
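To make the "compression dictionary" analogy concrete, here's a small sketch using the tiktoken library (the encoding name is just an example; other tokenizers behave similarly, and the exact splits will vary):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # e.g. a GPT-4-era tokenizer

    for word in ["blue", "crane", "wordle"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(word, "->", pieces)

    # Common words typically map to a single token, while rarer ones split into
    # sub-word chunks that don't correspond to single letters -- part of why
    # letter-by-letter games like Wordle are awkward for LLMs.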
Sure. Neural nets in general can: after they've been trained on billions of examples first.
It really helps if they've previously seen the same or similar "single example". Which, let's be fair, the larger the training data, the higher the chances they have.
>> This seemed, at first, quite impossible. It would imply that the model was learning to recognise inputs from just one or two examples
To be more precise: the article is talking about fine-tuning a pre-trained LLM, so that's a-few-billion-plus-one-or-two examples.
Btw, what model was that? The article doesn't say.
That's intriguing. But what I want to see is if that one example can change the whole web of knowledge previously established. So, for example, if we finetune the model with a sentence like "Scientists discovered that a type of antigen can make a host immune to HIV" will it then be able to infer that "mRNA vaccines are a valid preventive approach to AIDS since they may be able to express a type of resistance known to make hosts immune to HIV"?
Why do you say so? We casually call it "connecting the dots". It's like in the Oppenheimer movie when, after the first demonstration of uranium splitting, people thought "oh, we can make a bomb with that".
We need to pull up both elements simultaneously and correlate them, it doesn't happen automatically because we learned that "a type of antigen can make a host immune to HIV".
Yes, ideally the former will associate well enough with the latter that, once you find some reason to think about mRNA, it will automatically drag up the thing you learned earlier and then you'll update. But it doesn't happen by itself, and sometimes it doesn't happen at all. Most people contain significant inconsistencies -- I would dare to suggest most likely everyone.
Isn't it highly dependent on what your one epoch of data is? If there are a lot of repetitions of similar concepts in there, then can you say it's learning from one example?
Yes, it can.
Yesterday I gave it a help chapter about Angular 16 in a prompt.
The knowledge cutoff is perhaps nice for politics, but not for programmers.
Afterwards I could ask it about syntax problems I had in some code.
Essentially it understands programming; it just didn't know what was possible in Angular 16.
A single example made it learn from it. Though when I asked for an example, I got the exact same sample I had given it to learn from.
Perhaps end this knowledge cutoff for technical data.
It's okay not wanting to get into politics (neither do I).
But give it something to read (yup, let it read and remember it; a simple prompt to read this page by page will do), and give it some recent books or popular coding websites. Let it read python.org, angular.io, perhaps some modern manuals and books.
It also seemed keen to learn new information; it quickly adopted it.
But only in that session.