OP here -- with a 112M model you should be able to get something worth playing with using 2.24B tokens. The Chinchilla heuristic is tokens = 20 x parameters. Obviously you can get a better result by grinding through more tokens, but it will be very slow progress. It's worth noting that Andrej Karpathy is using the 20x thing for his nanochat project.
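For concreteness, here's the back-of-the-envelope sum (just the numbers above, nothing more):

```python
# Chinchilla-style token budget: tokens ≈ 20 × parameters (back-of-the-envelope).
params = 112_000_000                   # 112M-parameter model
tokens = 20 * params                   # the 20x heuristic
print(f"{tokens / 1e9:.2f}B tokens")   # -> 2.24B tokens
```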
I try to explain the Chinchilla paper in the post, but your favourite AI should be able to explain it well, and has the benefit that you can ask follow-up questions.
OP here: one thing that surprised me in this experiment was that the model trained on the more curated FineWeb-Edu dataset was worse than the one trained on FineWeb. That is very counterintuitive to me.
OP here -- thanks! I'm in the process of doing some trains using the same code plus DDP on big Lambda Labs machines, and (within the bounds of what I can afford) will hopefully have some interesting results about all of those shortly.
OK, early indicators support both you and Gemini quite strongly re: batch size. On my (somewhat ad-hoc) test dataset, I get losses like this:
* OpenAI medium weights: 3.231
* OpenAI small weights: 3.500
* My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
* My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
* My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
* My cloud trained model, FineWeb Chinchilla, batch size 13 × 8 = 104: 3.674
That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly.
I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).
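For anyone curious, the DDP conversion is roughly this shape -- a minimal sketch with a toy model standing in for the real one, not my actual training code:

```python
# Minimal DDP conversion sketch (PyTorch; toy model stands in for the real GPT).
# Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(768, 768).cuda(local_rank)  # toy stand-in for the real model
model = DDP(model, device_ids=[local_rank])         # gradients all-reduced across GPUs

# Each GPU gets its own micro-batch, so e.g. 13 per GPU on 8 GPUs gives an
# effective batch size of 13 * 8 = 104.
```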
Thanks, very nice to see these results! Certainly using GPUs with more RAM makes things simpler to scale. Gradient accumulation is as easy as adding a step counter and an `if counter % gradient_accumulation_steps == 0:` check around `optimizer.step()`, so that can also be tried simply on a single GPU / cheaper GPUs. But if you can just use 8x A100s and your pipeline parallelizes well, you also get results (almost) 8 times faster, which is certainly nicer for experimenting, of course!
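Something like this, as a minimal sketch with a toy model and toy data (not the actual code from the post):

```python
# Minimal gradient-accumulation sketch (toy model and data; not the post's code).
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
accum_steps = 8                       # effective batch = micro-batch size * accum_steps

optimizer.zero_grad()
for counter in range(64):
    x, y = torch.randn(4, 10), torch.randn(4, 1)   # one micro-batch of toy data
    loss = loss_fn(model(x), y) / accum_steps      # scale so accumulated grads average
    loss.backward()                                # grads add up in the .grad buffers
    if (counter + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()                      # reset in the same if-block
```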
Exactly! If I can get it down to an hour or two (seems very plausible on an 8x H200 with 160 GiB VRAM per GPU, though those are almost never available on Lambda Labs), I'll do the experiments with dropout and the other possible causes of issues, then see if I can bake that all into a new train on the RTX 3090 and confirm it repros there. Looks like I'll definitely need gradient accumulation there.
I assume the zero_grad would need to go in the same if block?
How much of that low survival rate is due to the condition for which they received the transplant, though? Conceivably a patient with "just" HIV might do better than one with, e.g., leukemia and HIV.
That said, IIUC the whole stem cell transplant procedure is unpleasant enough that it still might not be worth it.
"The major cause of death is relapse, which accounts for approximately 40% of all deaths, followed by infections at 25% and graft-versus-host disease (GVHD) at 20%."
A good friend of mine died from a C. Diff infection in the hospital after a bone marrow transplant. It is very risky, especially with an imperfect match.
That said, you can help make it less risky! This used to be called "Be The Match"; I'm not sure why they renamed it, but you could save someone's life by registering to be a donor:
I donated bone marrow through Be the Match (before they changed their name). It was painful, but I highly recommend the experience to folks whenever it comes up.
You get to save the life of a stranger AND they give you a t-shirt. Win win!
To be fair to the OpenAI team, if read in context the situation is at worst ambiguous.
The deleted tweet that the article is about said "GPT-5 just found solutions to 10 (!) previously unsolved Erdös problems, and made progress on 11 others. These have all been open for decades." If it had been posted stand-alone then I would certainly agree that it was misleading, but it was not.
The "this" in question is what this second tweet is in turn quote-tweeting: https://x.com/SebastienBubeck/status/1977181716457701775?t=T... -- where the author says "gpt5-pro is superhuman at literature search: [...] it just solved Erdos Problem #339 (listed as open in the official database erdosproblems.com/forum/thread/3…) by realizing that it had actually been solved 20 years ago"
So, reading the thread in order, you get
* SebastienBubeck: "GPT-5 is really good at literature search, it 'solved' an apparently-open problem by finding an existing solution"
* MarkSellke: "Now it's done ten more"
* kevinweil: "Look at this cool stuff we've done!"
I think the problem here is the way quote-tweets work -- you only see the quoted post and not anything that it in turn is quoting. Kevin Weil had the two previous quotes in his context when he did his post and didn't consider the fact that readers would only see the first level, so wouldn't have Sebastien Bubeck's post in mind when they read his.
That seems like an easy mistake to entirely honestly make, and I think the pile-on is a little unfair.
> Kevin Weil had the two previous quotes in his context when he did his post and didn't consider the fact that readers would only see the first level, so wouldn't have Sebastien Bubeck's post in mind when they read his.
No, Weil said he himself misunderstood Sellke's post[1].
Note Weil's wording (10 previously unsolved Erdos problems) vs. Sellke's wording (10 Erdos problems that were listed as open).
Am I correct in thinking this is the 2nd such fumble by a major lab? DeepMind released their “matrix multiplication better than SOTA” paper a few months back, which suggested Gemini had uncovered a new way to optimally multiply two matrices in fewer steps than previously known. Then immediately after their announcement, mathematicians pointed out that their newly discovered SOTA had been in the literature for 30-40 years, and was almost certainly in Gemini’s training set.
No, your claim about matrix multiplication is false. Google's new algorithm can be applied recursively to 4x4 block matrices (over the field of complex numbers). This results in an asymptotically faster algorithm for nxn matrix multiplication than Strassen's. Earlier results on 4x4 matrices by Winograd and others did not extend to block matrices.
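To spell out the asymptotics (my own back-of-the-envelope arithmetic): recursing on k x k blocks with m scalar multiplications gives an exponent of log_k(m), so:

```python
import math
# Exponent of n×n matrix multiplication built by recursing on k×k blocks with
# m scalar multiplications: 4×4 blocks with 48 mults vs. Strassen's 2×2 with 7.
print(math.log(48, 4))  # ≈ 2.7925
print(math.log(7, 2))   # ≈ 2.8074 (Strassen) -- so the 48-mult scheme wins asymptotically
```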
That doesn't match my recollection of the AlphaEvolve release.
Some people just read the "48 multiplications for a 4x4 matrix multiplication" part, and thought they had found prior art at that performance or better. But they missed that the supposed prior art had tighter requirements on the contents of the matrix, which meant those algorithms were not usable for implementing a recursive divide-and-conquer algorithm for much larger matrix multiplications.
It's an interesting type of fumble too, because it's easy to (mistakenly!) read it as "LLM tries and fails to solve problem but thinks it solved it" when really it's being credited with originality for discovering or reiterating solutions already out there in the literature.
It sounds like the solutions themselves are perfectly fine, so it's unfortunate that the headline will leave the impression that these are just more hallucinations. They're not hallucinations, they're not wrong, they're just wrongly assigned credit for existing work. Which, you know, where have we heard that one before? It's like the stylistic "borrowing" from artists, but in research form.
So the first guy said "solved [...] by realizing that it had actually been solved 20 years ago", and the second guy said "found solutions to 10 (!) previously unsolved Erdös problems".
Previously unsolved. The context doesn't make that true, does it?
Right, and I would even go a step further and say the context from SebastienBubeck is stretching "solved" past its breaking point by equating literature research with self-bootstrapped problem solving. When it's later characterized as "previously unsolved", it's doubling down on the same equivocation.
Don't get me wrong, effectively surfacing unappreciated research is great and extremely valuable. So there's a real thing here but with the wrong headline attached to it.
> Don't get me wrong, effectively surfacing unappreciated research is great and extremely valuable. So there's a real thing here but with the wrong headline attached to it.
If I said that I solved a problem, but actually I took the solution from an old book, people would call me a liar. If I were a prominent person, it would be an academic fraud incident. No one would be saying "I did an extremely valuable thing" or "there was a real thing here".
Some of the most important advancements in the history of science came from reviewing underappreciated discoveries that already existed in the literature. Mendel's work on genetics went underappreciated for decades before being effectively rediscovered, and proved to be integral to the modern synthesis, which provided a genetic basis for evolution and is the most important development in our understanding of evolution since Darwin and Wallace's original formulation.
Henrietta Leavitt's work on the relation between a star's period of pulsation and its brightness was tucked away in a Harvard journal; its revolutionary potential wasn't appreciated until Hubble recalled and applied her work years later to measure the distance to Andromeda, establishing that it was an entirely separate galaxy, and her relation went on to underpin the redshift measurements that contributed to the bedrock of modern cosmology.
The pathogenic basis for ulcers was proposed in the 1940s; it later became instrumental in explaining data in the 1980s and led to a Nobel Prize in 2005.
It is and always has been fundamental to the progress of human knowledge not just to propose new ideas but to pull pertinent ones from the literature and apply them in new contexts. Depending on the field, the research landscape can be inconceivably vast, so efficiencies in combing through it can create the scaffolding for major advancements in understanding.
> "GPT-5 is really good at literature search, it 'solved' an apparently-open problem by finding an existing solution"
Survivor bias.
I can assure you that GPT-5 fucks up even relatively easy searches. I need to have a very good idea of what the result should look like, and the ability to test it, to be able to use any result from GPT-5.
If I throw the dice 1000 times and post about it each time I get a double six, am I the best dice thrower there is?
I'm not really sure what you mean. Literature search is about casting a wide net to make a reading list that is relevant to your research.
It is pretty hard to fuck that up, since you aren't expected to find everything anyway. The idea of "testing" and "using any result from GPT" is just, like, reading the papers and seeing if they are tangentially related.
If I may speak to my own experience, literature search has been the most productive application I've personally used, more than coding, and I've found many interesting papers and research directions with it.
One time when I was a kid my dad and I were playing Yahtzee, and he rolled five 5s on his first roll of the turn. He was absolutely stunned, and at the time I was young enough that I didn't understand just how unlikely it was. If I only I knew that I was playing against the best dice thrower!
For literature search that might be OK. It doesn't need to replace any other tools, and if one time in ten it surfaces something you wouldn't have found otherwise, it could be worth the time spent on the dud attempts.
This is a great post on many levels, but what struck me as particularly clever was the use of lm_head to decode the outputs of earlier layers. That linear layer is only trained to decode the output of the last layer, so intuitively it might only be able to do that -- the embedding spaces used between earlier layers might be different and "incompatible". It's really interesting that that is not the case.
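Out of curiosity, here's a minimal sketch of what I think that trick looks like with a stock Hugging Face GPT-2 -- my own reconstruction, not the post's code, so the details may differ:

```python
# Decode every layer's hidden state with the (final-layer-trained) lm_head.
# My reconstruction using Hugging Face transformers, not the post's actual code.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):          # embeddings + each block's output
    logits = model.lm_head(model.transformer.ln_f(h))  # reuse final norm + lm_head
    top_token = tok.decode(logits[0, -1].argmax().item())
    print(f"layer {layer:2d}: {top_token!r}")
```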
Post author here. I agree 100%! The post is the basic maths for people digging into how LLMs work under the hood -- I wrote a separate one for non-techies who just want to know what they are, at https://www.gilesthomas.com/2025/08/what-ai-chatbots-are-doi...