OP here -- with a 112M model you should be able to get something worth playing with using 2.24B tokens. The Chinchilla heuristic is tokens = 20 x parameters. Obviously you can get a better result by grinding through more tokens, but it will be very slow progress. It's worth noting that Andrej Karpathy is using the 20x thing for his nanochat project.
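For concreteness, here's the back-of-the-envelope sum (just the numbers above, nothing more):

```python
# Chinchilla-style token budget: tokens ≈ 20 × parameters (back-of-the-envelope).
params = 112_000_000                   # 112M-parameter model
tokens = 20 * params                   # the 20x heuristic
print(f"{tokens / 1e9:.2f}B tokens")   # -> 2.24B tokens
```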
I try to explain the Chinchilla paper in the post, but your favourite AI should be able to explain it well, and has the benefit that you can ask follow-up questions.
OP here: one thing that surprised me in this experiment was that the model trained on the more curated FineWeb-Edu dataset was worse than the one trained on FineWeb. That is very counterintuitive to me.
OP here -- thanks! I'm in the process of doing some trains using the same code plus DDP on big Lambda Labs machines, and (within the bounds of what I can afford) will hopefully have some interesting results about all of those shortly.
OK, early indicators support both you and Gemini quite strongly re: batch size. On my (somewhat ad-hoc) test dataset, I get losses like this:
* OpenAI medium weights: 3.231
* OpenAI small weights: 3.500
* My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
* My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
* My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
* My cloud trained model, FineWeb Chinchilla, batch size 13 × 8 = 104: 3.674
That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly.
I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).
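For anyone curious, the DDP conversion is roughly this shape -- a minimal sketch with a toy model standing in for the real one, not my actual training code:

```python
# Minimal DDP conversion sketch (PyTorch; toy model stands in for the real GPT).
# Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(768, 768).cuda(local_rank)  # toy stand-in for the real model
model = DDP(model, device_ids=[local_rank])         # gradients all-reduced across GPUs

# Each GPU gets its own micro-batch, so e.g. 13 per GPU on 8 GPUs gives an
# effective batch size of 13 * 8 = 104.
```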
Thanks, very nice to see these results! Certainly using GPUs with more RAM makes things simpler to scale. Gradient accumulation is as easy as adding a step counter and an `if counter % gradient_accumulation_steps == 0:` check around `optimizer.step()`, so that can also be tried simply on a single GPU / cheaper GPUs. But if you can just use 8x A100s and your pipeline parallelizes well, you also get results (almost) 8 times faster, which is certainly nicer for experimenting, of course!
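Something like this, as a minimal sketch with a toy model and toy data (not the actual code from the post):

```python
# Minimal gradient-accumulation sketch (toy model and data; not the post's code).
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
accum_steps = 8                       # effective batch = micro-batch size * accum_steps

optimizer.zero_grad()
for counter in range(64):
    x, y = torch.randn(4, 10), torch.randn(4, 1)   # one micro-batch of toy data
    loss = loss_fn(model(x), y) / accum_steps      # scale so accumulated grads average
    loss.backward()                                # grads add up in the .grad buffers
    if (counter + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()                      # reset in the same if-block
```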
Exactly! If I can get it down to an hour or two (seems very plausible on an 8x H200 with 160 GiB VRAM per GPU, though those are almost never available on Lambda Labs), I'll do the experiments with dropout and the other possible causes of issues, then see if I can bake that all into a new train on the RTX 3090 and confirm it repros there. Looks like I'll definitely need gradient accumulation there.
I assume the zero_grad would need to go in the same if block?
How much of that low survival rate is due to the condition for which they received the transplant, though? Conceivably a patient with "just" HIV might do better than one with, e.g., leukemia and HIV.
That said, IIUC the whole stem cell transplant procedure is unpleasant enough that it still might not be worth it.
"The major cause of death is relapse, which accounts for approximately 40% of all deaths, followed by infections at 25% and graft-versus-host disease (GVHD) at 20%."
A good friend of mine died from a C. Diff infection in the hospital after a bone marrow transplant. It is very risky, especially with an imperfect match.
That said, you can help make it less risky! This used to be called "Be The Match"; I'm not sure why they renamed it, but you could save someone's life by registering to be a donor:
I donated bone marrow through Be the Match (before they changed their name). It was painful, but I highly recommend the experience to folks whenever it comes up.
You get to save the life of a stranger AND they give you a t-shirt. Win win!
To be fair to the OpenAI team, if read in context the situation is at worst ambiguous.
The deleted tweet that the article is about said "GPT-5 just found solutions to 10 (!) previously unsolved Erdös problems, and made progress on 11 others. These have all been open for decades." If it had been posted stand-alone then I would certainly agree that it was misleading, but it was not.
The "this" in question is what this second tweet is in turn quote-tweeting: https://x.com/SebastienBubeck/status/1977181716457701775?t=T... -- where the author says "gpt5-pro is superhuman at literature search: [...] it just solved Erdos Problem #339 (listed as open in the official database erdosproblems.com/forum/thread/3…) by realizing that it had actually been solved 20 years ago"
So, reading the thread in order, you get
* SebastienBubeck: "GPT-5 is really good at literature search, it 'solved' an apparently-open problem by finding an existing solution"
* MarkSellke: "Now it's done ten more"
* kevinweil: "Look at this cool stuff we've done!"
I think the problem here is the way quote-tweets work -- you only see the quoted post and not anything that it in turn is quoting. Kevin Weil had the two previous quotes in his context when he did his post and didn't consider the fact that readers would only see the first level, so wouldn't have Sebastien Bubeck's post in mind when they read his.
That seems like an easy mistake to entirely honestly make, and I think the pile-on is a little unfair.
> Kevin Weil had the two previous quotes in his context when he did his post and didn't consider the fact that readers would only see the first level, so wouldn't have Sebastien Bubeck's post in mind when they read his.
No, Weil said he himself misunderstood Sellke's post[1].
Note Weil's wording (10 previously unsolved Erdos problems) vs. Sellke's wording (10 Erdos problems that were listed as open).
Am I correct in thinking this is the 2nd such fumble by a major lab? DeepMind released their “matrix multiplication better than SOTA” paper a few months back, which suggested Gemini had uncovered a new way to optimally multiply two matrices in fewer steps than previously known. Then immediately after their announcement, mathematicians pointed out that their newly discovered SOTA had been in the literature for 30-40 years, and was almost certainly in Gemini’s training set.
No, your claim about matrix multiplication is false. Google's new algorithm can be applied recursively to 4x4 block matrices (over the field of complex numbers). This results in an asymptotically faster algorithm for nxn matrix multiplication than Strassen's. Earlier results on 4x4 matrices by Winograd and others did not extend to block matrices.
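To spell out the asymptotics (my own back-of-the-envelope arithmetic): recursing on k x k blocks with m scalar multiplications gives an exponent of log_k(m), so:

```python
import math
# Exponent of n×n matrix multiplication built by recursing on k×k blocks with
# m scalar multiplications: 4×4 blocks with 48 mults vs. Strassen's 2×2 with 7.
print(math.log(48, 4))  # ≈ 2.7925
print(math.log(7, 2))   # ≈ 2.8074 (Strassen) -- so the 48-mult scheme wins asymptotically
```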
That doesn't match my recollection of the AlphaEvolve release.
Some people just read the "48 multiplications for a 4x4 matrix multiplication" part, and thought they had found prior art at that performance or better. But they missed that the supposed prior art had tighter requirements on the contents of the matrix, which meant those algorithms were not usable for implementing a recursive divide-and-conquer algorithm for much larger matrix multiplications.
It's an interesting type of fumble too, because it's easy to (mistakenly!) read it as "LLM tries and fails to solve problem but thinks it solved it" when really it's being credited with originality for discovering or reiterating solutions already out there in the literature.
It sounds like the solutions themselves are perfectly fine, so it's unfortunate that the headline will leave the impression that these are just more hallucinations. They're not hallucinations, they're not wrong, they're just wrongly assigned credit for existing work. Which, you know, where have we heard that one before? It's like the stylistic "borrowing" from artists, but in research form.
So the first guy said "solved [...] by realizing that it had actually been solved 20 years ago", and the second guy said "found solutions to 10 (!) previously unsolved Erdös problems".
Previously unsolved. The context doesn't make that true, does it?
Right, and I would even go a step further and say the context from SebastienBubeck is stretching "solved" past its breaking point by equating literature research with self-bootstrapped problem solving. When it's later characterized as "previously unsolved", it's doubling down on the same equivocation.
Don't get me wrong, effectively surfacing unappreciated research is great and extremely valuable. So there's a real thing here but with the wrong headline attached to it.
> Don't get me wrong, effectively surfacing unappreciated research is great and extremely valuable. So there's a real thing here but with the wrong headline attached to it.
If I said that I solved a problem, but actually I took the solution from an old book, people would call me a liar. If I were a prominent person, it would be an academic fraud incident. No one would be saying "I did an extremely valuable thing" or "there was a real thing here".
Some of the most important advancements in the history of science came from reviewing underappreciated discoveries that already existed in the literature. Mendel's work on genetics went underappreciated for decades before being effectively rediscovered, and proved to be integral to the modern synthesis, which provided a genetic basis for evolution and is the most important development in our understanding of evolution since Darwin and Wallace's original formulation.
Henrietta Leavitt's work on the relation between a star's period of pulsation and its brightness was tucked away in a Harvard journal; its revolutionary potential wasn't appreciated until Hubble recalled and applied her work years later to measure the distance to Andromeda, establishing that it was an entirely separate galaxy, and her relation went on to underpin the redshift measurements that contributed to the bedrock of modern cosmology.
The pathogenic basis for ulcers was proposed in the 1940s; it later became instrumental in explaining data in the 1980s and led to a Nobel Prize in 2005.
It is and always has been fundamental to the progress of human knowledge not just to propose new ideas but to pull pertinent ones from the literature and apply them in new contexts. Depending on the field, the research landscape can be inconceivably vast, so efficiencies in combing through it can create the scaffolding for major advancements in understanding.
> "GPT-5 is really good at literature search, it 'solved' an apparently-open problem by finding an existing solution"
Survivor bias.
I can assure you that GPT-5 fucks up even relatively easy searches. I need to have a very good idea of what the result should look like, and the ability to test it, to be able to use any result from GPT-5.
If I throw the dice 1000 times and post about it each time I get a double six, am I the best dice thrower there is?
I'm not really sure what you mean. Literature search is about casting a wide net to make a reading list that is relevant to your research.
It is pretty hard to fuck that up, since you aren't expected to find everything anyway. The idea of "testing" and "using any result from GPT" is just, like, reading the papers and seeing if they are tangentially related.
If I may speak to my own experience, literature search has been the most productive application I've personally used, more than coding, and I've found many interesting papers and research directions with it.
One time when I was a kid my dad and I were playing Yahtzee, and he rolled five 5s on his first roll of the turn. He was absolutely stunned, and at the time I was young enough that I didn't understand just how unlikely it was. If I only I knew that I was playing against the best dice thrower!
For literature search that might be OK. It doesn't need to replace any other tools, and if one time in ten it surfaces something you wouldn't have found otherwise, it could be worth the time spent on the dud attempts.
This is a great post on many levels, but what struck me as particularly clever was the use of lm_head to decode the outputs of earlier layers. That linear layer is only trained to decode the output of the last layer, so intuitively it might only be able to do that -- the embedding spaces used between earlier layers might be different and "incompatible". It's really interesting that that is not the case.
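Out of curiosity, here's a minimal sketch of what I think that trick looks like with a stock Hugging Face GPT-2 -- my own reconstruction, not the post's code, so the details may differ:

```python
# Decode every layer's hidden state with the (final-layer-trained) lm_head.
# My reconstruction using Hugging Face transformers, not the post's actual code.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):          # embeddings + each block's output
    logits = model.lm_head(model.transformer.ln_f(h))  # reuse final norm + lm_head
    top_token = tok.decode(logits[0, -1].argmax().item())
    print(f"layer {layer:2d}: {top_token!r}")
```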
Post author here. I agree 100%! The post is the basic maths for people digging into how LLMs work under the hood -- I wrote a separate one for non-techies who just want to know what they are, at https://www.gilesthomas.com/2025/08/what-ai-chatbots-are-doi...