We found evidence of specific layer-localized "reasoning" circuits in a few models last year too! A very much work-in-progress paper is here: https://openreview.net/forum?id=mTjGBrkdtz
Huh, there's that much limonene in Coca-Cola?! Limonene works as a very good… pesticide and herbicide! I did a research project on limonene about 10 years ago with my mentor, and it outperformed most commercial pesticides in controlled settings. It really can't be that great to ingest.
These metaphorical database analogies bug me, and judging by the comments, a lot of other people too! So far, some of the most reasonable explanations I've found that take training dynamics into account come from Lenka Zdeborova's lab (albeit in toy, linear-attention settings, though it's easy to see why they'd generalize to practical ones). For instance, this is a lovely paper: https://arxiv.org/abs/2509.24914
Hi! Did you ever end up running this reproduction? If so, could you also check whether the Putnam/IMO problems are in the training data, perhaps by having the model complete the problems n times? I would totally do this myself if I weren't GPU poor!
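For anyone wanting to try this kind of contamination check, here's a minimal sketch of what I have in mind (the `generate` callable wrapping the model, the half-split, and the 0.8 similarity threshold are all arbitrary choices on my part, not anyone's established protocol):

```python
from difflib import SequenceMatcher

def contamination_check(problem: str, generate, n: int = 10, threshold: float = 0.8) -> float:
    """Feed the model the first half of a problem statement and measure how
    often its completion closely reproduces the held-out second half.
    A high match rate suggests the problem text was memorized from training data."""
    half = len(problem) // 2
    prompt, held_out = problem[:half], problem[half:]
    matches = 0
    for _ in range(n):
        # truncate the completion to the length of the held-out text before comparing
        completion = generate(prompt)[: len(held_out)]
        similarity = SequenceMatcher(None, completion, held_out).ratio()
        if similarity >= threshold:
            matches += 1
    return matches / n
```

Obviously near-verbatim completion is only one signal; a model can have seen a problem and still paraphrase it, so a low score here doesn't rule out contamination.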
If anyone else is curious about which ARC-AGI public eval puzzles o3 got right vs wrong (and its attempts at the ones it did get right), here's a quick visualization: https://arcagi-o3-viz.netlify.app
Hi! I've been working on theorem proving systems for some time now. I would love to help out with an AlphaProof reproduction, but I can't reach you on discord for some reason!
Thank you, those insights are invaluable! This is a specific and potentially dumb question and I completely understand if you can't answer it!
The practical motivation for MoEs is very clear, but I do worry about the loss of compositional abilities (which I think just emerge from superposed representations?) that some tasks may require, especially with the many-experts phenomenon we're seeing. One observation from smaller MoE models (with top-k gating etc.) that may or may not scale: denser models trained to the same loss tend to perform complex tasks "better".
Intuitively, do you think MoEs are just another stopgap trick we're using while we figure out more compute and better optimizers, or could there be enough theoretical motivation to justify their continued use? If there isn't, perhaps we at least need to figure out "expert scaling laws" :)
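(For concreteness, by top-k gating I mean roughly the following, sketched in numpy for a single token; the softmax-over-selected-logits renormalization is one common choice, real implementations vary in where the softmax goes, batching, load balancing, etc.):

```python
import numpy as np

def moe_layer(x, gate_W, experts, k=2):
    """Minimal top-k gated MoE forward pass for one token.
    x: (d,) input; gate_W: (d, n_experts) router weights;
    experts: list of callables, each mapping (d,) -> (d,)."""
    logits = x @ gate_W                      # one router score per expert
    top = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected logits only
    # combine outputs of only the selected experts; the rest are never evaluated
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

The discrete argsort is exactly what worries me: each token only ever routes through k experts' feature spaces, so any "compositional" behavior that needed features from unselected experts to superpose has to be learned around the router.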
thanks for the thoughtful qtn! yeah i dont have the background for this one, you'll have to ask an MoE researcher (which Yi is not really either, as i found out on the pod). it does make sense that on a param-for-param basis MoEs would have less compositional ability, but i have yet to read a paper (mostly my fault, bc i dont read MoE papers that closely, but also researcher fault, in that they're not incentivized to be rigorous about downsides of MoEs) that really identified what these compositional abilities are that MoEs lose. if you could, for example, identify subcategories of BigBench or similar that require compositional abilities, then we might be able to get hard evidence on this question. i'm not yet motivated enough to do this myself but it'd make a decent small research question.
HOWEVER i do opine that MoEs are kiiind of a stopgap (both on the pod and on https://latent.space/p/jan-2024) - definitely a validated efficiency/sparsity technique (esp see deepseek's moe work if you havent already, with >100 experts https://buttondown.email/ainews/archive/ainews-deepseek-v2-b...) but mostly a one-off boost over the single small dense expert-equivalent model, rather than comparable to the capabilities of a large dense model of the same param count (aka i expect an 8x22B MoE to never outperform a 176B dense model ceteris paribus - which is difficult to get a like-for-like comparison on, bc these things are expensive, partially because usually the MoE is just upcycled instead of trained from scratch, and partially because the routing layer is deepening every month). so perhaps to TLDR: there is more than enough evidence and practical motivation to justify their continued use (i would go so far as to say that all inference endpoints incl gpt4 and above should be MoEs), but they themselves are not really an architectural decision that matters for the next quantum leap in capabilities
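(back-of-envelope on why "param-for-param" is slippery here - a naive 8x22B with top-2 routing stores the dense-equivalent param count but only activates a fraction of it per token; these numbers ignore shared attention/embedding params, which real MoEs do share, so treat them as an upper bound on the gap:)

```python
n_experts, expert_params_b, k = 8, 22, 2      # "8x22B", top-2 routing

total_b = n_experts * expert_params_b         # params stored: 8 * 22 = 176B
active_b = k * expert_params_b                # params used per token: 2 * 22 = 44B
print(total_b, active_b)                      # → 176 44
```

so the fair question is whether the MoE beats a 44B dense model (usually yes, that's the one-off boost) or a 176B dense model (i claim no), and those are very different bars.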
I'm not sure how to quantify how quickly or well humans learn in-context (if you know of any work on this I'd love to read it!)
In general, there is too much fluff and confusion floating around about what these models are and are not capable of (regardless of the training mechanism). I think more people need to read Song Mei's lovely slides[1] and related work by others. These slides are the best exposition I've found of neat ideas around ICL that researchers have been aware of for a while.