Hacker News | Chamix's comments

Try 10s of trillions. These days everyone is running 4-bit at inference (the flagship feature of Blackwell+), with the big flagship models running on recently installed Nvidia NVL72 Rubin clusters (and equivalent-ish world size for those rented Ironwood TPUs Anthropic also uses). Let's see, Vera Rubin racks come standard with 20 TB (Blackwell NVL72 with 10 TB) of unified memory, and NVFP4 fits 2 parameters per byte...
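To make the napkin math explicit (my arithmetic, assuming pure 4-bit weights and ignoring KV cache and activation overhead, so real deployable sizes are somewhat smaller):

```python
# Max total parameters that fit in a given HBM pool at 4-bit (NVFP4),
# i.e. 2 parameters per byte. Ignores KV cache / activation overhead.
def max_params_trillions(hbm_terabytes, bits_per_param=4):
    bytes_per_param = bits_per_param / 8
    return hbm_terabytes * 1e12 / bytes_per_param / 1e12

print(max_params_trillions(10))  # Blackwell NVL72 (10 TB) -> 20.0 (trillion)
print(max_params_trillions(20))  # Vera Rubin rack (20 TB) -> 40.0 (trillion)
```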

Of course, intense sparsification via MoE (and other techniques ;) ) lets total model size largely decouple from inference speed and cost (within the limits of world size via NVLink/TPU torus caps).

So the real mystery, as always, is the actual parameter count of the activated head(s). You can do various speed benchmarks and TPS tracking across likely hardware fleets, and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)

Comparing Opus 4.6 or GPT 5.4 thinking or Gemini 3.1 pro to any sort of Chinese model (on cost) is just totally disingenuous when China does NOT have Vera Rubin NVL72 GPUs or Ironwood V7 TPUs in any meaningful capacity, and is forced to target 8-GPU Blackwell systems (and worse!) for deployment.


Nobody is running 10s of trillion param models in 2026. That's ridiculous.

Opus is 2T-3T in size at most.


What do you think labs are doing with the minimum 10 TB of memory in NVL72 systems that were publicly reported to all start coming online in November/December of last year? And why would this 1 TB -> 10 TB jump matter so much for Anthropic, previously wholly dependent on TPUs to run Opus 4.x, if the models were 2-3T at 4-bit and could fit in the 8x B200 systems (1.5 TB = 3T params) widely deployed during the Opus 4 era?

You have presented a vibe-based rebuttal with no evidence or logic to outline why you think labs are still stuck in the single trillions of parameters (GPT-4 was ~1 trillion params!). Though, you have successfully Cunningham'd me into saying that while anything I publicly state is derived from public info, working in the industry itself is a helpful guide to point at the right public info to reference.


Could you point at some more public info about active parameter count? You said:

> and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)

I can see ~100B, but that would be near the same order of magnitude. I find ~1000B active parameters hard to believe.


Sorry if that was unclear, I did mean 100Bs as in the next order of magnitude. Even GPT-4 had ~220B active params, though the trend has been toward increased sparsification (lower activation:total ratio). GPT-4.5 is the only publicly facing model that approached 1T active parameters (an experiment to see if there was any value in the extreme inference cost of quadratically scaling compute with naïve-ish attention). Nowadays you optimize your head size to your attention kernel arch and obtain performance principally through inference-time scaling (generate more tokens) and parallel consensus (GPT Pro, Gemini Deep Think, etc.), both of which favor faster, cheaper active heads.

4o and other H100-era models did indeed drop their activated heads far smaller than GPT-4, down into the 10s of billions just like current Hopper-era Chinese open source, but it went right back up again post-Blackwell with the 10x L2 bump (for KV cache), in congruence with n log n attention mechanisms being refined. Similar story for Claude.

The fun speculation is wondering about the true size of Gemini 3's internals, given the petabyte+ world size of their homefield IronwoodV7 systems and Jim Keller's public penchant for envisioning extreme MoE-like diversification across hundreds of dedicated sub-models constructed by individual teams within DeepMind.


Well, for one, Anthropic mostly uses Google TPUs and Amazon Inferentia2 chips, not Nvidia NVL72s. That's because... Google and Amazon are major investors in Anthropic.

Secondly, you missed the entire AI industry trend of 2024-2025: the failure of the GPT-4.5 pretrain run and the pullback from GPT-4 to GPT-4 Turbo to GPT-4o (each smaller in parameter count than the last). GPT-4 is 1.6T, GPT-4 Turbo is generally considered 1/2 to 1/4 that, and GPT-4o is even smaller (details below).

Thirdly, we KNOW that GPT-4o runs on Microsoft Maia 100 hardware with 64GB per chip, which gives a hard limit on the size of GPT-4o and tells us that it's a much smaller distilled version of GPT-4. Microsoft says each server has 4 Maia 100 chips and 256GB total. We know Microsoft uses Maia 100s to serve GPT-4o for Azure! So we know that quantized GPT-4o fits in 256GB, and GPT-4 does not fit. It's not possible for GPT-4o to be some much larger model that requires a large cluster to serve - that would drop performance below what we see in Azure.

Fourthly, it is not publicly KNOWN, but leaks say that GPT-4o is 200b-300b in size, which also tells us that running GPT-4 sized models is nonsense. This matches the information from Microsoft Maia servers above.

Fifthly, OpenAI's Head of Research has since confirmed that o1, o3, and GPT-5 use the same pretrain run as 4o, so they would be the same size.[1] That means GPT-5 is not some 1T+ model! Semianalysis confirms that the only pretrain run since 4o is 4.5, which is a ~10T model but everyone knows is a failed run.

Sixthly, Amazon Bedrock and Google Vertex serve models at approximately similar memory bandwidths when calculating tokens/sec, giving ~4900GB/sec for Google Vertex. Opus 4.5 aligns very well with ~100B active params.

    42 tps for Claude Opus 4.6 https://openrouter.ai/anthropic/claude-opus-4.6
    143 tps for GLM 4.7 (32B active parameters) https://openrouter.ai/z-ai/glm-4.7
    70 tps for Llama 3.3 70B (dense model) https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
For GLM 4.7, that makes 143 * 32B = 4576B parameters per second, and for Llama 3.3, we get 70 * 70B = 4900B. There are calculations for Amazon Bedrock on the Opus 4.5 launch thread comparing it to gpt-oss-120b, with similar conclusions.
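The arithmetic above can be sketched as (purely illustrative: it assumes each generated token streams every active parameter once at roughly one byte each, which is only a crude model of real serving):

```python
# tokens/sec * active params (in billions) ~ effective memory bandwidth
# in GB/s, under the crude "1 byte per active parameter per token" model.
def implied_bandwidth_gb_s(tokens_per_sec, active_params_billions):
    return tokens_per_sec * active_params_billions

print(implied_bandwidth_gb_s(143, 32))  # GLM 4.7 (32B active)  -> 4576
print(implied_bandwidth_gb_s(70, 70))   # Llama 3.3 70B (dense) -> 4900

# Inverting at ~4900 GB/s for Opus 4.6's observed 42 tps:
print(round(4900 / 42))  # -> 117, i.e. a ~100B-class active parameter count
```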

Seventhly, Anthropic distilled Opus 4/4.1 to 4.5, which is why it runs ~3x faster than Opus 4 while costing 1/3 the price in terms of API fees.

Eighthly, no respectable model has a sparsity below 3% these days - ridiculously low sparsity gives you Llama 4. Every single cutting-edge model is around 3-5% sparsity. Knowing the active param count for Opus 4.5 gives you a very good estimate of total param count.

The entire AI industry is moving AWAY from multi-trillion-parameter models. Everything is about increasing efficiency with the amount of parameters you have, not hyperscaling like GPT-4.5 which was shown to be a bad way forward.

Nobody thinks Opus 4.5 is bigger than around 2T in size (so not 10T). Opus 4/4.1 may have been ~6T, but that's it. Any guess of 10T or above is patently ridiculous for both Opus 4/4.1 and Opus 4.5.

[1] https://x.com/petergostev/status/1995744289079656834


I appreciate the detailed comment! I took the day off and am bored so have a brain dump of a reply - basically I think we are talking past each other on two major points:

1. All the discussion about model size is CRITICALLY bisected into TOTAL model size vs ACTIVE parameter size (of a "head" in a Mixture of Experts). Everything you've said trend-wise is mostly accurate for ACTIVE parameter count, which is what determines inference cost and speed.

But I am primarily talking about TOTAL parameter count (which has to just fit inside cluster HBM). The total parameter count only affects training cost and has nothing to do with inference cost or speed. So there is no downside to making total parameter count as big as your inference cluster will fit.

2. You touch on distillation, and this heavily relates to the post-GPT-4 base model (call it 5th gen, if GPT-4 was 4th gen), which indeed was used for all models through GPT-5.1.

The actual base 5th-gen model was as large as OAI could fit on training clusters, and only then distilled down to whatever total size a release model targeted. The little secret with sparse MoE is that the entire model weights don't have to fit on a single HBM pool when training (again, plenty of public papers detailing techniques). This leads to the 2nd little secret: GPT-4.5 is ALSO using that same base model. As I said in another comment, 4.5 was an experiment in testing a huge ACTIVE parameter model (which again is all that determines cost and speed), not so much total (which is capped by inference cluster hardware anyway!). How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x bigger in total than everything else? But it's easy to serve a model with active parameters 10x bigger!

So this same huge 5th-gen base model was distilled down and RLed over and over again in different permutations and sizes to feed the whole OAI model lineup, from o4-mini to advanced voice to GPT-4.5, all the way until finally 5.2 starts using a new "6th gen" base model (with various failed base model trainings between 5th and 6th) (shallotpeat!).

Picking up misc pieces: yes, 4o was tiny when served at Q4, which is what Maia 100 did (with some Q6). We are still talking about a ~1T total model. Quantization, both static and dynamic, was the whole drive behind the GPT-4 Turbo variants, which led straight into 4o targeting an extremely economical deployment of the 5th-gen base. Economical was sorely needed (arrakis!) since this all was at the critical junction when 8xH100s had not quite been deployed at scale yet, but AI use was rocketing off to mainstream, so we had silly situations like Azure being forced to serve on 256GB clusters. (We could go into a whole separate spiel about quantization + history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8.)

But this DOES NOT mean o1 was tiny, which conveniently was deployed right when 8xH100s WERE available at scale. We split into the instant tree, where 4.1 was bigger than 4o and 5-instant was bigger than 4.1 etc. And the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking. Again, the ACTIVE counts were very small comparatively, especially as it let you cheaply experiment and then train with substantial inference compute required for RL training/unrolling! But there was no reason not to fit increasingly large distilled versions of the 5th-gen/6th-gen base models as the inference fleet buildouts (particularly in 2H 2025) came online! The same 5th and now 6th gen base models were refined and twisted (foundry!) into totally different end models and sizes.

I just think this really all comes down to total vs active, not understanding that a huge base model can be distilled into arbitrarily sized release models, and then bizarrely giving weight to Meta's completely incompetent Llama 4 training run (I was there, Gandalf!) as giving any sort of insight on what sparsity ratio cutting-edge labs are using. You cannot learn anything about total parameter size from active parameter count + its derivatives (token speed, cost, etc.)! But on this topic we could again diverge into an entire debate; I'll just say Google is likely doing 0.1%-OOM sparsity in some production configs (Jim Keller is basically shouting extreme sparsity from the rooftops!).

Brief rebuttal summary:

1. Incorrect as of late 2025. There's been plenty of public reporting about Anthropic's dissatisfaction with "Project Rainier". Dario talked about Nvidia compute candidly on the Dwarkesh interview!

2. Active vs Total

3. 4o is small, 4-bit 4o on Azure even smaller. 4o is 5th gen base distilled not gpt-4 distilled.

4. 256gb at Q4 fits 1T parameters! Active vs total

5. 5th gen pretrain / base model is huge! 4.5 uses the same base as 4o and 5.1! Can be shrunk to arbitrary size before RL/post training create finished model! Active vs total

6. Active vs total

7. Active vs total, also Ironwood/TPUv7 and Blackwell give much cheaper Q4 inference

8. Don't trust the Zuck

Anyway, it's all a mess and I don't think it's possible to avoid talking past each other or misunderstanding in semi-casual conversation. Even just today, Dylan Patel (who is extremely well informed!) was on the Dwarkesh podcast talking about 5.4-instant having a smaller active parameter count than GPT-4 (220B active), which is completely true, but it instantly gets misinterpreted on Twitter et al. as 5.4 being a smaller model than GPT-4, ignoring that 5.4-instant and 5.4-thinking are totally different models, etc. etc. Just too much nuance to easily convey.


1. Claiming that gpt-4o and gpt-4.5 came from the same training run is ridiculous; gpt-4.5 was not distilled from the same pretrain as 4o.

- Mark Chen has literally publicly said as much, it's a completely different pretrain run.

- And clearly, if openai had a good big base model before 4.5, they would have released it back in 2024.

"How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x total bigger than everything else?" through pipeline parallelism, not tensor parallelism. Don't need to synchronize an all-reduce across clusters. You lose tons of tokens/sec per user though. That's exactly what we see with gpt-4.5 in real life- slow ~10token/sec inference.

2. 4o was definitely not served fully at 4-bit/6-bit, and even at 4-bit a 1T model wouldn't fit in a Maia cluster with reasonable kv cache for users. You can't quant attention down to 4-bit/6-bit, that would give the model brain damage. A production environment would quant attention down to fp8 at most. Even local home users don't quant attention down to 4 bit. Unsloth UD Q4 quants usually quant attention to Q8. https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/blob/mai...

blk.0.attn_qkv.weight [2048, 8192] Q8_0

Also, Q4/Q6/Qwhatever are quants used by llama.cpp only, and nobody in a production environment would be using llama.cpp at all. So, saying "Qwhatever" is a clear indicator you have no clue what you're talking about.

Since 4o predates widespread MLA, they're clearly using GQA and thus you can estimate the size per token from an approximate attention head size. Note that Azure offers 4o with max context of 128k tokens. That's about 4-8gb kv cache at full context size. Even at 4bit (it's not at 4bit), 4o is 500b at most, if you actually want to serve customers! Providers do not do batch=1 inference, that would leave the GPU core idle while memory bandwidth is saturated. So they'd have to batch many users onto one machine, with all their kv caches resident in memory. There's just no way you can fit a 1T model with 8+ bit attention and a bunch of users' kv cache into 256GB, even if the ffn was fp4.
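The KV-cache estimate above can be sketched in a few lines (the GQA config below is purely hypothetical, chosen only to show the shape of the arithmetic, not a claim about 4o's actual architecture):

```python
# Per-user KV cache = 2 (K and V) * layers * kv_heads * head_dim
#                     * bytes_per_value * context_tokens
def kv_cache_gb(layers, kv_heads, head_dim, dtype_bytes, context_tokens):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * context_tokens / 1e9

# Hypothetical GQA config: 32 layers, 4 KV heads, head_dim 128, fp16 cache,
# at the full 128k context Azure offers:
print(round(kv_cache_gb(32, 4, 128, 2, 128_000), 1))  # -> 8.4 (GB per user)
```

Multiply that per-user figure by the batch size needed to keep the cores busy and it eats into the 256GB budget fast.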

3. Microsoft leaked the size of 4o, you know. And there's also other estimates. They all estimate 4o at around 200b. https://arxiv.org/pdf/2412.19260 or https://epoch.ai/gradient-updates/frontier-language-models-h...

4. "(We could go into a whole separate spiel about quantization +history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8)"

More accurately, most deployments are FP4 for the ffn, and still 8-bit or 16-bit for attention. And only the Chinese labs train at FP8. There's very little reason to train at FP8 when your AdamW states and gradients are still FP32 and FP16. And note that even Deepseek uses FP16/FP32 AdamW/gradients.

https://arxiv.org/pdf/2412.19437 That's Deepseek using FP8 live weight copy + FP32 master + FP32 grad + BF16 moments = 13 bytes per parameter. BF16 weights is 14 bytes per parameter. There's very little reason to use FP8 weights over BF16 weights during training; you don't save that much VRAM/compute, unless you're very desperate like Deepseek. Most labs now still train for W16A16 but apply QAT, not train at FP8. Even the Chinese labs do this now - Kimi K2.5 is BF16 native, and just quantizes the ffn down to int4 with QAT. You can tell, because Kimi K2.5 attention tensors are BF16 and not FP8.
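Spelling out that byte accounting (following the mixed-precision recipe described above):

```python
# Bytes per parameter during training = live weights + FP32 master copy
# + FP32 gradients + BF16 Adam moments (first + second).
fp8_recipe  = 1 + 4 + 4 + (2 + 2)  # FP8 live weights  -> 13 bytes/param
bf16_recipe = 2 + 4 + 4 + (2 + 2)  # BF16 live weights -> 14 bytes/param
print(fp8_recipe, bf16_recipe)     # -> 13 14
```

Which is the point: FP8 live weights shave only 1 of ~14 bytes per parameter of training state.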

5. "instant tree" "And the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking." What you're describing is a massive waste of money. Nobody's doing that. Each time you distill a model to a different size, you have to do that separately. That's a waste of compute. Nope, openai just took the same model, kept posttraining it more and more, and published some checkpoints. That's what everyone does. The various gpt-4o-2024-05-13 and gpt-4o-2024-08-06 and gpt-4o-2024-11-20 and gpt-5 and gpt-5.1 ... and o1 and o3 and gpt-5-thinking models are NOT different sizes.

Every lab takes a model and iterates on it, training it more and more. Creating a bunch of distills is expensive. Training compute is approximately Compute ≈ 6 * (number of active params) * (tokens trained). Posttraining is basically just throwing a few more tokens into the model and doing some forward and backward passes. I don't know how many tokens they trained on, but it's somewhere in the 10T to 100T range. Distillation compute ≈ [2 * (big model active params) + 6 * (small model active params)] * (tokens trained). This is way more expensive per token than training! There are fewer passes, but you don't get the value you think from distills.
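Those FLOP estimates as a sketch (using the standard ~6ND rule of thumb; the teacher/student sizes and token count below are placeholders, not claims about any particular model):

```python
# Rule of thumb: training compute ~ 6 * N_active * tokens (fwd + bwd).
def train_flops(active_params, tokens):
    return 6 * active_params * tokens

# Distillation adds a teacher forward pass (~2 * N_teacher) on top of the
# student's forward + backward (~6 * N_student) for every token.
def distill_flops(teacher_active, student_active, tokens):
    return (2 * teacher_active + 6 * student_active) * tokens

# Placeholder sizes: 1T-active teacher, 100B-active student, 10T tokens.
ratio = distill_flops(1e12, 100e9, 10e12) / train_flops(100e9, 10e12)
print(ratio)  # -> ~4.3x more compute per token than plain training
```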

Look at Deepseek! Deepseek V3? 671B total parameters checkpoint. R1? 671B total parameters checkpoint. V3 0324? 671B total parameters checkpoint. R1 0528? 671B total parameters checkpoint. V3.1 combined thinking and non-thinking? 671B total parameters checkpoint. V3.1 Terminus? 671B total parameters checkpoint. V3.2? 671B total parameters checkpoint.

6. Sparsity matters. Nobody currently is going below 1% sparsity.

MoE sparsity is just the ratio of active experts to total experts. Most labs settle on around 8 out of 256 (like Deepseek, GLM, etc.), aka ~3.1%. There's plenty of research showing that models break down at too high a sparsity, which is why total params are correlated with active params.

Also, please don't use the word "head" to refer to a MoE expert. The word "head" has a specific meaning in ML and it's not that. It's referring to the component in multi-head attention. That's like using the word "transmission" when talking about a car but not referring to the actual transmission. It's making you look really weird.

Actually, we know what architecture openai was using a few years ago, because openai released it. That was the whole point of gpt-oss. Notably, it uses mxfp4 for the MoE, but still uses BF16 for GQA attention, and it has 4-of-128-experts sparsity. Yes, even OpenAI realized that staying around ~3% sparsity is a good idea. And note that OpenAI clearly did not think quantizing attention was a good idea, even if they applied QAT to create an mxfp4 ffn.

Basically, you have no clue what you're talking about. You're somehow claiming that openai is doing a ton of distills, one for each of 4o/o1/o3/gpt-5/gpt-5.1 thinking and nonthinking, to different sizes... instead of just taking a model they already have, and doing more training and more checkpoints like everyone else. They'd be insane if they were doing that.


Do you have any clues to guess the total model size? I do not see any limitations to making models ridiculously large (besides training), and the Scaling Law paper showed that more parameters = more better, so it would be a safe bet for companies that have more money than innovative spirit.

> I do not see any limitations to making models ridiculously large (besides training)

From my understanding, the "besides training" is a big issue. As I noted earlier[1], Qwen3 was much better than Qwen2.5, but the main difference was just more and better training data. Qwen3.5-397B-A17B beat their 1T-parameter Qwen3-Max-Base; again, a large part of the change was more and better training data.

[1]: https://news.ycombinator.com/item?id=47089780


China is targeting H20 because that's all they were officially allowed to buy.

I generally agree; back-of-the-napkin math shows an H20 cluster of 8 GPUs * 96GB = 768GB = 768B parameters at FP8 (no NVFP4 on Hopper), which lines up pretty nicely with the sizes of recent open-source Chinese models.

However, I'd say it's relatively well assumed in realpolitik land that Chinese labs managed to acquire plenty of H100/H200 clusters and even meaningful numbers of B200 systems semi-illicitly before the regulations and anti-smuggling measures really started to crack down.

This does somewhat raise the question of how nicely the closed-source variants, of undisclosed parameter counts, fit within the 1.1 TB of H200 or 1.5 TB of B200 systems.


They do not have enough H200 or Blackwell systems to serve 1.6 billion people and the world, so I doubt it's in any meaningful number.

I assure you, the number of people paying to use Qwen3-Max or other similar proprietary endpoints is far less than 1.6 billion.

You don't need to assure me. It's a theoretical maximum.

You know, it sure does add some additional perspective to the original Anthropic marketing materia... ahem, I mean article, to learn that the CCC-compiled runtime for SQLite could potentially run up to 158,000 times slower than a GCC-compiled one...

Nevertheless, the victories continue to be closer to home.


Indeed, and you could in essence achieve the difference yourself with a different system prompt on 4o. What exactly is 4.5 contributing here in terms of a more nuanced intelligence?

The new RLHF direction (heavily amplified through scaling synthetic training tokens) seems to clobber any minor gains the improved base internet prediction might've added.


It's interesting to compare the cost of that original gpt-4 32k(0314) vs gpt-4.5:

$60/M input tokens vs $75/M input tokens

$120/M output tokens vs $150/M output tokens


Forgive me if I'm missing your existing realization (I did a quick check of your HN, reddit, twitter, LW), but I think the big deal with Sohu (Etched's chip) is that they have pivoted from "all model parameters hard-etched onto the chip" to "only transformer (matmul etc.) ops etched onto the chip".

Sohu does not have the LLaMA 70B weights directly lithographed onto the silicon, as you seem (?) to be implying with your attachment to that six-month-old post.

Seems like a sensible pivot; I'd imagine they're rather up to date on the pulse of dynamically updated nets potentially being a major feature in upcoming frontier models, as you've recently been commentating on. However, I'm not deep enough in it to be sure how much this removes their differentiation vs other AI accelerator startups.


I was thinking about the llm writing tool from Janus.


4chan already has a torrent out, of course.


The little secret is that the training run (meaning, creating the raw autocompleting multimodal token weights) for 5 ran in parallel with 4.


Luckily Eliezer has written hundreds of approachable essays on the development of his epistemic processes over at lesswrong.com so you too can learn rationality and derive the killeveryonism conclusion yourself.

(/s since this is the internet)


You are conflating Ilya's belief that the transformer architecture (with tweaks/compute optimizations) is sufficient for AGI with a belief that LLMs are sufficient to express human-like intelligence. Multi-modality (and the swath of new training data it unlocks) is clearly a key component of creating AGI, judging by Sutskever's interviews from the past year.


Yes, I read "Attention Is All You Need", and I understand that the multi-head generative pre-trained model talks about "tokens" rather than language specifically. So in this case, I'm using "LLM" as shorthand for what OpenAI is doing with GPTs. I'll try to be more precise in the future.

That still leaves disagreement between Altman and Sutskever over whether or not the current technology will lead to AGI or "superintelligence", with Altman clearly turning towards skepticism.


Fair enough, shame "Large Tokenized Models" etc never entered the nomenclature.


Some terms I've seen used for the technology:

Big-Data Statistical Models

Stochastic Parrots or parrot-tech

plausible sentence generators

glorified auto-complete

cleverbot

"a Blurry JPEG of the Web" <https://www.newyorker.com/tech/annals-of-technology/chatgpt-...>

and just plain ol' "machine learning"


Do you have a link to one of these talks?

