
Wild, considering GPT-4 is 1.8T.



Once benchmarks exist for a while, they become meaningless - even if nobody is specifically training on the test set, human decisions (what used to be called "graduate student descent") end up optimizing new models towards overfitting on benchmark tasks.


Also, the technological leader focuses less on the benchmarks


Interesting claim, is there data to back this up? My impression is that Intel and NVIDIA have always gamed the benchmarks.


NVIDIA needs T models not B models to keep the share price up.


Even the random seed can cause a big shift in HumanEval performance, if you know you know. And it's perfectly legal to just pick the one checkpoint that looks best on those benchmarks and move along.

HumanEval is meaningless regardless, those 164 problems have been overfit to a tee.

Hook this up to the LLM arena and we'll get a better picture of how powerful they really are.


"graduate student descent"

Ahhh that takes me back!


The original GPT4 may have been around that size (16x 110B).

But it's pretty clear GPT4 Turbo is a smaller and heavily quantized model.


Yeah, it’s not even close to doing inference on 1.8T weights for turbo queries.


Where did you find this number? Not doubting it, just want to get a better idea of how precise the estimate may be.


It's a really funny story that I comment about at least once a week because it drives me nuts.

1. After ChatGPT's release, influencer spam on Twitter claimed ChatGPT was one billion parameters and GPT-4 would be 1 trillion.

2. Semianalysis publishes a blog post claiming 1.8T sourced from insiders.

3. The way info diffusion works these days, everyone heard from someone else other than Semianalysis.

4. Up until about a month ago, you could confidently say "hey, it's just that one blog post" and work through it with people to trace their initial hearing of it back to the post.

5. An Nvidia press conference sometime in the last month used the rumor as an example with "apparently" attached, and now people will tell you Nvidia confirmed 1.8 trillion.

my $0.02: I'd bet my life GPT-4 isn't 1.8T, and I very much doubt it's over 1 trillion. Like, lightning striking the same person 3 times in the same week.


You're ignoring geohot, who is a credible source (is an active researcher himself, is very well-connected) and gave more details (MoE with 8 experts, when no-one else was doing production MoE yet) than the Twitter spam.


Geohot? I know enough people at OpenAI to know how 4 of them reacted when he started claiming 1T based on per-token latency in the ChatGPT web UI.

In general, not someone you wanna be citing with lengthy platitudes. He's an influencer who speaks engineer, and he's burned out of every community he's been in, acrimoniously.


Probably from Nvidia's GTC keynote: https://www.youtube.com/live/USlE2huSI_w?t=2995.

In the keynote, Jensen uses 1.8T in an example and suggests that this is roughly the size of GPT-4 (if I remember correctly).


I'm not OP, but George Hotz said on the Lex Fridman podcast a while back that it was an MoE of 8x 250B. Subtract out the duplicated attention weights, and you get something right around 1.8T.


I'm pretty sure he suggested it was a 16-way 110B MoE.


The exact quote: "Sam Altman won’t tell you that GPT 4 has 220 billion parameters and is a 16 way mixture model with eight sets of weights."
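
The quoted figures are a bit confusing, but either reading lands near the rumored total. This is pure arithmetic on the rumor; none of these numbers are confirmed:

  print(16 * 110e9 / 1e12)  # 1.76 -- a 16-way mixture of ~110B experts
  print(8 * 220e9 / 1e12)   # 1.76 -- "eight sets of weights" of ~220B each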


It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.

That might suggest that GPT-4 should be thought of as something like a 250B model. But there's also some selection for the remaining 1/8 of weights that are used by the chosen expert as being the "most useful" weights for that pass (as chosen/defined by the mixture routing), so now it feels like 250B is undercounting the parameter size, whereas 1.8T was overcounting it.

I think it's not really defined how to compare parameter counts with a MoE model.
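
A toy calculation makes the total-vs-active distinction concrete. All figures below are illustrative assumptions loosely based on the rumor, not known values for GPT-4:

  # Illustrative only: how "total" and "active" parameter counts diverge
  # for a hypothetical 8-expert MoE with shared attention.
  shared = 55e9              # always-active parameters (attention, embeddings) - assumed
  per_expert = 220e9         # parameters per expert FFN stack - assumed
  n_experts, top_k = 8, 2    # rumor says 8 experts; top-2 routing is typical

  total = shared + n_experts * per_expert    # ~1.8T "on disk"
  active = shared + top_k * per_expert       # ~0.5T touched per token

  print(f"total: {total/1e12:.2f}T, active per token: {active/1e12:.2f}T")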


But from an output quality standpoint the total parameter count still seems more relevant. For example, 8x7B Mixtral only executes 13B parameters per token, but it behaves comparably to 34B and 70B models, which tracks with its total size of ~45B parameters. You get some of the training and inference advantages of a 13B model, with the strength of a 45B model.

Similarly, if GPT-4 is really 1.8T you would expect it to produce output of similar quality to a comparable 1.8T model without MoE architecture.
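
For what it's worth, those Mixtral numbers roughly check out if you plug in the published architecture figures (32 layers, d_model 4096, FFN dim 14336, 2-of-8 routing); treat the split below as approximate:

  layers, d_model, d_ff = 32, 4096, 14336
  n_experts, top_k = 8, 2

  ffn_per_expert = 3 * d_model * d_ff * layers           # SwiGLU: gate, up, down projections
  total_cited = 46.7e9                                   # published total parameter count
  non_expert = total_cited - n_experts * ffn_per_expert  # attention, embeddings, norms

  active = non_expert + top_k * ffn_per_expert           # parameters touched per token
  print(f"per-expert FFN: {ffn_per_expert/1e9:.1f}B")    # ~5.6B
  print(f"active per token: {active/1e9:.1f}B")          # ~12.9B, vs ~46.7B total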


"For example 8x7B Mixtral only executes 13B parameters per token, but it behaves comparable to 34B and 70B models"

Are you sure about that? I'm pretty sure Miqu (the leaked Mistral 70b model) is generally thought to be smarter than Mixtral 8x7b.


What is the reason for settling on 7/8 experts for mixture of experts? Has there been any serious evaluation of what would be a good MoE split?


It's not always 7-8.

From Databricks: "DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."

https://www.databricks.com/blog/introducing-dbrx-new-state-a...
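
The "65x more possible combinations" figure in that quote is just the ratio of possible expert subsets:

  from math import comb

  dbrx = comb(16, 4)      # 1820 ways to pick 4 of 16 experts
  mixtral = comb(8, 2)    # 28 ways to pick 2 of 8 experts
  print(dbrx / mixtral)   # 65.0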


A 19" server chassis is wide enough for 8 vertically mounted GPUs next to each other, with just enough space left for the power supplies. Consequently 8 GPUs is a common and cost efficient configuration in servers.

Everyone seems to put each expert on a different GPU in training and inference, so that's how you get to 8 experts, or 7 if you want to put the router on its own GPU too.

You could also do multiples of 8. But from my limited understanding it seems like more experts don't perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other, and run these parts in different GPUs or different machines.


(For a model of GPT-4's size, it could also be 8 nodes with several GPUs each, each node comprising a single expert.)


I think it's almost certainly using at least two experts per token. It helps a lot during training to have two experts to contrast when putting losses on the expert router.


I actually can't wrap my head around this number, even though I have been working on and off with deep learning for a few years. The biggest models we've ever deployed in production still have less than 1B parameters, and the latency is already pretty hard to manage during rush hours. I have no idea how they deploy (multiple?) 1.8T models that serve tens of millions of users a day.


It's a mixture of experts model. Only a small part of those parameters are active at any given time. I believe it's 16x110B
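
A minimal sketch of top-k expert routing (toy sizes, plain NumPy, not anyone's production code) shows why only a fraction of the weights is touched for any given token:

  import numpy as np

  d_model, d_ff, n_experts, k = 64, 256, 16, 2
  rng = np.random.default_rng(0)

  router = rng.standard_normal((d_model, n_experts)) * 0.02
  experts = [
      (rng.standard_normal((d_model, d_ff)) * 0.02,   # up-projection
       rng.standard_normal((d_ff, d_model)) * 0.02)   # down-projection
      for _ in range(n_experts)
  ]

  def moe_layer(x):
      """x: (d_model,) single token. Runs it through only the top-k experts."""
      logits = x @ router
      top = np.argsort(logits)[-k:]                       # indices of chosen experts
      weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
      out = np.zeros_like(x)
      for w, i in zip(weights, top):
          up, down = experts[i]
          out += w * (np.maximum(x @ up, 0) @ down)       # only k of n expert FFNs used
      return out

  print(moe_layer(rng.standard_normal(d_model)).shape)    # (64,)

The rest of the expert weights sit idle for that token, which is how serving cost tracks the active parameter count rather than the total.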



