Once benchmarks have existed for a while, they become meaningless - even if it's not literally training on the test set, everyday development decisions (what used to be called "graduate student descent") end up optimizing new models towards overfitting on the benchmark tasks.
Even the random seed can cause a big shift in HumanEval performance - if you know, you know. It is perfectly legal to choose the one checkpoint that looks best on those benchmarks and move along.
HumanEval is meaningless regardless; those 164 problems have been overfit to a T.
Hook this up to the LLM arena and we'll get a better picture of how powerful these models really are.
It's a really funny story that I comment about at least once a week because it drives me nuts.
1. After the ChatGPT release, Twitter influencers spammed claims that ChatGPT is one billion parameters and GPT-4 is 1 trillion.
2. Semianalysis publishes a blog post claiming 1.8T sourced from insiders.
3. The way information diffuses these days, everyone heard it from someone other than Semianalysis.
4. Up until about a month ago, you could confidently say "hey, it's just that one blog post" and work through it with people to trace their initial hearing of it back to the post.
5. An Nvidia press conference sometime in the last month used the rumor as an example with "apparently" attached, and now people will tell you Nvidia confirmed 1.8 trillion.
my $0.02: I'd bet my life GPT-4 isn't 1.8T, and I very much doubt it's over 1 trillion. That'd be like lightning striking the same person 3 times in the same week.
You're ignoring geohot, who is a credible source (he's an active researcher himself and very well-connected) and gave more details (MoE with 8 experts, when no one else was doing production MoE yet) than the Twitter spam.
Geohot? I know enough people at OpenAI to know how four of them reacted at the time he started claiming 1T based on timing the per-token latency in the ChatGPT web UI.
In general, not someone you want to be citing with lengthy platitudes. He's an influencer who speaks engineer, and he's burned out of every community he's been in, acrimoniously.
I'm not OP, but George Hotz said on the Lex Fridman podcast a while back that it was an MoE of 8 experts at 250B each. Subtract out the duplication of attention weights, and you get something right around 1.8T.
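A quick back-of-the-envelope version of that arithmetic, using the rumor's numbers (the 250B-per-expert figure is from the rumor; the share of attention weights duplicated across experts is a purely illustrative assumption):

    # Back-of-the-envelope MoE total from the rumored figures.
    # All numbers are taken from the rumor or assumed for illustration, not confirmed specs.
    n_experts = 8
    params_per_expert = 250e9     # rumored ~250B per expert
    shared_attn_frac = 0.10       # assume ~10% of each expert is attention weights shared across experts

    naive_total = n_experts * params_per_expert                          # 2.0T if you just multiply
    duplicated = (n_experts - 1) * params_per_expert * shared_attn_frac  # attention counted 8 times instead of once
    deduped_total = naive_total - duplicated                             # ~1.8T after removing the duplication

    print(f"naive: {naive_total/1e12:.2f}T, deduped: {deduped_total/1e12:.2f}T")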
It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.
That might suggest that GPT-4 should be thought of as something like a 250B model. But there's also some selection for the remaining 1/8 of weights that are used by the chosen expert as being the "most useful" weights for that pass (as chosen/defined by the mixture routing), so now it feels like 250B is undercounting the parameter size, whereas 1.8T was overcounting it.
I think it's not really defined how to compare parameter counts with an MoE model.
But from an output-quality standpoint, the total parameter count still seems more relevant. For example, 8x7B Mixtral only executes ~13B parameters per token, but it behaves comparably to 34B and 70B models, which tracks with its total size of ~45B parameters. You get some of the training and inference advantages of a 13B model with the strength of a 45B model.
Similarly, if GPT-4 is really 1.8T you would expect it to produce output of similar quality to a comparable 1.8T model without MoE architecture.
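For a concrete sense of total vs. active parameters, here's a rough count for Mixtral 8x7B from its published config values (layer norms and other small terms are ignored, so the totals are approximate):

    # Rough parameter accounting for Mixtral 8x7B (dims from the public release config).
    dim, n_layers, ffn_dim = 4096, 32, 14336
    n_kv_heads, head_dim = 8, 128
    vocab = 32000
    n_experts, top_k = 8, 2

    attn_per_layer = 2 * dim * dim + 2 * dim * (n_kv_heads * head_dim)  # Wq, Wo + Wk, Wv (GQA)
    expert_per_layer = 3 * dim * ffn_dim                                # SwiGLU: gate, up, down
    shared = n_layers * attn_per_layer + 2 * vocab * dim                # attention + embed/unembed

    total = shared + n_layers * n_experts * expert_per_layer            # ~46.7B
    active = shared + n_layers * top_k * expert_per_layer               # ~12.9B used per token (top-2)
    print(f"total: {total/1e9:.1f}B, active per token: {active/1e9:.1f}B")

The ~45B total and ~13B active figures fall out directly; the "8x7B" name would suggest 56B, but attention and embeddings are shared across experts rather than duplicated, so the real total is smaller.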
From Databricks:
"DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."
A 19" server chassis is wide enough for 8 vertically mounted GPUs next to each other, with just enough space left for the power supplies. Consequently 8 GPUs is a common and cost efficient configuration in servers.
Everyone seems to put each expert on a different GPU in training and inference, so that's how you get to 8 experts, or 7 if you want to put the router on its own GPU too.
You could also do multiples of 8. But from my limited understanding, it seems like more experts don't perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other and run those parts on different GPUs or different machines.
I think it's almost certainly using at least two experts per token. It helps a lot during training to have two experts to contrast when putting losses on the expert router.
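For illustration, here is a minimal sketch of top-2 gating with a Switch-style load-balancing auxiliary loss; this is a generic MoE routing pattern, not anything OpenAI has published, and all names and shapes are invented for the example:

    # Minimal top-2 MoE router sketch (illustrative only).
    # Routing two experts per token gives the auxiliary loss two experts to compare,
    # which is one reason top-2 is common even though top-1 is cheaper.
    import torch
    import torch.nn.functional as F

    def top2_route(x, router_weight, num_experts):
        """x: (tokens, dim); router_weight: (dim, num_experts)."""
        logits = x @ router_weight                      # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(2, dim=-1)          # two experts per token
        gate = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize the two gate weights

        # Load-balancing loss: fraction of tokens dispatched to each expert
        # times the mean router probability for that expert.
        counts = F.one_hot(top_idx, num_experts).float().sum(dim=(0, 1))
        frac_tokens = counts / counts.sum()
        frac_probs = probs.mean(dim=0)
        aux_loss = num_experts * (frac_tokens * frac_probs).sum()
        return top_idx, gate, aux_loss

    x = torch.randn(16, 64)       # 16 tokens, hidden dim 64
    w = torch.randn(64, 8)        # router weights for 8 experts
    idx, gate, aux = top2_route(x, w, 8)
    print(idx.shape, gate.shape, aux.item())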
I actually can't wrap my head around this number, even though I have been working on and off with deep learning for a few years. The biggest models we've ever deployed in production still have less than 1B parameters, and the latency is already pretty hard to manage during rush hours. I have no idea how they deploy (multiple?) 1.8T models that serve tens of millions of users a day.