Once benchmarks have existed for a while, they become meaningless - even if it's not literally training on the test set, everyday development decisions (what used to be called "graduate student descent") end up optimizing new models towards overfitting on the benchmark tasks.
Even the random seed can cause a big shift in HumanEval performance - if you know, you know. It is perfectly legal to choose the one checkpoint that looks best on those benchmarks and move along.
HumanEval is meaningless regardless; those 164 problems have been overfit to a T.
Hook this up to the LLM arena and we'll get a better picture of how powerful these models really are.
It's a really funny story that I comment about at least once a week because it drives me nuts.
1. After the ChatGPT release, Twitter influencers spammed claims that ChatGPT is one billion parameters and GPT-4 is 1 trillion.
2. Semianalysis publishes a blog post claiming 1.8T sourced from insiders.
3. The way information diffuses these days, everyone heard it from someone other than Semianalysis.
4. Up until about a month ago, you could confidently say "hey, it's just that one blog post" and work through it with people to trace their initial hearing of it back to the post.
5. An Nvidia press conference sometime in the last month used the rumor as an example with "apparently" attached, and now people will tell you Nvidia confirmed 1.8 trillion.
my $0.02: I'd bet my life GPT-4 isn't 1.8T, and I very much doubt it's over 1 trillion. That'd be like lightning striking the same person 3 times in the same week.
You're ignoring geohot, who is a credible source (he's an active researcher himself and very well-connected) and gave more details (MoE with 8 experts, when no one else was doing production MoE yet) than the Twitter spam.
Geohot? I know enough people at OpenAI to know how four of them reacted at the time he started claiming 1T based on timing the per-token latency in the ChatGPT web UI.
In general, not someone you want to be citing with lengthy platitudes. He's an influencer who speaks engineer, and he's burned out of every community he's been in, acrimoniously.
I'm not OP, but George Hotz said on the Lex Fridman podcast a while back that it was an MoE of 8 experts at 250B each. Subtract out the duplication of attention weights, and you get something right around 1.8T.
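A quick back-of-the-envelope version of that arithmetic, using the rumor's numbers (the 250B-per-expert figure is from the rumor; the share of attention weights duplicated across experts is a purely illustrative assumption):

    # Back-of-the-envelope MoE total from the rumored figures.
    # All numbers are taken from the rumor or assumed for illustration, not confirmed specs.
    n_experts = 8
    params_per_expert = 250e9     # rumored ~250B per expert
    shared_attn_frac = 0.10       # assume ~10% of each expert is attention weights shared across experts

    naive_total = n_experts * params_per_expert                          # 2.0T if you just multiply
    duplicated = (n_experts - 1) * params_per_expert * shared_attn_frac  # attention counted 8 times instead of once
    deduped_total = naive_total - duplicated                             # ~1.8T after removing the duplication

    print(f"naive: {naive_total/1e12:.2f}T, deduped: {deduped_total/1e12:.2f}T")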
It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.
That might suggest that GPT-4 should be thought of as something like a 250B model. But there's also some selection for the remaining 1/8 of weights that are used by the chosen expert as being the "most useful" weights for that pass (as chosen/defined by the mixture routing), so now it feels like 250B is undercounting the parameter size, whereas 1.8T was overcounting it.
I think it's not really defined how to compare parameter counts with an MoE model.
But from an output-quality standpoint, the total parameter count still seems more relevant. For example, 8x7B Mixtral only executes ~13B parameters per token, but it behaves comparably to 34B and 70B models, which tracks with its total size of ~45B parameters. You get some of the training and inference advantages of a 13B model with the strength of a 45B model.
Similarly, if GPT-4 is really 1.8T you would expect it to produce output of similar quality to a comparable 1.8T model without MoE architecture.
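For a concrete sense of total vs. active parameters, here's a rough count for Mixtral 8x7B from its published config values (layer norms and other small terms are ignored, so the totals are approximate):

    # Rough parameter accounting for Mixtral 8x7B (dims from the public release config).
    dim, n_layers, ffn_dim = 4096, 32, 14336
    n_kv_heads, head_dim = 8, 128
    vocab = 32000
    n_experts, top_k = 8, 2

    attn_per_layer = 2 * dim * dim + 2 * dim * (n_kv_heads * head_dim)  # Wq, Wo + Wk, Wv (GQA)
    expert_per_layer = 3 * dim * ffn_dim                                # SwiGLU: gate, up, down
    shared = n_layers * attn_per_layer + 2 * vocab * dim                # attention + embed/unembed

    total = shared + n_layers * n_experts * expert_per_layer            # ~46.7B
    active = shared + n_layers * top_k * expert_per_layer               # ~12.9B used per token (top-2)
    print(f"total: {total/1e9:.1f}B, active per token: {active/1e9:.1f}B")

The ~45B total and ~13B active figures fall out directly; the "8x7B" name would suggest 56B, but attention and embeddings are shared across experts rather than duplicated, so the real total is smaller.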
From Databricks:
"DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."
A 19" server chassis is wide enough for 8 vertically mounted GPUs next to each other, with just enough space left for the power supplies. Consequently 8 GPUs is a common and cost efficient configuration in servers.
Everyone seems to put each expert on a different GPU in training and inference, so that's how you get to 8 experts, or 7 if you want to put the router on its own GPU too.
You could also do multiples of 8. But from my limited understanding, it seems like more experts don't perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other and run those parts on different GPUs or different machines.
I think it's almost certainly using at least two experts per token. It helps a lot during training to have two experts to contrast when putting losses on the expert router.
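For illustration, here is a minimal sketch of top-2 gating with a Switch-style load-balancing auxiliary loss; this is a generic MoE routing pattern, not anything OpenAI has published, and all names and shapes are invented for the example:

    # Minimal top-2 MoE router sketch (illustrative only).
    # Routing two experts per token gives the auxiliary loss two experts to compare,
    # which is one reason top-2 is common even though top-1 is cheaper.
    import torch
    import torch.nn.functional as F

    def top2_route(x, router_weight, num_experts):
        """x: (tokens, dim); router_weight: (dim, num_experts)."""
        logits = x @ router_weight                      # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(2, dim=-1)          # two experts per token
        gate = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize the two gate weights

        # Load-balancing loss: fraction of tokens dispatched to each expert
        # times the mean router probability for that expert.
        counts = F.one_hot(top_idx, num_experts).float().sum(dim=(0, 1))
        frac_tokens = counts / counts.sum()
        frac_probs = probs.mean(dim=0)
        aux_loss = num_experts * (frac_tokens * frac_probs).sum()
        return top_idx, gate, aux_loss

    x = torch.randn(16, 64)       # 16 tokens, hidden dim 64
    w = torch.randn(64, 8)        # router weights for 8 experts
    idx, gate, aux = top2_route(x, w, 8)
    print(idx.shape, gate.shape, aux.item())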
I actually can't wrap my head around this number, even though I have been working on and off with deep learning for a few years. The biggest models we've ever deployed in production still have less than 1B parameters, and the latency is already pretty hard to manage during rush hours. I have no idea how they deploy (multiple?) 1.8T models that serve tens of millions of users a day.