But from an output quality standpoint, the total parameter count still seems more relevant. For example, Mixtral 8x7B only activates about 13B parameters per token, yet it performs comparably to 34B and 70B models, which tracks with its total size of roughly 47B parameters. You get some of the training and inference advantages of a 13B model with the strength of a ~47B one.
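
As a rough sanity check, here's a back-of-envelope count (a sketch only, assuming the publicly reported Mixtral 8x7B config: 32 layers, hidden size 4096, FFN size 14336, 8 experts with top-2 routing, grouped-query attention with 8 KV heads, 32k vocab; router weights and layer norms ignored as negligible):

    # rough Mixtral 8x7B parameter count (config values assumed, not official)
    layers, d_model, d_ff, vocab = 32, 4096, 14336, 32000
    n_experts, active_experts = 8, 2
    kv_dim = 8 * 128  # 8 KV heads of size 128 (GQA)

    attn   = 2 * d_model * d_model + 2 * d_model * kv_dim  # q/o + k/v projections
    expert = 3 * d_model * d_ff                             # SwiGLU: gate, up, down
    embed  = 2 * vocab * d_model                            # input embeddings + LM head

    total  = layers * (attn + n_experts * expert) + embed
    active = layers * (attn + active_experts * expert) + embed

    print(f"total:  {total / 1e9:.1f}B")   # ~46.7B
    print(f"active: {active / 1e9:.1f}B")  # ~12.9B

So "13B active / ~47B total" lines up with the usual figures of roughly 12.9B active and 46.7B total parameters.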

Similarly, if GPT-4 really is 1.8T parameters in total, you would expect it to produce output of similar quality to a dense (non-MoE) 1.8T model.

"For example 8x7B Mixtral only executes 13B parameters per token, but it behaves comparable to 34B and 70B models"

Are you sure about that? I'm pretty sure Miqu (the leaked Mistral 70B model) is generally thought to be smarter than Mixtral 8x7B.
