Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters (qwenlm.github.io)
104 points by GaggiX on March 29, 2024 | 10 comments



It's actually a 14.3B parameter model. It's irritating that they don't follow the convention of naming the model according to size. Qwen1.5-MoE-A2.7B is named for the 2.7B activated parameters. I guess it helps to obfuscate the size, given that it performs about as well as Mistral 7B.
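If it helps, here's a back-of-the-envelope sketch of why the two numbers diverge: every expert counts toward the total, but each token is only routed through a small subset, so only that subset counts toward the "activated" figure. The layer sizes and expert counts below are made-up placeholders, not Qwen's actual config.

    def moe_param_counts(d_model, d_ff, n_layers, n_experts, n_active, other_params):
        # Each MoE layer stores n_experts FFN experts but routes each token
        # through only n_active of them; attention/embeddings ("other_params")
        # are always active.
        ffn_params = 2 * d_model * d_ff  # up- and down-projection per expert
        total = n_layers * n_experts * ffn_params + other_params
        activated = n_layers * n_active * ffn_params + other_params
        return total, activated

    # Placeholder config, NOT the real Qwen1.5-MoE-A2.7B hyperparameters.
    total, activated = moe_param_counts(
        d_model=2048, d_ff=5632, n_layers=24,
        n_experts=8, n_active=2, other_params=1.5e9,
    )
    print(f"total ~= {total / 1e9:.1f}B, activated ~= {activated / 1e9:.1f}B")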


Silly question, but is something similar to MoE possible for diffusion models like Stable Diffusion / DALL-E?


Something tells me that image models are small enough that it's easier to just keep your differently tuned models sitting side by side, so you can swap between them and run inference on each one, rather than combining them into a single model.


1/3 the "activated parameters", while also requiring 2x the VRAM.


That's the point of MoE. You sacrifice VRAM to save compute/RAM bandwidth, which makes it a harder sell for consumer devices but an easier one for servers, where things are more likely to be compute- or bandwidth-bound.
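Rough numbers to illustrate that trade-off, using the 14.3B-total / 2.7B-activated figures from this thread and assuming fp16 weights (2 bytes per parameter). The "GB read per token" figure only applies in the small-batch, bandwidth-bound regime where each generated token reads the activated weights roughly once.

    BYTES_PER_PARAM = 2  # fp16/bf16 weights

    def vram_and_traffic(total_params, activated_params):
        vram_gb = total_params * BYTES_PER_PARAM / 1e9
        # Bandwidth-bound regime: each token reads ~the activated weights once.
        gb_per_token = activated_params * BYTES_PER_PARAM / 1e9
        return vram_gb, gb_per_token

    print("dense 7B:  %.1f GB VRAM, %.1f GB read/token" % vram_and_traffic(7.0e9, 7.0e9))
    print("MoE A2.7B: %.1f GB VRAM, %.1f GB read/token" % vram_and_traffic(14.3e9, 2.7e9))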


How does it compare to phi-2?


Higher on MMLU (62.5 vs 56.7 for phi-2) and GSM8k (61.5 vs 61.1). https://www.microsoft.com/en-us/research/blog/phi-2-the-surp... The phi-2 numbers are for 5-shot MMLU and 8-shot GSM8k. The blog post doesn't get that specific for Qwen, but it's very likely they tested the same way.


Does anyone know the correct template & EOS tokens for this?
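Not sure of the exact strings, but one way to check is to read them off the tokenizer config rather than guessing. This sketch assumes the chat variant is on the Hugging Face Hub as "Qwen/Qwen1.5-MoE-A2.7B-Chat" (check the blog post for the exact repo name):

    from transformers import AutoTokenizer

    # Repo name is an assumption; verify it against the blog post / Hub.
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat")
    print("EOS token:", tok.eos_token)
    print("Chat template:\n", tok.chat_template)

    # Render a sample conversation using the model's own template:
    messages = [{"role": "user", "content": "Hello!"}]
    print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))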



The demo link 404s.



