It's actually a 14.3B parameter model. It's irritating that they don't follow the convention of naming the model after its total size. Qwen1.5-MoE-A2.7B is named for the 2.7B activated parameters. I guess it helps obfuscate the total size, given that it performs about as well as Mistral 7B.
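Back-of-the-envelope, the two numbers just count different things: total parameters include every expert, while "activated" parameters count only the experts a token actually routes through plus the shared layers. A quick sketch of that accounting (the moe_params helper and all the figures in it are made up for illustration, not Qwen's actual config):

    # Hypothetical MoE parameter accounting (illustrative numbers only,
    # not Qwen's real architecture).
    def moe_params(shared, per_expert, n_experts, k):
        total = shared + n_experts * per_expert        # everything you must store
        activated = shared + k * per_expert            # what one token actually uses
        return total, activated

    # e.g. 1.5B shared params, 60 experts of ~0.21B each, 4 active per token
    total, activated = moe_params(1.5e9, 0.21e9, 60, 4)
    print(f"total ~{total/1e9:.1f}B, activated ~{activated/1e9:.1f}B")
    # total ~14.1B, activated ~2.3B  (ballpark only)

So a model in that rough shape ends up with ~14B total parameters while marketing itself on the ~2-3B it activates per token.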
Something tells me that image models are small enough that it's easier to just keep your differently tuned models sitting side by side, so you can easily swap between them and run inference, rather than combining them into one model.
That's the point of MoE: sacrificing VRAM for compute/RAM bandwidth, which makes it a harder sell for consumer devices but an easier one for server hardware, where things are more likely to be compute or memory-bandwidth bound.
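A minimal top-k MoE layer makes the tradeoff concrete (this is a generic PyTorch sketch, not Qwen's actual code): every expert's weights have to stay resident in memory, but each token only runs through k of them, so the FLOPs and weight reads per token scale with k rather than with the total expert count.

    # Minimal top-k MoE sketch: all experts live in memory (the VRAM cost),
    # but each token only computes through k of them (the compute saving).
    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts)
            # Every expert's weights must be resident, even those a token skips.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                       # x: (tokens, d_model)
            scores = self.router(x)                 # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            # Only the k selected experts do any work for a given token.
            for slot in range(self.k):
                for e in range(len(self.experts)):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
            return out

    x = torch.randn(16, 64)
    print(TinyMoE()(x).shape)   # torch.Size([16, 64])

On a server you can batch enough tokens that all experts stay busy and the extra weights are cheap to hold; on a consumer GPU the idle experts are just VRAM you paid for and mostly don't use.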
It scores higher on MMLU (62.5 vs. 56.7 for phi-2) and GSM8K (61.5 vs. 61.1): https://www.microsoft.com/en-us/research/blog/phi-2-the-surp... The phi-2 numbers are 5-shot MMLU and 8-shot GSM8K. The blog post doesn't get that specific for Qwen, but it's very likely they tested the same way.