It's actually a 14.3B parameter model. It's irritating that they don't follow the convention of naming the model after its total size. Qwen1.5-MoE-A2.7B is named for the 2.7B activated parameters. I guess it helps obfuscate the total size, given that it performs about as well as Mistral 7B.
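Back-of-the-envelope, the two numbers just count different things: total parameters include every expert, while "activated" parameters count only the experts a token actually routes through plus the shared layers. A quick sketch of that accounting (the moe_params helper and all the figures in it are made up for illustration, not Qwen's actual config):

    # Hypothetical MoE parameter accounting (illustrative numbers only,
    # not Qwen's real architecture).
    def moe_params(shared, per_expert, n_experts, k):
        total = shared + n_experts * per_expert        # everything you must store
        activated = shared + k * per_expert            # what one token actually uses
        return total, activated

    # e.g. 1.5B shared params, 60 experts of ~0.21B each, 4 active per token
    total, activated = moe_params(1.5e9, 0.21e9, 60, 4)
    print(f"total ~{total/1e9:.1f}B, activated ~{activated/1e9:.1f}B")
    # total ~14.1B, activated ~2.3B  (ballpark only)

So a model in that rough shape ends up with ~14B total parameters while marketing itself on the ~2-3B it activates per token.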
Something tells me that image models are small enough that it's easier to just keep your differently tuned models sitting side by side, so you can easily swap between them and run inference, rather than combining them into one model.
That's the point of MoE: sacrificing VRAM for compute/RAM bandwidth, which makes it a harder sell for consumer devices but an easier one for server hardware, where things are more likely to be compute or memory-bandwidth bound.
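A minimal top-k MoE layer makes the tradeoff concrete (this is a generic PyTorch sketch, not Qwen's actual code): every expert's weights have to stay resident in memory, but each token only runs through k of them, so the FLOPs and weight reads per token scale with k rather than with the total expert count.

    # Minimal top-k MoE sketch: all experts live in memory (the VRAM cost),
    # but each token only computes through k of them (the compute saving).
    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts)
            # Every expert's weights must be resident, even those a token skips.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                       # x: (tokens, d_model)
            scores = self.router(x)                 # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            # Only the k selected experts do any work for a given token.
            for slot in range(self.k):
                for e in range(len(self.experts)):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
            return out

    x = torch.randn(16, 64)
    print(TinyMoE()(x).shape)   # torch.Size([16, 64])

On a server you can batch enough tokens that all experts stay busy and the extra weights are cheap to hold; on a consumer GPU the idle experts are just VRAM you paid for and mostly don't use.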
It scores higher on MMLU (62.5 vs. 56.7 for phi-2) and GSM8K (61.5 vs. 61.1): https://www.microsoft.com/en-us/research/blog/phi-2-the-surp... The phi-2 numbers are 5-shot MMLU and 8-shot GSM8K. The blog post doesn't get that specific for Qwen, but it's very likely they tested the same way.