
My educated guess is that they use an MoE-style model similar to the Switch Transformer [0], combined with an encoding similar to that of Kosmos-1 [1] (with an "image" latch token and a ViT-style transformer to process images). As a result, the total parameter count is likely larger, but since not all parameters are active in any single forward pass, the raw count is not as meaningful.

[0]: https://arxiv.org/pdf/2101.03961.pdf

[1]: https://arxiv.org/pdf/2302.14045.pdf
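
To make the "more parameters, same compute per token" point concrete, here is a minimal sketch of Switch-style top-1 expert routing, assuming PyTorch. The class and parameter names (SwitchFFN, d_model, n_experts) are illustrative, not taken from either paper's code; it omits load-balancing losses and capacity limits.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwitchFFN(nn.Module):
        """Feed-forward block where each token is routed to exactly one expert."""
        def __init__(self, d_model: int, d_ff: int, n_experts: int):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # routing logits per token
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, d_model) -> flatten to (tokens, d_model)
            tokens = x.reshape(-1, x.size(-1))
            probs = F.softmax(self.router(tokens), dim=-1)
            gate, expert_idx = probs.max(dim=-1)  # top-1: one expert per token
            out = torch.zeros_like(tokens)
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():
                    # only the selected expert's weights touch these tokens
                    out[mask] = gate[mask].unsqueeze(-1) * expert(tokens[mask])
            return out.reshape_as(x)

    # Total parameters grow with n_experts, but each token passes through
    # exactly one expert, so per-token compute stays roughly constant.
    layer = SwitchFFN(d_model=64, d_ff=256, n_experts=8)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])

In the Kosmos-1-style setup, image patches would be embedded by a ViT-style encoder and spliced into the token sequence between special image-boundary tokens before layers like the one above see them.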


