
My educated guess is that they use an MoE-style model similar to the Switch Transformer [0], combined with an encoding similar to that of Kosmos-1 [1] (with an "image" latch token and a ViT-style transformer to process images). As a result, the total parameter count is likely larger, but since not all parameters are active in any single forward pass, the raw count is not as meaningful.

[0]: https://arxiv.org/pdf/2101.03961.pdf

[1]: https://arxiv.org/pdf/2302.14045.pdf
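
To make the "more parameters, same compute per token" point concrete, here is a minimal sketch of Switch-style top-1 expert routing, assuming PyTorch. The class and parameter names (SwitchFFN, d_model, n_experts) are illustrative, not taken from either paper's code; it omits load-balancing losses and capacity limits.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwitchFFN(nn.Module):
        """Feed-forward block where each token is routed to exactly one expert."""
        def __init__(self, d_model: int, d_ff: int, n_experts: int):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # routing logits per token
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, d_model) -> flatten to (tokens, d_model)
            tokens = x.reshape(-1, x.size(-1))
            probs = F.softmax(self.router(tokens), dim=-1)
            gate, expert_idx = probs.max(dim=-1)  # top-1: one expert per token
            out = torch.zeros_like(tokens)
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():
                    # only the selected expert's weights touch these tokens
                    out[mask] = gate[mask].unsqueeze(-1) * expert(tokens[mask])
            return out.reshape_as(x)

    # Total parameters grow with n_experts, but each token passes through
    # exactly one expert, so per-token compute stays roughly constant.
    layer = SwitchFFN(d_model=64, d_ff=256, n_experts=8)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])

In the Kosmos-1-style setup, image patches would be embedded by a ViT-style encoder and spliced into the token sequence between special image-boundary tokens before layers like the one above see them.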


