I know a lot depends on architecture and number representation, but do people have a sense of how big a compute cluster is needed to train these classes of models, from 1.5B and 3B up through 7B, 13B, and 70B parameters?
Didn’t Meta say they trained on 2k A100s for Llama 2?
I'm not an ML engineer, just interested in the space, but as a general ballpark: training these models from scratch takes hundreds to thousands of GPUs.
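If you want a rough sanity check on that ballpark, here's a back-of-envelope sketch. The assumptions are mine, not from anyone above: training cost of roughly 6 × parameters × tokens FLOPs, an A100's BF16 peak of ~312 TFLOPS, and ~40% hardware utilization, which is on the optimistic end for large clusters.

```python
# Back-of-envelope: how many GPUs to pretrain a dense transformer in a given time.
# Assumptions (mine): ~6 * params * tokens training FLOPs,
# A100 BF16 peak of 312 TFLOPS, ~40% model FLOPs utilization (MFU).

A100_PEAK_FLOPS = 312e12   # dense BF16/FP16 peak for one A100
MFU = 0.40                 # assumed fraction of peak actually achieved

def gpus_needed(params_b: float, tokens_b: float, days: float) -> float:
    """GPUs needed to finish training in `days` of wall-clock time."""
    total_flops = 6 * (params_b * 1e9) * (tokens_b * 1e9)
    flops_per_gpu = A100_PEAK_FLOPS * MFU * days * 86_400
    return total_flops / flops_per_gpu

# e.g. 7B on 2T tokens in 30 days, 70B on 2T tokens in 60 days
print(f"7B,  2T tokens, 30 days: ~{gpus_needed(7, 2000, 30):,.0f} GPUs")
print(f"70B, 2T tokens, 60 days: ~{gpus_needed(70, 2000, 60):,.0f} GPUs")
```

With those assumptions you get on the order of a few hundred A100s for a 7B model and a couple of thousand for a 70B model trained on ~2T tokens in a reasonable wall-clock time, which lines up with the "hundreds to thousands" range and the ~2k A100 figure mentioned above. Change the token budget, precision, or utilization and the numbers move proportionally.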