
Anyone who’s done large-scale model training like this, can you shed light on the following questions:

What is the process like? Do you prototype locally? How do you gain confidence that the only limitation to good results is more compute, and not the model architecture or the applicability of deep learning to the particular task? At what point do you decide that shelling out many tens of thousands of dollars is OK? How often do you run large-scale training only to get unimpressive results, with the money wasted?




There’s a natural way to parallelize these models so that using 128 GPUs is the same as a 128x batch size. You can similarly simulate a 128x batch size by accumulating gradients over many small batches before stepping the optimizer. So you can test on just one or a few GPUs before you run the full thing.
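A minimal sketch of that gradient-accumulation trick (PyTorch; the toy model and random data are my assumptions, not anything from the paper): accumulate gradients over 128 small batches before each optimizer step, so one update behaves like a 128x larger batch on a single device.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 1)                        # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 128                               # small batches per "big" batch

optimizer.zero_grad()
for step in range(accum_steps * 4):             # 4 simulated big-batch updates
    x = torch.randn(8, 32)                      # one small batch of 8 examples
    loss = (model(x) - 1.0).pow(2).mean()
    (loss / accum_steps).backward()             # scale so gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()                        # one step ~ one 1024-example batch update
        optimizer.zero_grad()
```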

By that point you know it’s going to work, it’s just a matter of how well and whether you could’ve done nominally better with different tuning.

There’s been enough research leading up to this paper to expect that just scaling up would pay off.


Thanks.

>By that point you know it’s going to work, it’s just a matter of how well and whether you could’ve done nominally better with different tuning.

This can't be true in all cases, right? I'm assuming that many results which look promising at small scale turn out unimpressive when scaled up. I'm very curious what the trials-to-success rate of publishable results is once big compute is thrown into the mix.


It’s indeed a very high trials-to-success ratio. Again, though, there are enough papers preceding this one that you could have good confidence in the effort. Another thing that helps is that orgs like OpenAI have their own servers, rather than renting EC2 instances.

You also don’t just launch that many jobs and then ignore them. You monitor the run to make sure nothing is going terribly wrong.
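A sketch of the kind of sanity check that monitoring implies (the function name and thresholds are my assumptions, just to illustrate the idea): watch the loss for NaNs or a sudden blow-up so a broken run can be caught early instead of burning compute.

```python
import math

def run_looks_healthy(loss_value, recent_losses, blowup_factor=10.0):
    """Return False if the run appears broken and should be inspected or stopped."""
    if math.isnan(loss_value) or math.isinf(loss_value):
        return False                      # numerical failure
    if recent_losses and loss_value > blowup_factor * min(recent_losses):
        return False                      # loss suddenly exploded relative to recent best
    return True
```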

But yeah, there’s also the fact that if you’re Google, throwing $2M worth of compute at something becomes worth it for some reason (e.g. StarCraft).


I doubt 1.5B params will fit on any single GPU. I think they spread parts of the model across GPUs/TPUs, similarly to Mesh-TensorFlow: https://arxiv.org/abs/1811.02084
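For illustration, a naive form of model parallelism (PyTorch; layer sizes and device names are my assumptions, and this requires two GPUs): place different parts of the network on different devices and move activations between them, because the whole model won't fit on one GPU. Mesh-TensorFlow goes further by splitting individual tensors across devices, but the basic idea is similar.

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 4096).to("cuda:0")  # first half on GPU 0
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")  # second half on GPU 1

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))                # ship activations to GPU 1

model = TwoDeviceModel()
out = model(torch.randn(8, 1024))                        # output lives on cuda:1
```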



