
I did have access to a cluster of GPUs through my professor's lab so compute wasn't as much of an issue.

Out of curiosity, what was the specific hardware you used? Some V100s, or maybe a DGX cluster?

Also, how many days did it take to get the loss down to acceptable levels? Did you aim for a loss of ~2.5, or less?

For now I'm trying to train it on 100 TPUv2-8s, thanks to TFRC. Unfortunately, each TPUv2-8 is roughly 11x slower than a K80 GPU, so it takes about 11 TPUs working in parallel just to match the throughput of a single GPU. On top of that, I average the parameters across the TPUs as quickly as possible, which still takes around 5 to 15 minutes per round. (Training continues in parallel with the averaging.)
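
For anyone curious, the averaging step is conceptually just an element-wise mean of each replica's weights. A minimal sketch in plain numpy (the load_checkpoint/push_weights helpers are hypothetical placeholders, not my actual training code):

    import numpy as np

    def average_params(replica_params):
        """Average parameter dicts from several replicas, key by key.

        replica_params: list of {name: np.ndarray} dicts, one per TPU,
        all with identical keys and shapes.
        """
        return {
            name: np.mean([p[name] for p in replica_params], axis=0)
            for name in replica_params[0]
        }

    # Hypothetical usage: pull the latest weights from each replica,
    # average them, then push the merged copy back to every replica.
    # merged = average_params([load_checkpoint(tpu) for tpu in tpus])
    # for tpu in tpus:
    #     push_weights(tpu, merged)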

It sort of seems to work, but it's hard to get the learning rate right. If it's set too high, various TPUs diverge. Too low and the loss stays constant.
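
One common middle ground is a warmup-then-decay schedule rather than a single fixed value; a rough sketch (the constants are placeholders, not values I've validated):

    import math

    def lr_schedule(step, base_lr=1e-4, warmup_steps=2000, decay_steps=100000):
        """Linear warmup, then cosine decay toward zero. Constants are placeholders."""
        if step < warmup_steps:
            # Ramp up gradually so individual TPUs don't diverge early on.
            return base_lr * step / warmup_steps
        # After warmup, decay smoothly over decay_steps.
        progress = min(1.0, (step - warmup_steps) / decay_steps)
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))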

But I imagine I'll crack it one of these days...




I used a DGX-1 to train; it took somewhere around 12-16 hours to get down to a loss of roughly 2.2-2.3.



