Distributing a Fully Connected Neural Network Across a Cluster (iamtrask.github.io)
30 points by iamtrask on Nov 25, 2014 | 6 comments


How is this on the front page? This is completely incoherent.

For anyone actually interested in techniques for multi-GPU DNN training, http://arxiv.org/pdf/1404.5997v2.pdf and the references therein are probably a good start.


This might also help... here are some slides graphically showing how the distribution works. http://prezi.com/hdctecihctdr/?utm_campaign=share&utm_medium...


Your condescension here is entirely unnecessary. Surely someone as qualified as you could have provided a more thoughtful and encouraging comment.


I apologize for the verbosity and density. Happy to answer questions, though. :)


The exposition is not very clear. What exactly do you mean when you say "No edges will be communicated over the network, only half of the nodes."? I'm puzzled, because a few sentences later, you claim "The only network IO that would be required would be sending each edge value to its respective node in Q."; so the edge values are actually communicated?

From what I've understood, you're suggesting that for every node in a layer, you colocate its edges on the same machine?


Precisely! I highly encourage checking out the slide-deck for a graphical representation.

For every node in every other layer, I colocate its edges on the same machine. In this way, when a group of, say, 10 nodes in layer 1 are each sending a weighted message to a single node in layer 2... they can pre-combine their messages (a weighted sum) and send only that single value over the network. This happens for every node in the second layer, reducing network I/O (this is the first optimization).
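To make the I/O saving concrete, here is a minimal NumPy sketch of that pre-combine step. This is not the author's code; the layer sizes and names (activations, partial_sums, etc.) are illustrative assumptions, and the "network send" is just a comment.

    import numpy as np

    # Sketch of the pre-combine idea: one machine holds a shard of
    # layer-1 nodes plus the weights of the edges leaving those nodes,
    # so it can reduce its contribution to each layer-2 node to a single
    # partial sum before anything touches the network.

    rng = np.random.default_rng(0)

    n_layer1, n_layer2 = 10, 4              # 10 local layer-1 nodes feeding 4 layer-2 nodes
    activations = rng.random(n_layer1)      # layer-1 node values held on this machine
    weights = rng.random((n_layer1, n_layer2))  # edges colocated with their source nodes

    # Naive scheme: every weighted edge message crosses the network.
    naive_messages = n_layer1 * n_layer2    # 40 values on the wire

    # Pre-combined scheme: sum the weighted messages locally per destination node,
    # then send one value per layer-2 node (this is what would go over the network).
    partial_sums = activations @ weights    # shape (n_layer2,)
    combined_messages = n_layer2            # only 4 values on the wire

    print(naive_messages, combined_messages, partial_sums)

The receiving machines would add up one partial sum per sending machine and then apply the nonlinearity, producing the same layer-2 values as if every edge message had been sent individually.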



