And their bottleneck is what? Data transfer. State is gigantic and needs to be f...

mendigou · on May 14, 2024

Not disputing all of that, but telling the GP flat out "no" is incorrect, especially when distributed training and inference are the only way to run modern massive models.

mirekrusin · on May 15, 2024

Inference - you can distribute much better than training. You don't need specialized interconnects for inference.

The question was:

> > There is probably a simple answer to this question, but why isn't it possible to use a decentralized architecture like in crypto mining to train models?

> Can you copy a neural network, train each copy on a different part of the dataset, and merge them back together somehow?

The answer is flat out no.

It doesn't mean parallel computation doesn't happen. Everything, including single gpu, is massively parallel computation.

Does copying happen? Yes, but it's short lived and dominates, ie. data transfer is bottleneck and they go out of their ways to avoid it.

Distributing training in decentralized architecture fashion is not possible.