
Can you copy a neural network, train each copy on a different part of the dataset, and merge them back together somehow?



No. Training produces an offset relative to the starting point. If you distribute it from the same point, you end up with a bunch of unrelated offsets. It has to be serial: the output state of one training step is the input state of the next.

If you could do it, we'd already have SETI-like networks for AI.


I haven't touched this in a while, but you can train NNs in a distributed fashion, and what the GP described is roughly the most basic version of data parallelism: there is a copy of the model on each node, each node receives a batch of data, and the gradients get synchronized after each batch (so the copies again start from the same point, as you mention).

Most modern large models cannot be trained on one instance of anything (GPU, accelerator, whatever), so there's no alternative to distributed training. They also wouldn't even fit in the memory of one GPU/accelerator, so there are even more complex ways to split the model itself across instances.
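A minimal sketch of that basic data-parallel setup, assuming PyTorch's DistributedDataParallel launched via torchrun; Net and make_dataloader are hypothetical placeholders, not real library objects:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train():
        # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR for each process
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = Net().cuda()                        # identical copy of the model on every rank
        model = DDP(model, device_ids=[local_rank])
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)
        loss_fn = torch.nn.CrossEntropyLoss()

        # each rank reads a different shard of the dataset
        for x, y in make_dataloader(rank=dist.get_rank()):
            opt.zero_grad()
            loss = loss_fn(model(x.cuda()), y.cuda())
            loss.backward()                         # DDP all-reduces (averages) gradients across ranks here
            opt.step()                              # every copy applies the same update and stays in sync

        dist.destroy_process_group()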


And what is their bottleneck? Data transfer. The state is gigantic and needs to be synchronized frequently. That's why it only works with sophisticated, ultra-high-bandwidth, specialized interconnects. They employ some tricks here and there, but those don't scale that well; e.g. with MoE you get a factor-of-8 scaling, and it comes at the cost of a lower overall number of parameters. They of course do as much parallelism as they can at the model/data/pipeline levels, but it's a struggle even with the fastest interconnects on the planet. Those techniques don't transfer to the kinds of networks normal people use, and using the word "distributed" to describe both settings conflates two situations with dramatically different properties. It's a bit like saying you could make the L1 or L2 CPU cache bigger by connecting multiple CPUs with a network cable. It doesn't work like that.

You can't scale averaging of parallel runs very far. You need to churn through iterations fast.

You can't, for example, start with a random state, schedule a bunch of parallel training runs, average it all out, and expect to end up with a well-trained network in one step.

Every step invalidates the input state for everything, and the state is gigantic.

It's dominated by huge transfers at high frequency.

You can't, for example, have 2 GPUs connected by a network cable and expect a speedup. You need to put them on the same motherboard to see any gains.
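A back-of-envelope illustration (numbers are my own assumptions, not measurements): take a 7B-parameter model with fp16 gradients synchronized once per step, and compare a commodity network link with a datacenter-class interconnect:

    # Rough arithmetic, all numbers assumed: 7B params, 2 bytes/gradient, one sync per step.
    params = 7e9
    payload_gb = params * 2 / 1e9            # ~14 GB of gradients to move per step

    links_gb_per_s = {
        "1 Gbit/s Ethernet": 0.125,          # home/office networking
        "NVLink-class interconnect": 900.0,  # order of magnitude for modern GPU servers
    }
    for name, bw in links_gb_per_s.items():
        print(f"{name}: ~{payload_gb / bw:.3f} s per step just moving gradients")
    # roughly 112 s/step over Ethernet vs ~0.016 s/step on the fast interconnect,
    # and that cost is paid on every single optimizer step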

SETI, for example, is unlike that - it can be easily distributed: a partial read-only snapshot, intense computation, a thin result submission.


Not disputing all of that, but telling the GP flat out "no" is incorrect, especially when distributed training and inference are the only way to run modern massive models.


Inference - you can distribute much better than training. You don't need specialized interconnects for inference.

The question was:

> > There is probably a simple answer to this question, but why isn't it possible to use a decentralized architecture like in crypto mining to train models?

> Can you copy a neural network, train each copy on a different part of the dataset, and merge them back together somehow?

The answer to that is a flat-out no.

That doesn't mean parallel computation doesn't happen. Everything, including a single GPU, is massively parallel computation.

Does copying happen? Yes, but it's short-lived and it dominates, i.e. data transfer is the bottleneck and they go out of their way to avoid it.

Distributing training in a decentralized-architecture fashion is not possible.


As mentioned, this is difficult. AFAIK the main reason is that the power of neural nets comes from the non-linear functions applied at each node ("neuron"), and thus there's nothing like the superposition principle[1] to easily combine training results.

The lack of superposition means you can't efficiently train one layer separately from the others either.

That being said, a popular non-linear function in modern neural nets is ReLU[2], which is piecewise linear, so perhaps there's some cleverness one can do there.

[1]: https://en.wikipedia.org/wiki/Superposition_principle

[2]: https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
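A toy numeric illustration of that missing superposition (made-up numbers, a single weight followed by a ReLU): averaging the weights of two "trained" networks is not the same as averaging their outputs.

    import numpy as np

    def relu_net(w, x):
        # one "neuron": a single weight followed by a ReLU non-linearity
        return np.maximum(0.0, w * x)

    x = 1.0
    w1, w2 = 3.0, -5.0                              # two separately "trained" weights

    avg_of_outputs = (relu_net(w1, x) + relu_net(w2, x)) / 2   # (3 + 0) / 2 = 1.5
    output_of_avg = relu_net((w1 + w2) / 2, x)                 # relu(-1)    = 0.0

    print(avg_of_outputs, output_of_avg)            # 1.5 vs 0.0: merging weights != merging models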


There are a lot of issues with federated learning.

It really depends on your problem, but in practice the answer is usually "no".


There are multiple ways to train in parallel, and that's one of them:

https://pytorch.org/tutorials//distributed/home.html
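For the data-parallel tutorials there, the launch typically looks something like this (train.py is a placeholder script name):

    torchrun --nnodes=1 --nproc_per_node=4 train.py

torchrun starts one process per GPU and sets the RANK/LOCAL_RANK/WORLD_SIZE/MASTER_ADDR environment variables that torch.distributed.init_process_group() reads.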





