As mentioned, this is difficult. AFAIK the main reason is that the power of neural nets comes from the non-linear functions applied at each node ("neuron"), and thus there's nothing like the superposition principle[1] to easily combine training results.
The lack of superposition means you can't efficiently train one layer separately from the others either.
That being said, a popular non-linear function in modern neural nets is ReLU[2] which is piece-wise linear, so perhaps there's some cleverness one can do there.
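To make the superposition point concrete, here's a minimal sketch (assuming NumPy, with made-up input values) showing that ReLU is not additive, so outputs from separately trained pieces can't simply be summed:

    import numpy as np

    def relu(x):
        # ReLU: piece-wise linear, but still non-linear overall
        return np.maximum(0.0, x)

    a = np.array([-1.0, 2.0])
    b = np.array([3.0, -4.0])

    print(relu(a + b))        # [2. 0.]
    print(relu(a) + relu(b))  # [3. 2.]  -- not equal, so superposition fails

The two results differ whenever an input changes sign under addition, which is exactly what the non-linearity is there to do.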
[1]: https://en.wikipedia.org/wiki/Superposition_principle
[2]: https://en.wikipedia.org/wiki/Rectifier_(neural_networks)