[Edit] I seem to have turned this into somewhat of an information dump...
As other commenters said, you typically find these out by trying them one by one and seeing what works. However, you can prune the search space considerably if you know a few things, which range from theory to large experimental results. For example, if Google or someone else widely deploys a certain configuration, other people just use that. If large experiments show that a particular setting for Adam works well for NLP, other people just use that when working on NLP problems. There was a large experiment showing that the best activation functions were of the form alpha * sigmoid(beta * x); sigmoid, tanh, and GELU are all roughly of this form. Stuff like this is, unfortunately, the majority of the knowledge. In fact, ReLU came into wide use without there even being a universal approximation theorem [1] for networks using it - the canonical one was proved for sigmoid activations. No one cared, because it worked in practice.
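To make the "try them one by one" part concrete, here is a minimal sketch of a pruned hyperparameter search in Python. The search space, the specific values, and the train_and_evaluate stub are hypothetical placeholders of my own, not anything from above; the point is only that you start from configurations known to work and search a small neighbourhood around them.

    import itertools, random

    # Start from settings that are widely reused, then search a small
    # neighbourhood instead of the full space.
    search_space = {
        "lr":         [3e-4, 1e-3],        # around the common Adam default
        "dropout":    [0.0, 0.1, 0.3],     # include 0.0: dropout may not help
        "activation": ["relu", "gelu"],    # both known to work in practice
    }

    def train_and_evaluate(config):
        """Hypothetical stub: train a model with `config`, return a validation score."""
        return random.random()  # stand-in for a real training run

    best_score, best_config = float("-inf"), None
    for values in itertools.product(*search_space.values()):
        config = dict(zip(search_space.keys(), values))
        score = train_and_evaluate(config)
        if score > best_score:
            best_score, best_config = score, config

    print(best_config, best_score)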
Typically, theoretical results are difficult to come by for a model structure as general as neural networks. Think about it: a theoretical result "for all neural networks" has very few logical statements, i.e. constraints, to work with that could then be combined to produce other statements. So you tend to see theoretical results for a subset of architectures, because the constraints that define that subset give us more to work with, and we can combine them in some way to produce a theorem or a proof. Then people find out empirically that the result works well for more general networks, too. An example of this type of result is "dropout". The empirical motivation for it was training an ensemble of networks for cheap. In an attempt to rest it on some theoretical grounding, it was shown that for linear models dropout is equivalent to adding noise to the input, which can be shown to be a good regularizer. But there is no such proof for more complex architectures. In practice it works anyway, but since you're not sure, you include it in your hyperparameter search.
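For concreteness, here is a minimal sketch of dropout as it is usually implemented ("inverted" dropout), assuming NumPy; this is the standard formulation, not something specific to the results mentioned above.

    import numpy as np

    def dropout(activations, p_drop=0.5, training=True, rng=None):
        """Inverted dropout: zero each unit with probability p_drop during training,
        then rescale the survivors so the expected activation is unchanged."""
        if not training or p_drop == 0.0:
            return activations
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(activations.shape) >= p_drop
        return activations * keep / (1.0 - p_drop)

    x = np.ones((2, 4))
    print(dropout(x, p_drop=0.5))   # roughly half the entries are 0, the rest are 2.0

Each training step samples a fresh mask, which is the cheap-ensemble intuition: every minibatch effectively trains a different subnetwork, and at test time the full network stands in for averaging over them.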
There is some good theoretical grounding for many regularization methods. My favorite is the proof that plain L2 regularization with SGD shrinks the unimportant features strongly while barely regularizing the important ones. You can also search "Stein's lemma neural networks". I found [2], a talk on this topic by Anima Anandkumar - always a good sign.
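As a reminder of what L2 regularization looks like mechanically, here is a minimal sketch of an SGD step with an L2 penalty (weight decay), assuming NumPy; the toy loss and data are placeholders of my own. The "important vs. unimportant features" statement is about how this decay interacts with the curvature of the loss: directions the loss barely cares about get shrunk toward zero, directions it cares about are mostly preserved.

    import numpy as np

    def sgd_step_l2(w, grad_loss, lr=0.1, l2=1e-2):
        """One SGD step on loss(w) + (l2/2) * ||w||^2.
        The penalty's gradient is l2 * w, i.e. plain weight decay."""
        return w - lr * (grad_loss + l2 * w)

    # Toy example: linear regression on a tiny batch.
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y = np.array([1.0, 2.0, 3.0])
    w = np.zeros(2)
    for _ in range(100):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5 * mean squared error
        w = sgd_step_l2(w, grad, lr=0.1, l2=1e-2)
    print(w)   # shrunk slightly toward zero relative to the unregularized solution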
For activation functions, it is mostly that experimental result that everyone relies on.
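As a quick aside on the sigmoid-family point, two relations are easy to check numerically (this snippet is just my illustration): tanh is an exact affine rescaling of the sigmoid, and the common fast approximation of GELU gates its input with a sigmoid, i.e. x * sigmoid(1.702 * x), the same shape as Swish's x * sigmoid(beta * x).

    import numpy as np
    from math import erf, sqrt

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-4, 4, 9)

    # tanh(x) == 2 * sigmoid(2x) - 1, exactly
    print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))        # True

    # GELU(x) = x * Phi(x) is approximately x * sigmoid(1.702 * x)
    gelu_exact = np.array([v * 0.5 * (1 + erf(v / sqrt(2))) for v in x])
    gelu_sigmoid = x * sigmoid(1.702 * x)
    print(np.max(np.abs(gelu_exact - gelu_sigmoid)))              # small, on the order of 0.02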
The universal approximation theorem [1] says that even a single hidden layer is enough to approximate any function. However, there is a practical difficulty in training these shallow networks, and deepening the network provides large efficiency advantages; notably, for certain classes of functions, it provides an exponential advantage (Eldan and Shamir 2016). There is a wishy-washy (IMHO) theory called the Information Bottleneck theory, which tries to show that multiple layers stack on top of each other, each uncovering one level of "hierarchy" in the data distribution. This is seen in practice (see StyleNet), but the theory is a little weak, again IMHO.
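For reference, the classical sigmoidal version of the theorem (Cybenko 1989) can be written out as below; this is the standard textbook statement, quoted from memory rather than from anything above.

    % Universal approximation, sigmoidal version (Cybenko 1989):
    % any continuous f on the unit cube can be matched within eps by a
    % single hidden layer of N sigmoid units.
    \[
      \forall f \in C([0,1]^n),\ \forall \varepsilon > 0,\ \exists N \in \mathbb{N},\
      v_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n \ \text{such that} \
      \sup_{x \in [0,1]^n} \Big| f(x) - \sum_{i=1}^{N} v_i\, \sigma(w_i^{\top} x + b_i) \Big| < \varepsilon .
    \]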
There are also a lot of tweaks made to the architecture in the name of preventing the "vanishing gradients" problem - a problem that arises because we use backpropagation to train these networks. There is _some_ theory to help understand this, coming out of random matrix theory, but I don't know much of it.
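A minimal sketch of why the problem appears, assuming a deep stack of sigmoid layers (my illustration, not the random matrix theory work): backpropagation multiplies one Jacobian per layer, and each sigmoid contributes a factor of at most 0.25, so the gradient shrinks roughly geometrically with depth.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    depth, width = 30, 16
    x = rng.normal(size=width)
    weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width)) for _ in range(depth)]

    # Forward pass, remembering pre-activations for the backward pass.
    activations, pre_acts = [x], []
    for W in weights:
        z = W @ activations[-1]
        pre_acts.append(z)
        activations.append(sigmoid(z))

    # Backward pass: gradient of the sum of the outputs w.r.t. each layer's input.
    grad = np.ones(width)
    norms = []
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        s = sigmoid(z)
        grad = W.T @ (grad * s * (1.0 - s))   # sigmoid'(z) = s * (1 - s) <= 0.25
        norms.append(np.linalg.norm(grad))

    print(norms[0], norms[-1])   # gradient norm at the last vs. first layer: it collapses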
There is the old VC dimension theory of model complexity, but that doesn't cleanly apply to neural networks as far as I have seen.
[1] In case you are unaware, this is the theorem that makes pursuing neural networks sound in the first place. It says that you can always construct a neural network that approximates an arbitrary continuous function to within an arbitrary precision threshold.
[2] https://slideslive.com/38917864/role-of-steins-lemma-in-guar...