Is dropout still empirical, or is there any proof of why it works in the overall model?
I recall reading up on CNNs and playing around with them, and it was interesting to add random dropout, but it was never explained why it works. I think the general thinking is that the network is overfitting, so randomly dropping nodes is required for generalization?
Addressing your second question.
Informally, dropping nodes fights overfitting by creating subsampled architectures, which are essentially thinned-out versions of the network you've designed. Having trained on these subnetworks means you've effectively combined the learning of several different models, and in doing so generalized beyond the capabilities of your original "single" architecture.
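A minimal sketch of this idea (using PyTorch here, which is just one possible framework; the layer sizes are made up): in training mode each forward pass zeroes a random subset of units, so every mini-batch effectively trains a different thinned subnetwork over the same shared weights, while evaluation mode uses the full network deterministically.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is dropped with probability 0.5 during training
    nn.Linear(64, 10),
)

x = torch.randn(1, 20)

model.train()                         # dropout active: a random "thinned" subnet per pass
out_a = model(x)
out_b = model(x)
print(torch.allclose(out_a, out_b))   # usually False: different units were dropped each time

model.eval()                          # dropout disabled: the full network is used
out_c = model(x)
out_d = model(x)
print(torch.allclose(out_c, out_d))   # True: inference is deterministic
```

Because PyTorch uses inverted dropout (activations are rescaled at training time), no extra scaling is needed at evaluation; the eval-mode network acts like an approximate average over the many thinned subnets seen during training.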
My understanding is that it avoids overfitting when data points are highly correlated.
For example, if you use image augmentation to generate additional data, the augmented images will be highly correlated with their parent image, which can lead to overfitting. Using random dropout can mitigate this somewhat.
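As a rough illustration (assuming PyTorch/torchvision and 32x32 RGB inputs, e.g. CIFAR-10-sized images; none of these specifics come from the answer above), a common pattern is to pair an augmentation pipeline, which produces many correlated variants of each parent image, with dropout layers inside the CNN:

```python
import torch.nn as nn
from torchvision import transforms

# Augmentation: each parent image yields many slightly different, highly correlated copies.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

# CNN with dropout to discourage memorising features shared by those near-duplicates.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout2d(p=0.25),          # drops entire feature maps during training
    nn.Flatten(),
    nn.Dropout(p=0.5),             # drops individual units before the classifier
    nn.Linear(16 * 16 * 16, 10),   # assumes 32x32 inputs: 16 channels * 16 * 16 after pooling
)

# In a training loop one would do something like:
# logits = cnn(augment(pil_image).unsqueeze(0))
```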