Conv layers are strictly special cases of FC layers (with respect to expressive power).
> For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).

(per http://cs231n.github.io/convolutional-networks/)
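To make that concrete, here is a minimal NumPy sketch (my own, not from the notes) for a 1-D convolution with stride 1 and no padding: the conv's forward pass equals a matrix multiply with a mostly-zero weight matrix whose nonzero bands all reuse the same kernel entries. The helper name `conv_as_fc_matrix` is just illustrative.

```python
import numpy as np

def conv_as_fc_matrix(kernel, input_len):
    """Build the dense (input_len - k + 1) x input_len matrix equivalent to a 1-D conv."""
    k = len(kernel)
    out_len = input_len - k + 1
    W = np.zeros((out_len, input_len))
    for i in range(out_len):
        W[i, i:i + k] = kernel          # same kernel at every offset (parameter sharing)
    return W

x = np.random.randn(8)
kernel = np.random.randn(3)

conv_out = np.correlate(x, kernel, mode="valid")   # the conv layer's forward pass
fc_out = conv_as_fc_matrix(kernel, len(x)) @ x     # the equivalent FC forward pass

assert np.allclose(conv_out, fc_out)
```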
In general, if you can make assumptions about your problem and the form the solution takes, you can find a good solution with fewer parameters and less data, at the cost of suboptimal performance on problems that violate your assumptions. Conv layers are an example of this.
> The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
It's correct that inference in a convolutional neural network can be reduced to inference in an FC network. However, during training you need to make sure the shared weights stay equal, and FC networks don't enforce that. So you need to train a CNN-embedded-in-an-FC slightly differently from a normal FC, otherwise what you get is no longer a CNN because the weights diverge. Concretely, you need to a) initialize all copies of each kernel entry with the same weight, and b) instead of applying the weight updates directly, average the updates over all offsets and only then apply them to the tied kernel positions.
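Here is a rough 1-D sketch (mine, not from the comment) of that averaging step: keep the FC weight matrix, but after computing its gradient, average the gradient over all offsets that share a kernel entry and scatter the averaged value back, so the tied weights stay identical. The helper `tie_gradient` is hypothetical.

```python
import numpy as np

def tie_gradient(grad_W, kernel_len):
    """Average grad_W over the diagonals corresponding to shared kernel entries."""
    out_len, in_len = grad_W.shape
    grad_kernel = np.zeros(kernel_len)
    for j in range(kernel_len):
        # entry j of the kernel lives at W[i, i + j] for every output position i
        grad_kernel[j] = np.mean([grad_W[i, i + j] for i in range(out_len)])
    # scatter the averaged gradient back into the FC layout,
    # leaving the off-band (non-connected) entries at zero
    tied = np.zeros_like(grad_W)
    for i in range(out_len):
        tied[i, i:i + kernel_len] = grad_kernel
    return tied

# usage: start from a kernel replicated at every offset (identical init),
# then update with the tied gradient so the layer stays convolutional
kernel = np.random.randn(3)
W = np.vstack([np.pad(kernel, (i, 8 - 3 - i)) for i in range(6)])   # 6 x 8, rows share the kernel
grad_W = np.random.randn(*W.shape)        # stand-in for a backprop gradient
W -= 0.1 * tie_gradient(grad_W, kernel_len=3)
```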
Absolutely: if you wanted an FC layer that satisfies the Conv layer constraints, you would need to perform gradient descent subject to those constraints; plain gradient descent won't do that unless the actual optimum happens to be of convolutional form. That's why I said "(with respect to expressive power)" :)