For one, "depth-three" implies three layers, and in the standard terminology of the field what they really mean is "depth one".
And another major red flag:
> We give a polynomial-time algorithm for learning neural networks with one hidden layer of sigmoids feeding into any smooth, monotone activation function (e.g., sigmoid or ReLU)

How is ReLU smooth?
They cite https://arxiv.org/abs/1610.09887 for their definition of network depth, under which e.g. a ReLU network of depth 2 has the form linear2(ReLU(linear1(input))). In other words, depth is the number of linear layers.
The "depth-three" model in this paper is a bit strange in that their second layer has only one output, so the third linear layer doesn't have any effect. I would have called this "depth two"; but it is internally consistent with their definition of depth.
> How is ReLU smooth?
It is 1-Lipschitz, which is smooth enough for them.
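For what it's worth, a quick gloss of my own (not a claim taken from the paper): ReLU is 1-Lipschitz because

$$\lvert \mathrm{ReLU}(a) - \mathrm{ReLU}(b) \rvert = \lvert \max(0, a) - \max(0, b) \rvert \le \lvert a - b \rvert \quad \text{for all } a, b \in \mathbb{R},$$

but it is not differentiable at $0$, so it is "smooth" only in this Lipschitz sense, not in the usual $C^1$ sense.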
> The "depth-three" model in this paper is a bit strange in that their second layer has only one output, so the third linear layer doesn't have any effect. I would have called this "depth two"; but it is internally consistent with their definition of depth.
No, it does have an effect: It takes a linear combination of the outputs of the previous layer, and then applies a non-linearity $\sigma'$. If $\sigma'$ is the logistic function, then the output of the last layer is a probability.
No, the last layer is just a linear layer; there's no non-linearity. That's what's strange about their definition. The depth-three network only applies a non-linearity twice, which would conventionally be labeled as depth two.
Yes, it is a classical NN with a single output node. I'm not disputing that; I just think their calculation of depth is strange. The network only applies the sigmoid function twice, and would ordinarily be regarded as having a depth of two. The third linear layer is fixed to multiplying by 1, which is what I meant by "has no effect". (Did you miss that I was talking about the third layer?)
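Spelling out my reading of the thread so far (my notation, not the paper's): the "depth-three" model computes

$$f(x) = w_3 \cdot \sigma'\!\left(w_2^\top \sigma(W_1 x)\right), \qquad w_3 = 1,$$

so the third linear map is just the identity on a scalar, and $f$ is the same function as the "depth-two" network $\sigma'(w_2^\top \sigma(W_1 x))$; the two depth counts differ only in whether that trivial map is included.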
For one, "depth-three" implies three layers, and in the standard terminology of the field what they really mean is "depth one".
And another major red flag:
We give a polynomial-time algorithm for learning neural networks with one hidden layer of sigmoids feeding into any smooth, monotone activation function (e.g., sigmoid or ReLU)
How is ReLU smooth?