
This paper seems somewhat suspect.

For one, "depth-three" implies three layers, but in the standard terminology of the field what they really mean is "depth one": a single hidden layer.

And another major red flag:

> We give a polynomial-time algorithm for learning neural networks with one hidden layer of sigmoids feeding into any smooth, monotone activation function (e.g., sigmoid or ReLU)

How is ReLU smooth?



They cite https://arxiv.org/abs/1610.09887 for their definition of network depth, which defines it such that e.g. a ReLU network of depth 2 has the form linear2(ReLU(linear1(input))). That is, depth is the number of linear layers.
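
To make that counting convention concrete, here's a minimal numpy sketch (layer sizes and weights are arbitrary, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)  # linear1: R^3 -> R^4
    W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)  # linear2: R^4 -> R^1

    def relu(z):
        return np.maximum(z, 0.0)

    def depth2_relu_net(x):
        # Depth counts the linear layers under the cited definition,
        # so one hidden ReLU layer between two linear maps is "depth 2".
        return W2 @ relu(W1 @ x + b1) + b2

    print(depth2_relu_net(np.ones(3)))  # 1-element output array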

The "depth-three" model in this paper is a bit strange in that their second layer has only one output, so the third linear layer doesn't have any effect. I would have called this "depth two"; but it is internally consistent with their definition of depth.

> How is ReLU smooth?

It is 1-Lipschitz, which is smooth enough for them.
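
Concretely, that's the bound

    $|\mathrm{ReLU}(x) - \mathrm{ReLU}(y)| \le 1 \cdot |x - y|$ for all $x, y \in \mathbb{R}$,

which holds even though ReLU isn't differentiable at 0, i.e. it isn't smooth in the classical sense.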


> The "depth-three" model in this paper is a bit strange in that their second layer has only one output, so the third linear layer doesn't have any effect. I would have called this "depth two"; but it is internally consistent with their definition of depth.

No, it does have an effect: It takes a linear combination of the outputs of the previous layer, and then applies a non-linearity $\sigma'$. If $\sigma'$ is the logistic function, then the output of the last layer is a probability.
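
In symbols: $y = \sigma'\left(\sum_i w_i h_i\right)$, where $h$ is the previous layer's output; with $\sigma'$ the logistic function, $y \in (0, 1)$.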


No, the last layer is just a linear layer; there's no non-linearity. That's what's strange about their definition. The depth-three network only applies a non-linearity twice, which would conventionally be labeled depth two.


You're wrong. See section 5.1. I've drawn a graph of it that shows it to be a classical NN with a single output node.


Yes, it is a classical NN with a single output node. I'm not disputing that; I just think their calculation of depth is strange. The network only applies the sigmoid function twice, and would ordinarily be regarded as having a depth of two. The third linear layer is fixed to multiplying by 1, which is what I meant by "has no effect". (Did you miss that I was talking about the third layer?)
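
Concretely, here's how I read the section 5.1 network (shapes and weights here are hypothetical, just a sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)  # linear1 + hidden sigmoids
    w2, b2 = rng.standard_normal(4), rng.standard_normal()        # linear2 -> single output

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def paper_depth3_net(x):
        h = sigmoid(W1 @ x + b1)  # first non-linearity
        y = sigmoid(w2 @ h + b2)  # second non-linearity, single output
        return 1.0 * y            # "third linear layer": multiply by 1, no effect

    print(paper_depth3_net(np.ones(3)))

Only two activations are ever applied, which is why I'd call this depth two; the trailing multiply-by-1 is what pushes the count to three under their definition.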


Here's a diagram [1] of a one-hidden-layer NN. It has 2 activation functions. Their NN is of the same type.

[1] - https://raw.githubusercontent.com/qingkaikong/blog/master/40...


The question is, how deep is that network?



