
Before I read this, is this yet another paper where physicists believe that modern NNs are trained by gradient descent and consist only of linear FNNs (or some other simplification), and can therefore be explained by <insert random grad-level physics method here>? Because we've already had hundreds of those.

Or is this more substantial?



What do you mean by "modern NNs" that differ from basic FNNs?

How do you train a modern NN if not through backpropagation?


FNNs are easy to analyze for sequential data because they assume no relationship between the elements. Hence, these NNs are a frequent target of simplified analyses. However, they also never led to the exciting models we now call AI.

Instead, real models are an eclectic mix of attention or other sequential mixers, gates, FFNs, norms, and positional tomfoolery.
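
To make that mix concrete, here's a toy PyTorch sketch (my own illustration, not anything from the paper): a plain FNN applies the same map to each element with no interaction between them, while even a single transformer-style block mixes tokens with attention and wraps an FFN in norms and residuals. Positional handling and gating are left out to keep it short.

    import torch
    import torch.nn as nn

    class PlainFNN(nn.Module):
        # What simplified analyses usually study: stacked linear layers,
        # applied to each element independently.
        def __init__(self, d):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        def forward(self, x):          # x: (batch, seq, d); tokens never interact
            return self.net(x)

    class TransformerishBlock(nn.Module):
        # One block of the "eclectic mix": attention (token mixing) + FFN + norms.
        def __init__(self, d, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        def forward(self, x):          # x: (batch, seq, d); tokens do interact
            a, _ = self.attn(x, x, x)
            x = self.norm1(x + a)      # residual + norm
            return self.norm2(x + self.ffn(x))

    x = torch.randn(2, 5, 32)          # (batch, seq, d)
    print(PlainFNN(32)(x).shape, TransformerishBlock(32)(x).shape)

The token mixing is exactly the part the simplified analyses tend to drop.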

In other words, everything that makes AI models great is what these analyses usually skip. Of course, they still wildly claim generalized insights about how AI really works.

There’s a dozen papers like that every few months.


> believe that modern NNs are trained by gradient descent

Are they not?

Genuine question. I'm very new to machine learning and neural networks.


>> > believe that modern NNs are trained by gradient descent

>> Are they not?

While technically true, that answer offers almost zero insight into how they work. Maybe another way to say it: during inference there is no gradient descent happening - the network is already trained. Even setting aside that "gradient descent" might be an overgeneralization of the training process, it tells you nothing about how ChatGPT plays chess or carries a conversation.

Telling someone what methodology was used to create a thing says nothing about how it works. Just like saying our own brain is "a product of evolution" doesn't tell you how it works. Nor does "you are a product of your own life experience" put psychologists out of business. "It's just gradient descent" is a great way to trivialize something that nobody really seems to understand yet.
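
A toy sketch of that distinction (illustrative PyTorch with made-up data, nothing specific to any real model): gradient descent only shows up in the training loop; at inference the trained weights just run forward, and that's where all the behavior people actually ask about lives.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    # Training: gradient descent on a loss, using random toy data.
    x, y = torch.randn(64, 8), torch.randn(64, 1)
    for _ in range(100):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # compute gradients
        opt.step()        # take one descent step

    # Inference: no gradients, no descent - the weights are fixed.
    with torch.no_grad():
        prediction = model(torch.randn(1, 8))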


Their proposal is actually that gradient descent is a poor representation of the NN's learning behavior, and that this NFM instead tracks the model's "learning" better than simply its loss over training does.
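
If "NFM" here means a per-layer neural feature matrix of the form W^T W (that reading is my assumption - the comment doesn't define it), tracking it next to the loss during training would look roughly like:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x, y = torch.randn(64, 8), torch.randn(64, 1)

    def first_layer_nfm(m):
        # Assumed definition: W^T W for the first linear layer's weights.
        W = m[0].weight.detach()
        return W.T @ W                 # (8, 8) feature matrix

    for step in range(200):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 50 == 0:
            print(step, loss.item(), torch.linalg.norm(first_layer_nfm(model)).item())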


Why is applying a physics lens to NNs a bad thing?


It isn't, but the kinds of papers being referred to make so many assumptions that their conclusions lack external validity for actual deep learning. In other words, they typically don't say anything about state-of-the-art deep networks.


It's yet another one... it just adds to the pile.



