Would be interesting to explore whether directly evaluating ∇²f(x) for higher-order SGD methods is now feasible for smaller DNNs, and whether this leads to faster convergence during training. Methods like Newton or Gauss-Newton have generally been considered intractable for DNNs. Also curious whether the descent direction prescribed by quasi-Newton methods empirically comes close to the true second-order step, and whether these approximations pay off in wall-clock time.
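
For concreteness, here is a minimal JAX sketch (not from the paper; the toy network, damping term, and names are my own illustration) of what a direct Newton step with an exact Hessian looks like on a tiny net:

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        # Toy 2-4-1 MLP with all parameters flattened into a single vector w.
        W1 = w[:8].reshape(4, 2)
        W2 = w[8:12].reshape(1, 4)
        pred = W2 @ jnp.tanh(W1 @ x)
        return jnp.mean((pred - y) ** 2)

    def newton_step(w, x, y, damping=1e-3):
        g = jax.grad(loss)(w, x, y)
        H = jax.hessian(loss)(w, x, y)          # exact (12, 12) Hessian
        H = H + damping * jnp.eye(w.shape[0])   # damping so the solve is well-posed
        return w - jnp.linalg.solve(H, g)

    key = jax.random.PRNGKey(0)
    w = 0.1 * jax.random.normal(key, (12,))
    x, y = jnp.array([1.0, -0.5]), jnp.array([0.3])
    w_new = newton_step(w, x, y)

The point of the sketch is the scaling: the Hessian is n x n in the parameter count, so even if evaluating it gets much cheaper, storing and solving with it still limits this to small networks.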



What kind of work are you referring to when you say higher-order SGD may _now_ be feasible for deep learning? I only find results that try to approximate second-order information.


Not sure what you mean. The paper above claims 1000x speedups for computing second-order derivatives. I have not tested their claims, but I was speculating that such an improvement, if true, would make computing Hessians for small networks feasible. That is what I am referring to.
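
For contrast, the usual way to get second-order information without ever materializing the Hessian is a Hessian-vector product; a rough JAX sketch (illustrative only, not from the paper):

    import jax
    import jax.numpy as jnp

    def loss(w):
        # Stand-in scalar loss; in practice this would close over a data batch.
        return jnp.sum(jnp.tanh(w) ** 2)

    def hvp(f, w, v):
        # Forward-over-reverse Hessian-vector product: roughly the cost of one
        # extra gradient, and no (n, n) matrix is ever materialized.
        return jax.jvp(jax.grad(f), (w,), (v,))[1]

    w = jnp.linspace(-1.0, 1.0, 10_000)   # 10k params: the full Hessian would be 10k x 10k
    v = jnp.ones_like(w)
    print(hvp(loss, w, v).shape)          # (10000,), same size as the gradient

This is why most published methods settle for approximate second-order information: the product H v is cheap at any scale, while the full Hessian only becomes practical when the parameter count is small.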



