Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The paper speculates that it is analogous to gradient descent and empirically confirms it is similar in behavior, but it is not a rigorous proof of any kind.

The momentum experiment they made also does not seem related. E.g. it just adds past values to V, which extends the effective context length.



> but it is not a rigorous proof of any kind.

Such is the nature of early theories.




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: