The paper speculates that it is analogous to gradient descent and shows empirically that the two behave similarly, but that is not a rigorous proof of any kind.
The momentum experiment they ran also does not seem related: it just adds past values to V, which extends the effective context length.
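To make the momentum point concrete, here is a rough sketch. It assumes the momentum variant adds an exponentially decayed sum of earlier value vectors to the ordinary attention output; that form, the function names, and the decay factor `eta` are my assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def causal_attention(Q, K, V, t):
    """Ordinary softmax attention output at position t over positions 0..t."""
    scores = K[: t + 1] @ Q[t] / np.sqrt(Q.shape[1])
    return softmax(scores) @ V[: t + 1]

def momentum_term(V, t, eta):
    """Assumed momentum term: sum over i < t of eta**(t - i) * V[i]."""
    weights = eta ** np.arange(t, 0, -1)  # eta**t for i = 0, down to eta**1 for i = t - 1
    return weights @ V[:t]

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
t, eta = 5, 0.5

out_plain = causal_attention(Q, K, V, t)
out_momentum = out_plain + momentum_term(V, t, eta)
```

Note that `momentum_term` takes neither Q nor K. Under this assumed form the extra term is a fixed, query-independent weighting of earlier value vectors, so it mixes in more past values rather than doing anything that looks like an optimizer's momentum update.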