What kind of work are you referring to when you say higher-order SGD may _now_ be feasible for deep learning?
I can only find results that try to approximate second-order information.
Not sure what you mean. The paper above claims 1000x speedups for computing second-order derivatives. I have not tested their claims, but I was speculating that such an improvement, if true, would make computing Hessians for small networks feasible. This is what I am referring to.
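For concreteness, here is a minimal sketch (my own illustration, not the paper's method; the toy network and the choice of PyTorch are mine) of what computing an exact Hessian for a small network looks like with PyTorch's built-in helper. The dense (n_params x n_params) matrix is only tractable because the parameter count is tiny:

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
X, y = torch.randn(64, 10), torch.randn(64, 1)

# A tiny MLP with its weights packed into one flat parameter vector,
# so the Hessian of the loss is a single dense (176 x 176) matrix.
n_params = 10 * 16 + 16 * 1

def loss(theta):
    W1 = theta[:160].reshape(10, 16)
    W2 = theta[160:].reshape(16, 1)
    pred = torch.tanh(X @ W1) @ W2
    return ((pred - y) ** 2).mean()

theta = torch.randn(n_params)
H = hessian(loss, theta)  # exact Hessian; feasible only for small parameter counts
print(H.shape)            # torch.Size([176, 176])
```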
Are you referring to Ricci notation when you say it uses a different representation of the tensors?
Do you also plan to add non-differentiable functions like ReLU?
Yeah, I am referring to Ricci notation. TF and PyTorch don't use it. The new version already has ReLUs. This version is targeted at standard formulas (it has abs as a non-differentiable function); the next version works for deep learning. I never wanted to work on this, but there is just no way around it.
I wonder why they don't do it. In the tool I had running, we handled this by removing any transpose of a symmetric matrix (after propagating the transpose down to the leaves). Together with the simplification rule x + x -> 2*x for any x, you get the expected result. I can only guess why they didn't include it in the online matrix calculus tool; it was published after I left the group.
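A minimal sketch (my own reconstruction in Python, not the tool's actual code) of the two rewrite rules mentioned above; applied to the gradient of x'Ax, i.e. A*x + A'*x, they yield 2*A*x when A is symmetric:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str
    symmetric: bool = False

@dataclass(frozen=True)
class Transpose:
    arg: object

@dataclass(frozen=True)
class Mul:
    left: object
    right: object

@dataclass(frozen=True)
class Add:
    left: object
    right: object

@dataclass(frozen=True)
class Scale:
    factor: float
    arg: object

def simplify(expr):
    if isinstance(expr, Transpose):
        inner = simplify(expr.arg)
        # rule 1: the transpose of a symmetric leaf is the leaf itself
        if isinstance(inner, Var) and inner.symmetric:
            return inner
        return Transpose(inner)
    if isinstance(expr, Mul):
        return Mul(simplify(expr.left), simplify(expr.right))
    if isinstance(expr, Add):
        left, right = simplify(expr.left), simplify(expr.right)
        # rule 2: x + x -> 2*x for any x
        if left == right:
            return Scale(2.0, left)
        return Add(left, right)
    return expr

A, x = Var("A", symmetric=True), Var("x")
grad = Add(Mul(A, x), Mul(Transpose(A), x))   # A*x + A'*x
print(simplify(grad))                         # Scale(2.0, Mul(A, x)), i.e. 2*A*x
```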
It is not in the current online tool, but we will add it again soon. It is still handled the way you describe (passing the transpose down to the leaves, plus the simplification rules). Btw: how are you doing, and where have you been? It would be nice to also add a link to you and your current site. GENO is also on its way.
I'm good. Looking at things from the data angle now. But unfortunately no public page. You can link to the old one, if you want to. Have you compared against TensorFlow XLA?
I did not compare to TensorFlow XLA, but I did compare to TensorFlow. Of course, it depends on the problem. For instance, for evaluating the Hessian of x'Ax, MC is a factor of 100 faster than TF. But MC and TF have different objectives: TF focuses more on scalar-valued functions as needed for deep learning, MC on the general case, especially vector- and matrix-valued functions as needed for dealing with constraints. But I will give TF XLA a try.
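To make the x'Ax example concrete (the choice of PyTorch here is mine; this is only an illustration, not the benchmark above): the closed-form Hessian that a matrix-calculus tool can emit directly is A + A', whereas an autodiff framework builds the same matrix by differentiating the graph twice:

```python
import torch
from torch.autograd.functional import hessian

n = 500
A = torch.randn(n, n)
x = torch.randn(n)

def f(x):
    return x @ (A @ x)            # x'Ax

H_autodiff = hessian(f, x)        # built by double differentiation
H_closed_form = A + A.T           # what the symbolic result looks like
print(torch.allclose(H_autodiff, H_closed_form, atol=1e-4))  # True
```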
I think XLA is trying to reduce the overhead introduced by backprop, meaning that when you optimize the computational graph, you might end up with an efficient calculation of the gradient (closer to the one you get with MC).
Regarding non-scalar-valued functions: don't you reduce a constrained problem to a series of unconstrained problems (via a penalty (or even augmented Lagrangian) or barrier method)? Then you only need the gradient of a scalar-valued function to solve constrained problems. I imagine you can use the gradients of the constraints for solving the KKT conditions directly, but this seems useful only if the system is not too big. But for sure it opens new possibilities.
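A minimal sketch of the reduction described above (the toy problem, minimizing ||x||^2 subject to sum(x) = 1, is my own choice): a quadratic penalty method turns the constrained problem into a sequence of unconstrained ones, so only a scalar-valued objective is ever needed:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return float(x @ x)           # objective: ||x||^2

def c(x):
    return x.sum() - 1.0          # equality constraint: sum(x) = 1

def penalty(x, mu):
    return f(x) + 0.5 * mu * c(x) ** 2

x, mu = np.zeros(5), 1.0
for _ in range(6):
    # inner unconstrained solve; only the scalar penalty function is needed
    x = minimize(penalty, x, args=(mu,), method="BFGS").x
    mu *= 10.0                    # tighten the penalty each round
print(x)                          # approaches [0.2, 0.2, 0.2, 0.2, 0.2]
```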
XLA is good for the GPU only. On the CPU, MC is about 20-50% faster than TF on scalar-valued functions. For the GPU I don't know yet. But it is true that for the augmented Lagrangian you only need scalar-valued functions, and this is really efficient on large-scale problems. On small-scale problems (up to 5000 variables), however, you really need interior-point methods that solve the KKT conditions directly, as you point out. This is sometimes really needed.
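For concreteness (my notation, not from the thread): for min_x f(x) subject to c(x) = 0, the system such methods work on is the KKT conditions

$$
\nabla f(x) + J_c(x)^{\top} \lambda = 0, \qquad c(x) = 0,
$$

and a Newton step on this system needs the Jacobian J_c of the vector-valued constraint function, which is exactly where non-scalar-valued derivatives come in.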
However, when you look at the algorithms, TF and MC do not differ too much. In the end, there is a limited number of ways of computing derivatives, and most of them are basically the same (or boil down to two versions). Some of the claims made in the early autodiff literature about the problems of symbolic differentiation are just not true. In the end, the two are fairly similar. But let's see how XLA performs.