
Looks like a good course. I think it would benefit from a module on implementing some basic linear system solvers, like gradient/steepest descent, or even GMRES/MINRES. The amount of knowledge I gained from trying to implement these was remarkable.
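For context, a minimal sketch of the kind of steepest-descent solver I mean for Ax = b (assuming A is symmetric positive definite and float arrays; purely illustrative):

    import numpy as np

    def steepest_descent(A, b, x0=None, tol=1e-8, max_iter=1000):
        # Solve A x = b for symmetric positive definite A by steepest descent.
        x = np.zeros_like(b) if x0 is None else x0.copy()
        r = b - A @ x                   # residual = negative gradient of 0.5*x'Ax - b'x
        for _ in range(max_iter):
            if np.linalg.norm(r) < tol:
                break
            Ar = A @ r
            alpha = (r @ r) / (r @ Ar)  # exact line search along the residual direction
            x += alpha * r
            r -= alpha * Ar
        return x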


Gradient descent is covered. We even show a PyTorch implementation :)
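(Not the course's actual code, but for anyone skimming, a bare-bones gradient descent loop in PyTorch looks roughly like this toy least-squares fit:)

    import torch

    # Toy example: fit w to minimize ||X w - y||^2 with plain gradient descent.
    X = torch.randn(100, 3)
    y = torch.randn(100)
    w = torch.zeros(3, requires_grad=True)

    lr = 0.1
    for _ in range(200):
        loss = ((X @ w - y) ** 2).mean()
        loss.backward()              # autograd fills w.grad
        with torch.no_grad():
            w -= lr * w.grad         # gradient descent step
            w.grad.zero_()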


Great! It's 2 AM here and I totally missed it.

I will definitely give this course a try and spread the word. Thanks


GMRES/Krylov spaces seem like they ought to have more of a place in deep learning. But maybe that's the mathematical part of me having a hammer and wanting to see everything as a nail.

One possibility that I haven't seen: if you have a neural network graph with a small subgraph that is not 100% explicit (i.e. nodes with cyclic dependencies that have to be solved collectively via a Newton-type method), the gradient across this subgraph can be obtained at the converged state by solving a standard matrix-vector linear inverse problem applied to the gradients, without needing the AD engine to follow every operation through the iterates of the nonlinear Newton solver.

For sparsity and conditioning reasons, in my current work we do something like this for graph-based specification of engineering systems using GMRES to find the gradients across these subsystems. We're not doing deep learning, but the underlying machinery is basically the same.
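To make that concrete, here is a rough sketch of the adjoint trick (hypothetical names, not our actual code): at the converged state of F(z, x) = 0, the implicit function theorem gives dL/dx = -(dF/dx)^T lam where (dF/dz)^T lam = dL/dz, and that single linear solve can be done with GMRES using only transpose matvecs.

    import numpy as np
    from scipy.sparse.linalg import gmres, LinearOperator

    def grad_through_implicit_block(dF_dz, dF_dx, dL_dz):
        # dF_dz, dF_dx: Jacobians at the converged state; dL_dz: downstream gradient.
        n = dF_dz.shape[0]
        AT = LinearOperator((n, n), matvec=lambda v: dF_dz.T @ v)
        lam, info = gmres(AT, dL_dz)      # adjoint solve: (dF/dz)^T lam = dL/dz
        assert info == 0, "GMRES did not converge"
        return -dF_dx.T @ lam             # chain rule; no AD through the Newton iterates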


Krylov spaces often come up in derivations for optimization methods in DL, and have been studied as an explicit technique in their own right. E.g. see https://arxiv.org/abs/1111.4259

However, more recent theoretical and empirical results show that these kinds of approaches don't deal well with the huge saddle point problem in DL loss functions. But momentum-based techniques work great.


Interesting, I'll have to look deeper at what has been done in this space. As I understand it, saddle points are handled by SGD-based methods because they are a bit more "noisy" in their steps through the problem space. In short, are the Krylov methods (or perhaps other approaches) a bit too precise in their attraction to critical points?


Exactly. Saddle points are like a magnet for them!


Absolutely not. Krylov methods are one class of methods for solving a linear system and can absolutely be used with stochastic optimization methods. Generally speaking, most optimization algorithms require the solution of some linear system, and we can use a Krylov method, or a direct solve, or we can integrate the direct solve with the Krylov method as a preconditioner. However, this has nothing to do with the stochastic piece of the algorithm. If we want to use a random projection or some other technique to compute the gradient, we can absolutely do so while using a Krylov method.

Now, if we're just doing a stochastic steepest descent algorithm, there are not really any linear systems to solve, so this doesn't come up. If we want to use a stochastic Newton algorithm, it does. It comes up here specifically because forming the full Hessian may be too expensive in terms of computation or memory, so if we're smart and use a Krylov method that only requires Hessian-vector products, we can effectively use second-order information without taking the penalties in computation and memory. Yes, there is a loss here since we only use a piece of the spectrum of the Hessian, and this is where the art and science of preconditioning come into play to ensure that we use this effectively. However, we can use stochastic algorithms here. In fact, the entire field of stochastic PDE-constrained optimization deals very precisely with these tools and does so in a surprisingly effective manner.
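To illustrate the Hessian-vector-product point, a toy Newton-CG direction in PyTorch (assuming a single flat parameter tensor and an already-built loss; a sketch, not any particular library's API):

    import torch

    def newton_cg_direction(loss, params, max_iter=10, tol=1e-6):
        # Approximately solve H p = -g by conjugate gradient, using only
        # Hessian-vector products; the full Hessian is never formed.
        g = torch.autograd.grad(loss, params, create_graph=True)[0]

        def hvp(v):
            # One extra backward pass gives H @ v (Pearlmutter trick).
            return torch.autograd.grad(g, params, grad_outputs=v, retain_graph=True)[0]

        p = torch.zeros_like(g)
        r = -g.detach().clone()
        d = r.clone()
        rs_old = r.dot(r)
        for _ in range(max_iter):
            Hd = hvp(d)
            alpha = rs_old / d.dot(Hd)
            p = p + alpha * d
            r = r - alpha * Hd
            rs_new = r.dot(r)
            if rs_new.sqrt() < tol:
                break
            d = r + (rs_new / rs_old) * d
            rs_old = rs_new
        return p  # approximate Newton direction, -H^{-1} g

The same structure works when the gradient is a stochastic estimate; nothing in the Krylov part cares where g came from.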

The advantage that Krylov methods have over a direct solve is that we retain the ability to use the direct solve if we want in the context of a preconditioner. However, different Krylov methods have different properties that we can use to our advantage. For example, conjugate gradient works on a minimization principle based on a quadratic form. This integrates well with trust-region methods that require us to minimize a quadratic model. GMRES minimizes the residual of the linear system, which gives us a monotonically decreasing residual at each iteration, which can be important for different algorithms. Which Krylov method to use in any particular situation depends on the properties desired and the systems involved. And, I can't stress enough, we don't lose out on direct solves. Direct solves can be fantastic preconditioners, and if we really can compute an inverse exactly, our Krylov solver converges in one iteration. As such, we lose a little bit of computation and memory, but not that much.
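As a small illustration of the "direct solve as preconditioner" point (toy matrices, SciPy-based sketch): factor an approximation A0 of A directly and hand that factorization to GMRES as the preconditioner.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import gmres, splu, LinearOperator

    n = 200
    A0 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
    A = A0 + 0.01 * sp.random(n, n, density=0.01, format="csc")  # A0 approximates A
    b = np.ones(n)

    lu = splu(A0)                                 # direct factorization of the approximation
    M = LinearOperator((n, n), matvec=lu.solve)   # wrapped as a preconditioner

    x, info = gmres(A, b, M=M)
    print("converged" if info == 0 else f"info={info}")

In the limit where A0 is A itself, the preconditioned iteration converges essentially immediately, which is the "we don't lose the direct solve" point.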

Anyway, sorry for coming on a little bit hard, but I wanted to try and nip some amount of confusion in the bud quickly. Stochastic methods are naturally and fully integrated into inexact optimization schemes that rely on Krylov solves. They are the standard tool for many fields that use optimization. They may not be widely used in ML, but that has little to do with their computational efficacy.


Whilst what you say is accurate, it has nothing to do with the comment you are replying to, which was specifically about the shape of the loss functions in deep learning and recent research on how it impacts optimization of these functions.


The grandparent comment asked, "In short, are the Krylov methods (or perhaps other approaches) a bit too precise in their attraction to critical points?" to which you replied, "Exactly. Saddle points are like a magnet for them!" There's little room for ambiguity here and this is categorically false. Your linear solver has nothing to do with saddle points, and Krylov methods are not drawn to them.


I can also name a few fields that are dominated by women, or Latinos, Jews, etc.

Is this really a problem?

I think that some of these lines of thought, while valid points for justifying an individual problem, tend to fall down a slippery slope toward believing that a true (ideal) society is all greyed out, with everyone in the same mold: genderless, opinionless, etc.


There is nothing inherently wrong with a lack of diversity, but it is almost always a symptom of an issue with society or systems limiting people. For example, I would argue that primary school teachers being almost entirely women is indeed an issue with systemic and social sexism.

You have completely missed my point. The point is that everyone should have the opportunity if they want to do something and are capable of it, not some kind of homogenization. Women are currently actively pushed away from the industry. If women don't want to be in tech, that's fine; the issue is that women who do want to are pushed away, discriminated against, or never get the opportunity to learn that they are interested in it.

There is no attempt to make everyone the same - the attempt is to try and allow everyone to embrace their own differences - to give everyone the same opportunities.

