I think XLA is trying to reduce the overhead introduced by backprop: when you optimize the computational graph, you may end up with a more efficient calculation of the gradient (closer to the calculation you get with MC).
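A minimal sketch of what I mean, assuming TF 2.x (jit_compile=True is the flag that requests XLA compilation; the objective is just an arbitrary scalar function I made up):

    import tensorflow as tf

    @tf.function(jit_compile=True)  # ask XLA to compile forward + backward pass together
    def grad_f(x):
        with tf.GradientTape() as tape:
            tape.watch(x)                                    # x is a plain tensor, not a Variable
            y = tf.reduce_sum(tf.sin(x) * tf.exp(-x ** 2))   # arbitrary scalar-valued objective
        return tape.gradient(y, x)

    g = grad_f(tf.linspace(0.0, 1.0, 1000))

The hope is that XLA fuses and simplifies the backward graph instead of executing it op by op.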
Regarding non-scalar-valued functions: don't you reduce a constrained problem to a series of unconstrained problems (via a penalty (or even augmented Lagrangian) or barrier method)? Then you only need the gradient of a scalar-valued function to solve constrained problems. I imagine you can use the gradients of the constraints to solve the KKT conditions directly, but that seems useful only if the system is not too big. But for sure it opens new possibilities.
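For concreteness, the reduction I have in mind is the textbook augmented Lagrangian for equality constraints (nothing specific to TF or MC):

\[
\min_x f(x) \ \text{s.t.}\ c(x) = 0
\quad\longrightarrow\quad
\min_x \; L_A(x; \lambda, \mu) = f(x) + \lambda^\top c(x) + \frac{\mu}{2}\,\lVert c(x) \rVert^2 ,
\]

where each unconstrained subproblem only needs \(\nabla_x L_A\), the gradient of a scalar-valued function, and \(\lambda, \mu\) are updated between subproblems.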
XLA is good for the GPU only. On the CPU, MC is about 20-50% faster than TF on scalar-valued functions; for the GPU I don't know yet. But it is true that for the augmented Lagrangian method you only need scalar-valued functions, and that is really efficient on large-scale problems. On small-scale problems (up to about 5000 variables), though, you really need interior-point methods that solve the KKT conditions directly, as you point out; sometimes that is really needed.
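To spell out why those methods need more than a scalar gradient: the Newton step on the KKT conditions (equality-constrained case for brevity, standard form with Lagrangian \(L(x,\lambda) = f(x) + \lambda^\top c(x)\)) is

\[
\begin{pmatrix} \nabla_{xx}^2 L(x,\lambda) & J_c(x)^\top \\ J_c(x) & 0 \end{pmatrix}
\begin{pmatrix} \Delta x \\ \Delta \lambda \end{pmatrix}
= -\begin{pmatrix} \nabla f(x) + J_c(x)^\top \lambda \\ c(x) \end{pmatrix},
\]

so you need the full Jacobian \(J_c\) of the vector-valued constraint function (and second derivatives of the Lagrangian), not just the gradient of a scalar.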
However, when you look at the algorithms, TF and MC do not differ that much. In the end there is only a limited number of ways of computing derivatives, and most of them are basically the same (or boil down to two versions). Some of the claims made in the early autodiff literature about the problems of symbolic differentiation are just not true; the approaches are fairly similar. But let's see how XLA performs.
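To make "two versions" concrete, here is a toy forward-mode sketch via dual numbers (my own class names, only + and * overloaded, purely illustrative); reverse mode is the same chain rule evaluated in the opposite order, which is what backprop in TF (and MC) implements:

    class Dual:
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot          # value and tangent (derivative)

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.dot + other.dot)
        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val * other.val,
                        self.val * other.dot + self.dot * other.val)  # product rule
        __rmul__ = __mul__

    def f(x):
        return 3 * x * x + 2 * x + 1               # f'(x) = 6x + 2

    print(f(Dual(2.0, 1.0)).dot)                   # prints 14.0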