XLA only really helps on the GPU. On the CPU, MC is about 20-50% faster than TF on scalar-valued functions; for the GPU I don't know yet. But it is true that for an augmented Lagrangian method you only need scalar-valued functions, which makes it very efficient on large-scale problems (see the sketch below). On small-scale problems (up to about 5000 variables), though, you really do need interior point methods that solve the KKT conditions directly, as you point out.
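To make the "scalar-valued only" point concrete, here is a minimal sketch of an augmented Lagrangian loop in JAX (chosen only because it compiles through XLA). The objective f, constraint c, step sizes, and penalty value are all made up for illustration; the point is just that every derivative needed is the gradient of one scalar, so a single reverse-mode pass per iteration suffices:

```python
import jax
import jax.numpy as jnp

def f(x):            # made-up scalar objective
    return jnp.sum((x - 1.0) ** 2)

def c(x):            # made-up equality constraint, c(x) = 0
    return jnp.array([jnp.sum(x) - 1.0])

def aug_lagrangian(x, lam, mu):
    # L_A(x, lam, mu) = f(x) - lam^T c(x) + (mu/2) ||c(x)||^2  -- scalar-valued
    cx = c(x)
    return f(x) - lam @ cx + 0.5 * mu * cx @ cx

# One scalar output => one cheap reverse-mode gradient, no Jacobians needed.
grad_L = jax.jit(jax.grad(aug_lagrangian))

x, lam, mu = jnp.zeros(3), jnp.zeros(1), 10.0
for _ in range(50):
    for _ in range(100):                 # inner loop: gradient descent on L_A
        x = x - 0.01 * grad_L(x, lam, mu)
    lam = lam - mu * c(x)                # first-order multiplier update
print(x)                                 # -> approx [1/3, 1/3, 1/3]
```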
However, when you look at the algorithms, TF and MC do not differ that much. There is only a limited number of ways to compute derivatives, and most of them are essentially the same (or boil down to two versions: forward and reverse mode). Some of the claims made in the early autodiff literature about the problems with symbolic differentiation are simply not true; in the end the approaches are fairly similar. But let's see how XLA performs.
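To make the "two versions" concrete, here is a small JAX sketch (the test function g and the evaluation point are made up for illustration) showing that forward mode (JVPs, one pass per input direction) and reverse mode (a VJP, one pass per output) recover the same gradient:

```python
import jax
import jax.numpy as jnp

def g(x):                                   # made-up scalar test function
    return jnp.sin(x[0]) * x[1] + x[1] ** 2

x = jnp.array([0.5, 2.0])

# Forward mode: one JVP per input direction builds the gradient entry by entry.
basis = jnp.eye(2)
fwd_grad = jnp.stack([jax.jvp(g, (x,), (e,))[1] for e in basis])

# Reverse mode: one VJP, seeded with 1.0, yields the whole gradient at once.
_, vjp_fn = jax.vjp(g, x)
rev_grad, = vjp_fn(1.0)

print(jnp.allclose(fwd_grad, rev_grad))     # True: the two modes agree
```

For a scalar-valued function of many variables the reverse-mode version is the cheap one, which is exactly why the augmented Lagrangian setting above is so friendly to these tools.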