By "total least squares" you mean to pick regression coefficients that minimize the squared errors between the observed values of the dependent variable and the predicted value from the regression? The predicted values are from a perpendicular projection, and we do get the Pythagorean theorem with
total sum of squares = regression sum of squares + error sum of squares
So, we minimize the error sum of squares. Does that really control for measurement error? Not in any simple way I can see.
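For what it's worth, here is a minimal numerical sketch of that Pythagorean decomposition (my own, with made-up data, not anything from the thread): with an intercept in the model the fitted values are an orthogonal projection of the observations, so the total, regression, and error sums of squares satisfy the identity up to rounding.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=(n, 2))                   # made-up independent variables
    y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x])          # design matrix with an intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
    yhat = X @ beta

    tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
    rss = np.sum((yhat - y.mean()) ** 2)          # regression sum of squares
    ess = np.sum((y - yhat) ** 2)                 # error (residual) sum of squares
    print(tss, rss + ess)                         # equal up to floating-point error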
Or, if we have errors in the measurements of the independent variables, then we are facing one of the facts of life, we don't have the error-free, true values. Not good but usually not as bad as having your marriage fail or losing your pet dog or cat! Like the video clip of a Heifetz master class where a student tried to play the D flat minor scale and at the end Heifetz assured the student that they were still alive!
Maybe what you are saying is that with enough mathematical assumptions, e.g., the famous homoscedasticity with independent, identically distributed mean zero Gaussian errors, as the number of observations goes to infinity the errors wash out much as in the weak/strong laws of large numbers and we get to f'get about the errors -- maybe there is such a theorem, I should get out my copy of the old
C. Radhakrishna Rao, Linear Statistical Inference and Its Applications: Second Edition, ISBN 0-471-70823-2, John Wiley and Sons, New York,
and look or do some such derivations for myself.
But, to what end if we don't believe the mathematical assumptions behind the mathematical theorems, e.g., homoscedasticity with independent, identically distributed mean zero Gaussian errors?
Sorry, from 50,000 feet up, it seems to me that having control variables in regression is shaky stuff. And without some careful derivations, we should not be surprised at the effects of various errors.
Also, the usual derivations of the math are in the context of just some one regression model where we make all those assumptions. Instead, given one dependent variable, 10 independent variables, five more variables we believe are causes, and 10 more we want to use as controls -- the dependent variable plus 25 more variables in all -- last time I checked we were short on guidance for how to pick among the 2^25 possible sets of independent variables and how to make sense of the different, maybe wildly different, coefficients we get.
Here's a simple view: If we have 5 independent variables and they are all orthogonal, then we can get the regression coefficients one at a time just from 5 projections, covariances, inner products (all essentially the same thing except for how we scale things) and have those coefficients be just the same for any of the 2^5 regression analyses. That is, if we have orthogonal independent variables U, V, W, X, and Y and dependent variable Z, then we can get the coefficients one at a time and be done -- have all the regression coefficients for all 2^5 regressions. Otherwise, without orthogonality, we face some possibly tricky math derivations -- maybe they are in Rao's book, it's thick enough -- and are asking a bit too much from regression analysis.
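A small sketch of that orthogonal-variables point (my own construction, made-up data): with orthonormal columns each coefficient is just the inner product with the dependent variable and stays the same no matter which subset of the other variables is in the model; make the columns correlated and the same coefficient wanders from subset to subset.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500

    # Orthonormal columns via QR, then a dependent variable built from them plus noise.
    Q, _ = np.linalg.qr(rng.normal(size=(n, 5)))
    z = Q @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)

    def first_coef(M, cols):
        """OLS coefficient of the first listed column when regressing z on M[:, cols]."""
        b, *_ = np.linalg.lstsq(M[:, cols], z, rcond=None)
        return b[0]

    # Orthogonal case: the coefficient of column 0 is the same in every subset,
    # and it equals the one-at-a-time projection <q0, z> / <q0, q0>.
    print(first_coef(Q, [0]), first_coef(Q, [0, 1, 2]), first_coef(Q, [0, 1, 2, 3, 4]))
    print(Q[:, 0] @ z / (Q[:, 0] @ Q[:, 0]))

    # Correlated case: mix the columns and the "same" coefficient shifts by subset.
    C = Q @ np.array([[1.0, 0.8, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.8, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.8, 0.0],
                      [0.0, 0.0, 0.0, 1.0, 0.8],
                      [0.0, 0.0, 0.0, 0.0, 1.0]])
    print(first_coef(C, [0]), first_coef(C, [0, 1, 2]), first_coef(C, [0, 1, 2, 3, 4]))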
Others have seen this swamp, and a current idea from the machine learning community, going back at least to L. Breiman, is that we are not really looking for coefficients, t-tests on the coefficients, F-ratio tests on the regressions, confidence intervals on the predictions (for those we might try some resampling ideas), importance of coefficients, causes, control variables, etc., but are just looking for a fit that can predict: To this end we put the data into at least two buckets, fit to one bucket, and test on another one. Our main criterion is just that the model predicts well for the data we have. That is, all the data in all the buckets satisfies the same statistical assumptions, whatever the heck they are, and we are just fitting and then testing (confirming) on simple random samples of data all from some one big bucket. Yes, we still run into the issue of overfitting: fit well in the first bucket but flop terribly when testing on the second bucket. Okay, a bit crude, uncouth, vulgar, primitive, ..., etc., but maybe useful in some cases -- apparently Breiman made it useful in some cases of medical data.
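A toy sketch of that two-bucket idea (my own, made-up data): fit on a random half, then judge only by the prediction error on the held-out half; a large gap between the two errors is the overfitting warning sign.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000
    X = rng.normal(size=(n, 5))
    y = X @ np.array([1.0, 0.5, 0.0, -1.0, 2.0]) + rng.normal(size=n)

    idx = rng.permutation(n)                      # shuffle, then split into two buckets
    train, test = idx[: n // 2], idx[n // 2:]

    Xtr = np.column_stack([np.ones(len(train)), X[train]])
    Xte = np.column_stack([np.ones(len(test)), X[test]])
    beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)

    mse_train = np.mean((y[train] - Xtr @ beta) ** 2)
    mse_test = np.mean((y[test] - Xte @ beta) ** 2)
    print(mse_train, mse_test)                    # close here; a big gap would signal overfitting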
I inspired a great soliloquy! Thanks for the thoughts.
My point: total least squares includes X in the error minimization, not just Y and a linear combination of X. There is a good introductory discussion on wiki -- essentially, in standard regression we typically assume no measurement error in the independent variables.[0]
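To make that concrete, here is a toy simulation (my own numbers, one predictor, equal error scales in x and y): ordinary least squares on the error-laden x attenuates the slope toward zero, while total least squares -- computed here as orthogonal regression from the SVD of the centered data -- also charges an error to x and comes back close to the true slope.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 2000
    true_slope = 2.0
    x_true = rng.normal(size=n)
    x_obs = x_true + 0.5 * rng.normal(size=n)     # measurement error in the independent variable
    y = true_slope * x_true + 0.5 * rng.normal(size=n)

    # OLS slope of y on the observed x: biased toward zero (attenuation).
    xc, yc = x_obs - x_obs.mean(), y - y.mean()
    ols_slope = (xc @ yc) / (xc @ xc)

    # TLS slope: the fitted line's normal is the right singular vector with the
    # smallest singular value of the centered data matrix.
    A = np.column_stack([xc, yc])
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    v = Vt[-1]
    tls_slope = -v[0] / v[1]

    print(ols_slope, tls_slope, true_slope)       # roughly 1.6, 2.0, 2.0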
As much value as machine learning brings, there is a need for explaining as much as there is for predicting![1]
Your point on whether there is "true control" seems to agree with Pearl's main point of contention -- does the causal graph (which is testable) make sense in a theoretical, experiential, or systemic sense?
Okay, "total least squares" as in your [0]!!!! WOW!!! Back when I knew nothing of regression or curve fitting and was first considering the issue, the question I asked was, if we are trying to fit to the given data, why not have the line as close as possible to each of the points on the scatter diagram, just as in the quite good picture at [0]!! Gee, value of ignorance!!
Again I believe we are trying to make too much out of regression.
Or, maybe, if somehow we DO have causes, we really know they are the real causes, and we have some data, good data, and the data likely satisfies the usual assumptions as in the reference I gave to Rao, THEN, maybe, on a good day, with luck, we do the regression calculations, the t-tests, the F-ratio, get the confidence intervals on the coefficients and the predicted values, etc., and if all that looks solid, then take it seriously.
Here, however, we would KNOW the independent variables, all of them, KNOW that they are the causes, have no need for controls, not be fishing for the variables, and not be trying to have statistics tell us about causes -- then maybe okay.
But, sure, if there really are causes and if we really do have variables that measure those causes well, then maybe in the regression the variables that are candidates as causes will become fairly obvious.
Regression is useful because it allows us to interpolate within observed populations using relatively light assumptions. Extrapolation requires higher order theories and structure. Agreed that it can be a logical mess when one uses it bluntly, but like all tools it has its uses and misuses.
Total least squares is pretty bog-standard in statistics and has a lot of literature, including monographs and textbooks.
You are correct about the Pythagorean theorem, and by virtue of that TLS has close connections with PCA; in fact, once you have the PCA model you can derive the TLS coefficients from the PCA parameters.
The tricky bit is that in TLS the number of nuisance parameters grows with the data, so it isn't immediately obvious that the estimates would converge. It turns out that they do.
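A short sketch of that PCA connection (my own toy data, two variables): the TLS line is the first principal axis of the centered point cloud, so the slope read off the leading PCA direction matches the slope from the SVD-based TLS fit.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1000
    t = rng.normal(size=n)
    x = t + 0.3 * rng.normal(size=n)              # noise in x
    y = 1.5 * t + 0.3 * rng.normal(size=n)        # noise in y

    A = np.column_stack([x - x.mean(), y - y.mean()])

    # PCA route: leading eigenvector of the covariance matrix is the principal axis.
    cov = A.T @ A / (len(A) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    w = eigvecs[:, -1]                            # direction of largest variance
    slope_from_pca = w[1] / w[0]

    # TLS route: the line's normal is the singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    v = Vt[-1]
    slope_from_tls = -v[0] / v[1]

    print(slope_from_pca, slope_from_tls)         # agree up to rounding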