Statistically controlling for confounding constructs is hard (2016) (plos.org)
143 points by Veedrac on March 31, 2019 | 37 comments



This paper discusses a major issue that I think deserves way more attention from the academically literate crowd. This is the kind of knowledge that should be taught in secondary school, at least in the modern age with papers at everyone's fingertips.

A more approachable summary is in narrative form at

https://www.talyarkoni.org/blog/2016/06/11/the-great-minds-j...

> “But there’s a problem: statistical control–at least the way people typically do it–is a measurement-level technique. Meaning, when you control for the rate of alcohol use in a regression of cancer on bacon, you’re not really controlling for alcohol use. What you’re actually controlling for is just one particular operationalization of alcohol use–which probably doesn’t cover the entire construct, and is also usually measured with some error.”

I strongly suggest reading the comment on that page from one of the authors as well.


Causal inference has been a huge issue in statistics and its subject fields for a while; see DAGs, instrumental variables, regression discontinuity, RCTs, natural experiments...

For the main figures, the question is not whether causal inference is important or whether traditional research design is problematic.

The issue is with all these big-ego guys. For example, Pearl pretty much states that DAGs are the only way, that his conception is perfect, and that everyone who disagrees doesn't know statistics (even though DAGs cannot capture all models of causal inference).

Then Angrist and Pischke come along and say everyone using parametrized designs, including Pearl, is not believable, and it's RCT logic or bust.

And Gelman fights with everyone anyway.

Much of this is less academic dispute than internet blogfighting, sadly.

The issue is that in practice we need to do something with imperfect observational data. It's really easy to critique any study on identification and endogeneity. It's difficult to solve the issue!


I don't think it is sad, I think it is wonderful! This is philosophy of science in action. We don't know the right answer yet, it's not obvious, and so we should absolutely argue it out if only to challenge the status quo.


What models of causal inference do you have in mind that are not amenable to DAGs?

DAGs are useful because they share similar axioms with probability theory.


There were a couple of amazing papers recently on polygenic risk scores (now trendy in human genetics in place of GWAS, since PRS can include tiny effects from genetic variants that GWAS ignores).

Editorial on both papers here: https://elifesciences.org/articles/45380

In summary, when you run a GWAS, you control for population structure by using the first two or three principal components as covariates. In PRS, the model being built is so complex that controlling with PCs doesn't remove all of the influence of population structure, which is why some famous large-scale human genetic studies fail to replicate.
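For readers who haven't seen this trick, here is a minimal simulated sketch (my own toy construction with made-up, exaggerated numbers, not from the cited papers) of adding the top principal components of the genotype matrix as covariates so that population structure doesn't masquerade as a genetic effect:

    # Toy sketch (illustrative only) of controlling for population structure
    # by adding top genotype principal components as covariates.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n_people, n_snps = 2_000, 200

    # Two populations with (exaggerated) allele-frequency differences -> structure.
    pop = rng.integers(0, 2, size=n_people)
    freq = np.where(pop == 0, 0.2, 0.5)                     # per-person allele frequency
    genotypes = rng.binomial(2, freq[:, None] * np.ones(n_snps)).astype(float)

    # Phenotype depends on population (say, environment), not on the SNP we test.
    phenotype = 1.0 * pop + rng.normal(size=n_people)

    # Top PCs of the centered genotype matrix capture the structure.
    centered = genotypes - genotypes.mean(axis=0)
    pcs = np.linalg.svd(centered, full_matrices=False)[0][:, :3]

    snp = genotypes[:, 0]                                   # variant being tested
    naive = sm.OLS(phenotype, sm.add_constant(snp)).fit()
    adjusted = sm.OLS(phenotype, sm.add_constant(np.column_stack([snp, pcs]))).fit()
    print(naive.pvalues[1], adjusted.pvalues[1])            # spurious hit vanishes after adjustment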

(This has large implications on a societal level - Twitter is full of 'race realists', i.e., racists, who are using spurious PRS to prop up their foregone conclusions.)


It's not remotely that bad... You're leaving out important context here. What failed to replicate was not the GWASes but inferred selection signals from the GWAS, which is very different. Actually, two thirds of those replicated anyway (so they're doing much better than, say, social psychology or medicine). And the third one that didn't replicate failed because the check that would have caught it, the sibling comparison, was in fact done but turned out to have incorrectly validated it due to a Plink software bug (the kind of error that could happen to literally any research result these days). The situation remains precisely as it was before: GWAS results are generally trustworthy, especially when validated by sibling comparisons, and human selection is pervasive.


Can you point me to a reference re: what happened for the third study? If this was caused by an actual Plink bug instead of incorrect usage, I need to verify that the bug has been fixed...


From 10,000 feet up, a two step solution:

(1) Get a lot of data.

(2) Do finely grained cross tabulation.

Why? Because cross tabulation is a discrete version of the most powerful foundation, but for challenging questions it can need a lot of data. Curve fitting is what we are pushed into when we don't have so much data.

Or suppose we build a model of the probability of an auto accident. Okay, we want to evaluate the model for someone who is 5' 2", 105 pounds, 17, blond, speaks only Swedish and Russian, and is in the US in LA driving an 18-wheel truck for the first time while talking on her cell phone with her sister in Berlin.

So, for that query, instead of a model, just have a lot of data and cross tabulate, and the cell with that person delivers the answer immediately, directly. Moreover, the answer is unbiased and minimum variance (least squares). But did I mention, we need a lot of data?
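For concreteness, a toy sketch of the cross-tabulation idea in pandas; the dataset, column names, and accident mechanism are all made up, and the point is only that the "model" is a lookup table of cell averages:

    # Toy sketch of "cross tabulate instead of fit"; all data are synthetic.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 100_000
    df = pd.DataFrame({
        "age":      rng.integers(16, 90, size=n),
        "truck":    rng.random(n) < 0.05,
        "on_phone": rng.random(n) < 0.2,
    })
    # Hypothetical accident mechanism: risk rises for young drivers, phones, trucks.
    risk = 0.02 + 0.03 * (df["age"] < 25) + 0.04 * df["on_phone"] + 0.05 * df["truck"]
    df["accident"] = rng.random(n) < risk

    # Discretize, then every combination of attributes defines one cell.
    df["age_bin"] = pd.cut(df["age"], bins=[15, 25, 45, 65, 90])
    cells = df.groupby(["age_bin", "truck", "on_phone"], observed=True)["accident"]
    table = cells.agg(rate="mean", count="size")
    print(table)
    # The answer for a given driver is a direct lookup in their cell -- no model --
    # but thin cells (small "count") are exactly where you need much more data.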


This is fundamentally why we have deep neural networks. Fitting a simplified curve in a huge space.

The combination you named may never have been observed before. The curse of dimensionality strikes.


Yup. Did I mention that with cross tabulation we'd need a lot of data!!!! :-)!!!


But isn't that the whole problem? You quickly run into the case where the required sample size is larger than the entire population size.


So, you expect us to boil it all down to just F = ma, and f'get about the apple, the apple tree, the time of day, the phase of the moon, the temperature of the day, what Newton was wearing?????

That was a really good shot at causality. Let me just say, quite generally, causality is super tough to find.


You did!


How were those embedded charts built? They look like base R but I really like the download option. Is the library under GMT?


Anyone interested in this article should also take a look at some of Pearl’s work on causal inference. Here is one article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2836213/


And a paper somewhat related: How Much Data Do You Need? A Pre-asymptotic Metric for Fat-tailedness [1]

[1] https://arxiv.org/pdf/1802.05495.pdf


Of course it is. This is only news to people who think being able to use a couple of Python libraries makes them a "scientist" of some sort.

See also instrumental variables[1]. Oh, sorry, I forgot, who needs all of the thought that went into the development of econometrics as a discipline? We now have data mining as a career.

[1]: https://www.nuffield.ox.ac.uk/teaching/economics/bond/instru...


Translating to the language of econometrics, I think the point of this paper is that oftentimes, the instrument that you're using has errors in its measurement, or is just a proxy for the 'real' instrument that you'd like to use.

If that's the case, when IV is interpreted as a two-stage regression, you might still have covariates that are correlated with the residuals in the second stage.
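For anyone who hasn't seen the two-stage interpretation, here is a minimal simulated sketch (my own toy numbers, not from the paper) of IV mechanics: an unobserved confounder biases plain OLS, while projecting the endogenous regressor onto a valid instrument in a first stage recovers the causal effect.

    # Minimal sketch of IV as two-stage regression on simulated data.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 20_000
    u = rng.normal(size=n)                       # unobserved confounder
    z = rng.normal(size=n)                       # instrument: affects x, not y directly
    x = 1.0 * z + 1.0 * u + rng.normal(size=n)
    y = 0.5 * x + 1.0 * u + rng.normal(size=n)   # true causal effect of x is 0.5

    ols = sm.OLS(y, sm.add_constant(x)).fit()
    print("OLS estimate:", ols.params[1])        # biased upward by the confounder

    # Stage 1: project x onto the instrument; Stage 2: regress y on the projection.
    x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
    tsls = sm.OLS(y, sm.add_constant(x_hat)).fit()
    print("2SLS estimate:", tsls.params[1])      # close to 0.5
    # Note: the second-stage standard errors need correction; dedicated IV
    # routines handle that.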


>See also instrumental variables

See also causal inference, a subfield of statistics that generalizes instrumental variables and other "controlling" constructs in a very simple and effective way.


So, the article is about regression analysis. When I first studied the topic, I saw the interest in assigning importance to variables one at a time based on the size of their regression coefficients.

Well, there is some math supporting regression analysis if we want to use that math, i.e., confirm the assumptions, which is nearly never doable.

So, I did see that if the independent variables were all orthogonal, this one at a time work had some support. E.g., in linear algebra look at Bessel's inequality.

Otherwise, we can easily get confused: So, we're trying to predict Y from U and V. U doesn't do at all well; neither does V, but together U and V predict Y essentially perfectly. How, why? U spans a vector space, and Y is not in that space. We can project Y onto that vector space and have the projection small (coefficient of U small). Same for V. But U and V together span a vector space, one that includes Y -- so U and V predict Y perfectly. Just simple vector space geometry. It can happen.
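A quick numerical illustration of that geometry (my own construction): u and v are nearly collinear, each nearly orthogonal to y on its own, yet together they span y exactly.

    # u and v each predict y poorly alone, but together they predict it exactly.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    u = a
    v = -a + 0.1 * b           # u and v are almost perfectly (negatively) correlated
    y = u + v                  # i.e. y = 0.1 * b

    def r_squared(y, X):
        """R^2 of an OLS fit of y on the columns of X (with intercept)."""
        X = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1 - resid.var() / y.var()

    print(r_squared(y, u))                        # ~0.00: u alone explains almost nothing
    print(r_squared(y, v))                        # ~0.01: v alone explains almost nothing
    print(r_squared(y, np.column_stack([u, v])))  # 1.0: together they are exact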

This stuff about controlling can make some good sense in cross-tabulation (see my post here on that), but in regression analysis? Controlling? Where do they get the really strong funny stuff they've been smoking to believe in that?


Mindblowing. No wonder so much science doesn't replicate.

I read scientific papers fairly often compared to the average person, and this is probably the most important one I've read in years. It feels like a similar problem to Spectre in the CPU design space: a problem nobody noticed for decades, that breaks fundamental assumptions in ways that appear nearly impossible to fix. Except this is way bigger and more important than Spectre.

Here's a quick summary of the paper and its conclusions:

1. Virtually all claims of the form "We have found X is an important factor when controlling for other factors" are wrong, because they ignore the possibility of measurement error, which has unintuitive and enormous effects.

2. As a consequence, vast swathes of so-called findings in medical and social sciences are spurious to an even greater extent than previously realised. The paper cites Google Scholar searches to show that hundreds of thousands of papers contain trigger phrases like "after controlling for".

3. The problem is fundamental and can't be fixed.

For (3) I should explain, because the paper does propose what initially looks like a fix: an improved form of statistical modelling that incorporates measurement error into the outcomes (SEM). This can restore statistical validity.

Unfortunately as we study this idea closer it becomes clear that there's no way scientists are going to adopt it:

• It explodes the sample sizes and therefore cost.

• Frequently we have no idea what the measurement error actually is.

• Therefore it becomes impossible to say with any confidence whether a claim is true or false.

Even very small amounts of measurement error can push the confidence level of the outcome under the already arbitrary and probably too lax 95% threshold. To detect a weak effect size of 0.1, handling realistic error rates could easily take you from needing hundreds of samples (grad student populations) to tens of thousands.

To clarify the term "measurement error" here, what the paper gets at is that in social sciences they are often trying to measure things indirectly via e.g. surveys which have some inherent noisiness due to people lying, not understanding the question, not being sure of the answer but being unwilling to admit it etc. And then on top of that they are often trying to measure something ill-defined to begin with, e.g. mood, intelligence, wealth.

But as noise rates go up, the chance of spuriously detecting correlations that don't exist approaches 100%. The paper does various simulations to show that this can easily happen with sample sizes, p values and error rates that are standard.

The part that really nailed it for me was where they use a hypothetical example of a correlation between ice cream sales and swimming pool deaths. Common sense tells us ice cream doesn't cause people to die in pools, but rather, on hot days more people go swimming and buy ice cream: it's a confounding variable. In the case where the confounding variable is accurately measured with a thermometer, measurement errors are zero and the typical sort of regression analysis scientists use shows no correlation as expected. But if temperature is measured using a survey where people self report perceived body temperatures on the Likert scale, then you get an error rate of about 0.4. At that point a regression analysis shows a strong direct correlation between eating ice cream and dying in a pool with p < 0.001 - much lower than the p < 0.05 threshold needed to trigger publishing.
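A rough simulation of that ice cream example (my own sketch: made-up effect sizes and simple additive noise on the confounder rather than the paper's Likert setup) shows the same pattern of a noisily measured confounder leaving a spurious "direct" effect after controlling for it:

    # Rough, illustrative sketch; not the paper's code.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5_000

    temperature = rng.normal(size=n)                        # true confounder
    ice_cream   = 0.7 * temperature + rng.normal(size=n)    # driven by temperature
    drownings   = 0.7 * temperature + rng.normal(size=n)    # also driven by temperature

    # Noisy proxy for the confounder, e.g. self-reported warmth (reliability ~0.6).
    temp_reported = temperature + rng.normal(scale=0.8, size=n)

    def ice_cream_effect(control):
        """Regress drownings on ice cream sales plus the given control variable."""
        X = sm.add_constant(np.column_stack([ice_cream, control]))
        fit = sm.OLS(drownings, X).fit()
        return fit.params[1], fit.pvalues[1]

    print(ice_cream_effect(temperature))    # coefficient ~0: properly controlled
    print(ice_cream_effect(temp_reported))  # sizable, highly "significant" coefficient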

A big question is what does this mean for science and the academy? I can see nothing happening in the short run: it seems this paper wasn't picked up by journalists, or at least I've not seen any mention of it even in discussions of the replication crisis. Scientists themselves can't really pay attention to it without breaking the entire funding model of non-STEM sciences.

In the long run this could possibly lead to a major breakdown in society's notions of expertise, trustworthiness of scientists and the value of government funded academia (corporate funding of social sciences is close to zero). How many decades have people been reading stories of the form "$foo causes $bar" or "$baz (good thing) is highly correlated with $zap (bad thing)"? Many people have already clocked that this kind of science is unreliable - now we seem to know precisely why.


> Scientists themselves can't really pay attention to it without breaking the entire funding model of non-STEM sciences.

I can't speak for other social sciences, but in academic economics/econometrics the error-in-variables/measurements problem is a very old and well known issue. (It was in my undergraduate econometrics class.) See e.g. these decades old articles: https://www.jstor.org/stable/pdf/1401917.pdf https://www.jstor.org/stable/1913020

Of course it still gets ignored in plenty of empirical economic research, but as you mentioned, the main issue is probably the incentives facing the researchers.


For measurements, the social scientists have been considering what they call reliability and validity for a long time. So, they do consider the accuracy of measurements.

In science we collect data then use the data. If the data has some measurement error, no big surprise. Instead just say that the model is in terms of the measured values with their errors, instead of some much more accurate values that we don't have.

Or, if we have more accurate values, then use them. Else don't feel guilty and beat up on ourselves.

See also my earlier post in this thread on cross tabulation.


Why isn't total least squares used to control for measurement error in regressors?


By "total least squares" you mean to pick regression coefficients that minimize the squared errors between the observed values of the dependent variable and the predicted value from the regression? The predicted values are from a perpendicular projection, and we do get the Pythagorean theorem with

total sum of squares = regression sum of squares + error sum of squares

So, we minimize the error sum of squares. Does that really control on measurement error? Not in any simple way I can see.

Or, if we have errors in the measurements of the independent variables, then we are facing one of the facts of life, we don't have the error-free, true values. Not good but usually not as bad as having your marriage fail or losing your pet dog or cat! Like the video clip of a Heifetz master class where a student tried to play the D flat minor scale and at the end Heifetz assured the student that they were still alive!

Maybe what you are saying is that with enough mathematical assumptions, e.g., the famous homoscedasticity with independent, identically distributed mean zero Gaussian errors, as the number of observations goes to infinity the errors wash out much as in the weak/strong laws of large numbers and we get to f'get about the errors -- maybe there is such a theorem; I should get out my copy of the old

C. Radhakrishna Rao, Linear Statistical Inference and Its Applications: Second Edition, ISBN 0-471-70823-2, John Wiley and Sons, New York.

and look or do some such derivations for myself.

But, to what end if we don't believe the mathematical assumptions for the mathematical theorems, e.g., homoscedasticity with independent, identically distributed mean zero Gaussian errors?

Sorry, from 50,000 feet up, it seems to me that having control variables in regression is shaky stuff. And without some careful derivations, we should not be surprised at the effects of various errors.

Also, the usual derivations of the math are in the context of just some one regression model where we make all those assumptions. Instead, given one dependent variable and 10 independent variables, plus five more we believe are causes, plus 10 more we want to use as controls (the dependent variable plus 25 more variables in all), last time I checked we were short on how to pick among the 2^25 sets of independent variables and make sense of the different, maybe wildly different, coefficients we get.

Here's a simple view: If we have 5 independent variables and they are all orthogonal, then we can get the regression coefficients one at a time just from 5 projections, covariances, inner products (all essentially the same things except how we scale things) and have those coefficients the same for any of the 2^5 regression analyses. That is, if we have orthogonal independent variables U, V, W, X, and Y and dependent variable Z, then we can get the coefficients one at a time and be done -- we have all the regression coefficients for all 2^5 regressions. Otherwise, without orthogonality, we face some possibly tricky math derivations -- maybe they are in Rao's book, it's thick enough -- and are asking a bit too much from regression analysis.
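A quick numerical check of that claim (toy example, orthonormal columns, no intercept for simplicity): each coefficient is just an inner product, and it is unchanged across the sub-regressions.

    # With orthonormal columns, per-column projections equal the joint OLS
    # coefficients, and dropping columns doesn't change the remaining ones.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 500, 5
    Q, _ = np.linalg.qr(rng.normal(size=(n, k)))   # 5 orthonormal columns (U, V, W, X, Y)
    z = Q @ np.array([3.0, -1.0, 0.5, 2.0, 4.0]) + 0.1 * rng.normal(size=n)

    one_at_a_time = Q.T @ z                                 # 5 separate projections
    joint, *_ = np.linalg.lstsq(Q, z, rcond=None)           # full 5-variable regression
    subset, *_ = np.linalg.lstsq(Q[:, :2], z, rcond=None)   # regression on the first 2 only

    print(one_at_a_time)   # ~[3, -1, 0.5, 2, 4]
    print(joint)           # same coefficients
    print(subset)          # first two coefficients, again unchanged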

Others have seen this swamp, and a current idea from the machine learning community, going back at least to L. Breiman, is that we are not really looking for coefficients, t-tests on the coefficients, F-ratio tests on the regressions, confidence intervals on the predictions (for that might try some resampling ideas), importance of coefficients, causes, control variables, etc., but are just looking for a fit that can predict.

To this end we put the data into at least two buckets, fit to one bucket, and test in another one. Our main criterion is just that the model predicts well for the data we have. That is, all the data in all the buckets has the same statistical assumptions, whatever the heck they are, and we are just fitting and then testing (confirming) on simple random samples of data all from some one big bucket. Yes, we still run into the issue of overfitting: fit well in the first bucket but flop terribly testing on the second bucket.

Okay, a bit crude, uncouth, vulgar, primitive, ..., etc. but maybe useful in some cases -- apparently Breiman made it useful in some cases of medical data.


I inspired a great soliloquy! Thanks for the thoughts

My point: total least squares includes X in the error minimization, not just Y and a linear combination of X. There is a good introductory discussion on wiki -- essentially, in standard regression we typically assume no measurement error in the independent variables.[0]

As much value as machine learning brings, there is a need for explaining as much as there is for predicting![1]

Your point on whether there is "true control" seems to agree with Pearl's main point of contention -- does the causality plot (which is testable) make sense from a theoretical, experiential, or systemic sense?

[0] https://en.wikipedia.org/wiki/Total_least_squares#/media/Fil...

[1] https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf


Okay, "total least squares" as in your [0]!!!! WOW!!! Back when I knew nothing of regression or curve fitting and was first considering the issue, the question I asked was, if we are trying to fit to the given data, why not have the line as close as possible to each of the points on the scatter diagram, just as in the quite good picture at [0]!! Gee, value of ignorance!!

Again I believe we are trying to make too much out of regression.

Or, maybe, if somehow we DO have causes, we really know they are the real causes, and we have some data, good data, and the data likely satisfies the usual assumptions as in the reference I gave to Rao, THEN, maybe, on a good day, with luck, do the regression calculations, t-tests, the F-ratio, get the confidence intervals on the coefficients and the predicted values, etc. and if all that looks solid, then take it seriously.

Here, however, we KNOW the independent variables, all of them, KNOW that they are the causes, don't need controls, etc., and are not fishing for the variables; we are not trying to have statistics tell us about causes. Then maybe okay.

But, sure, if there really are causes and if we really do have variables that do well measuring those causes, then maybe in the regression the variables that are candidates as causes will become fairly obvious.


Regression is useful because it allows us to interpolate within observed populations using relatively light assumptions. Extrapolation requires higher order theories and structure. Agreed that it can be a logical mess when one uses it bluntly, but like all tools it has its uses and misuses.


Ah, cruel, you are so cruel, how could you be so cruel; the OP was hoping for something so much better than just some interpolation!!!

Cruel or not, at least in practice, you are on solid ground.


Then you may be interested in pursuing structural models!

Thanks for the chat.


Total least squares is pretty bog standard in statistics and has a lot of literature, including monographs and text books.

You are correct about the Pythagoras theorem, and by virtue of that TLS has close connections with PCA; in fact, once you have the PCA model you can derive the TLS coefficients from the PCA parameters.

The tricky bit is that in TLS the number of nuisance parameters grows with the data, so it wouldn't be immediately obvious that the estimates would converge. It turns out that they do.
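To make the PCA/SVD connection concrete, here's a minimal sketch (toy data, a single predictor, errors of equal scale in both variables, which is the case where plain TLS is consistent): the TLS coefficients come from the right singular vector of the centered augmented matrix [X | y] with the smallest singular value.

    # OLS attenuates under measurement error in x; TLS via SVD does not (here).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5_000
    x_true = rng.normal(size=n)
    y = 2.0 * x_true + rng.normal(scale=0.5, size=n)   # true slope is 2
    x = x_true + rng.normal(scale=0.5, size=n)         # predictor observed with error

    # OLS slope is attenuated toward zero by the measurement error in x.
    cxy = np.cov(x, y)
    ols_slope = cxy[0, 1] / cxy[0, 0]

    # TLS slope from the smallest right singular vector of the centered [x | y].
    A = np.column_stack([x - x.mean(), y - y.mean()])
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    v = vt[-1]                                         # direction of least variance
    tls_slope = -v[0] / v[1]

    print(ols_slope, tls_slope)   # roughly 1.6 vs 2.0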


> non-STEM sciences.

The first word of "STEM" is science. You mean something there that isn't well expressed by what you said.


You're right, that was sloppy.

I was getting at "social sciences" or alternatively "sciences in which things cannot be precisely measured".


Observational studies claiming causality are flawed compared to randomised controlled experiments.

Hopefully in time research will gravitate toward RCTs.

Ronny Kohavi, a Distinguished Engineer at Microsoft, has a great presentation on flawed observational studies here:

https://www.exp-platform.com/Documents/2016-11BestRefutedCau...


But the problem is that in many application domains you cannot obtain holdout sets. For example, if you measure the causal impact of ads for a client, they usually will not agree to accept the cost of holdout inventory in order to measure the effects. They’d rather ensure all the inventory is used and let the observational post-hoc attempt at causal inference have errors... they knowingly prefer that trade-off.

There are actually a lot of areas like this, where you are just given a one-sided observational data set and tasked with recovering causal effects with no option to collect a holdout set. Often nobody consciously chose it that way over an RCT; it’s just how it happened.


Or simple covariates like a subject's age. If we could randomly assign people to an age, then... we'd have conquered aging, I guess.


Confounding variables like destitute candidates in the hiring process with a model trained by educated people with savings?



