Simpson's Paradox is deeper than conventional analyses suggest. I'm going to use the pilots'-late-arriving-flights data:
name | delay%
alice | 30%
bob | 20%
So Bob's flights are delayed less often, he's the better pilot, right? Yet:
name | night | day
alice | 7/25 | 1/5
bob | 3/10 | 3/20
So now Alice looks like the better pilot!
But wait, what if the pilots are responsible for scheduling their own flights? Bob's delay rates within each time slot might be somewhat worse than Alice's, but he's making better decisions about when to fly.
But wait, what if Alice and Bob fly out of the same airport(s), and they've agreed to let Bob fly during the day... (this is not meant to be an accurate representation of air traffic control)
When faced with a Simpson-like situation, a correct analysis usually requires considering the chain of causation, in particular whether the stratification of the data itself depends on the independent variable being tested. If the strata are a result of the test variable, stratifying on them usually isn't a good idea. In the Berkeley example, that would require the highly unlikely situation that applicants were automatically assigned to departments with gender taken into consideration, so stratifying by department was valid.
In the past, I've considered Simpson's paradox to be nothing more than an amusing quirk of statistics. But the parent example has helped me see that Simpson's paradox is frighteningly inevitable.
For example, a conclusion like:
"People taking a particular drug have worse outcomes."
doesn't reflect on the efficacy of the drug at all - because people choosing to take the drug are presumably in greater need.
Same in the piloting example above - Alice is scoring worse only because she has the more difficult assignments (perhaps because she is indeed the better pilot).
It really is a particularly pathological special case of "correlation does not imply causation." Often the key is the presence of a confounding variable.
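To make the drug case concrete, here is a minimal sketch with made-up numbers. The counts, the "severe"/"mild" split, and the names `outcomes`, `rate` and `pooled_rate` are all hypothetical, chosen only so that sicker patients are the ones who mostly take the drug:

```python
# Hypothetical (recovered, total) counts, invented purely for illustration.
# Severe cases mostly take the drug; mild cases mostly do not.
outcomes = {
    "severe": {"drug": (40, 100), "no_drug": (3, 10)},
    "mild": {"drug": (9, 10), "no_drug": (80, 100)},
}

def rate(recovered, total):
    """Recovery rate for one cell of the table."""
    return recovered / total

def pooled_rate(arm):
    """Recovery rate for one arm with severity ignored."""
    recovered = sum(outcomes[s][arm][0] for s in outcomes)
    total = sum(outcomes[s][arm][1] for s in outcomes)
    return recovered / total

for severity, arms in outcomes.items():
    print(severity, {arm: round(rate(*counts), 2) for arm, counts in arms.items()})
print("pooled", {arm: round(pooled_rate(arm), 2) for arm in ("drug", "no_drug")})
```

With these invented counts the drug arm recovers more often within each severity group, yet less often once the groups are pooled, because severity drives both who takes the drug and who recovers.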
I don't see how the second table makes Alice look like the better pilot. She's better than Bob in the night, but worse than Bob in the day, and overall worse. Could someone explain?
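A quick way to check this is to do the arithmetic on the night/day table exactly as posted. The sketch below assumes each cell is delayed/total flights; the names `flights`, `rates` and `overall` are just for illustration:

```python
from fractions import Fraction

# (delayed, total) per pilot and time slot, copied from the night/day table.
flights = {
    "alice": {"night": (7, 25), "day": (1, 5)},
    "bob": {"night": (3, 10), "day": (3, 20)},
}

def overall_rate(strata):
    """Delay rate with night and day flights pooled together."""
    delayed = sum(d for d, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return Fraction(delayed, total)

rates = {
    name: {slot: Fraction(d, t) for slot, (d, t) in strata.items()}
    for name, strata in flights.items()
}
overall = {name: overall_rate(strata) for name, strata in flights.items()}

for name in flights:
    per_slot = {slot: f"{float(r):.1%}" for slot, r in rates[name].items()}
    print(name, per_slot, f"overall {float(overall[name]):.1%}")
```

Taken at face value, the table gives Alice 28% at night against Bob's 30%, but 20% during the day against Bob's 15%, so she wins only the night stratum. Her pooled rate also works out to 8/30, about 26.7%, rather than the 30% in the first table. For a full reversal, Alice would need the lower rate in both columns.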
I don't understand how Simpson's paradox is different from missing an explanatory variable and confusing correlation vs. partial correlation.
In Wikipedia's article header chart, what I see is the projection on a plane of a 3D problem, where the 3rd dimension has been overlooked. http://en.wikipedia.org/wiki/Simpson's_paradox
In Bob vs. Alice, I also see that the night/day flight dummy wasn't accounted for, which results in the so-called paradox.
It's just a special case of omitted variables with categorical variables. So instead of parameter estimates being biased up or down x amount (to the extent covariates are correlated with error terms), with Simpson's paradox the mean effect is completely wrong due to improper grouping. This often leads to flipping signs on estimated parameters -- 'surprising' results that get papers published.
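The sign flip is easy to reproduce on toy data. This is only a sketch: the six points and the group labels are invented, and `ols_slope` is a hand-rolled one-variable least-squares slope, not a call into any stats library:

```python
def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical data: y rises with x inside each group,
# but group "b" has larger x and much lower y overall.
groups = {
    "a": [(1, 10), (2, 11), (3, 12)],
    "b": [(6, 2), (7, 3), (8, 4)],
}

within = {name: ols_slope(points) for name, points in groups.items()}
pooled = ols_slope([p for points in groups.values() for p in points])

print("within-group slopes:", within)
print("pooled slope:", round(pooled, 3))
```

Both within-group slopes are positive, while the slope fitted with the group label omitted comes out negative.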
The more complicated examples of Simpson's paradox tend to involve important causes being ignored. But it's not always an issue of causality, as in the example of the two Wikipedia contributors. That example doesn't really have a hidden cause; it's just the use of percentages where total articles is clearly the more useful metric.
Linked post: "Everything is significant in large datasets." Certainly true, and why people should be suspicious if they see a p-value without an effect size.
Original post: "To me, one of the most unfortunate aspects of log-linear analysis as it is commonly practiced is that it is significance testing-centric, rather than based on point or interval estimation."
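A rough illustration of the p-value versus effect-size point, assuming a plain two-proportion z-test with a normal approximation. The counts are invented, and `two_proportion_summary` is a hypothetical helper written for this sketch:

```python
import math

def two_proportion_summary(successes_a, n_a, successes_b, n_b):
    """Difference in proportions, z statistic, two-sided p-value, approx. 95% CI."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff = p_a - p_b
    # Pooled standard error for the test of "no difference".
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se_test = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_test
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    # Unpooled standard error for the interval estimate.
    se_ci = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se_ci, diff + 1.96 * se_ci)
    return diff, z, p_value, ci

# Hypothetical: 50.1% vs 50.0% with ten million observations per group.
diff, z, p_value, ci = two_proportion_summary(5_010_000, 10_000_000, 5_000_000, 10_000_000)
print(f"difference = {diff:.4f}, z = {z:.2f}, p = {p_value:.2e}")
print(f"approx. 95% CI for the difference: ({ci[0]:.5f}, {ci[1]:.5f})")
```

With groups this large, a difference of a tenth of a percentage point clears any conventional significance threshold, which is why the estimate and its interval say more than the p-value alone.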