I found this paper a couple years back when I was writing about statistical errors. I love the right turn on red example -- an everyday situation where bad statistics leads to extra deaths every year. (Somewhere on the order of 10-100, I think.)
The problem, turning "not statistically significant" into "there is no difference," happens all the time in just about every field of science. Often you see people report "three studies found that this medicine works, but two found that it didn't" and conclude that the evidence is contradictory and can't be trusted. But if you look at the effect sizes, you see that all five studies found nearly the same answer -- it's just that two of them didn't quite cross the threshold for significance.
I wish I had a way of teaching statistical thinking more clearly than standard intro classes. It's so weird and counter-intuitive that very few people get it right. I've given it a shot by writing a book (http://www.statisticsdonewrong.com/) but there's a lot more to be done.
>But if you look at the effect sizes, you see that all five studies found nearly the same answer -- it's just that two of them didn't quite cross the threshold for significance.
This is why I wish meta-analysis were introduced much earlier than it is in statistics education. There are sensible ways of combining the information from the five studies, weighting them according to their sample size (provided the studies are similar in design and cohorts). =)
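To make that concrete, here is a minimal Python sketch of a fixed-effect meta-analysis that weights each study by the inverse of its variance (which grows roughly with sample size). The five effect estimates and standard errors are made-up numbers for illustration, not values from the RTOR studies.

```python
import numpy as np
from scipy import stats

# Hypothetical effect estimates (e.g. change in fatality rate) and their
# standard errors from five similar studies -- illustrative numbers only.
effects = np.array([0.20, 0.18, 0.25, 0.15, 0.22])
std_errs = np.array([0.12, 0.09, 0.15, 0.11, 0.10])

# Fixed-effect (inverse-variance) weights: more precise studies count more.
weights = 1.0 / std_errs**2
pooled_effect = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

# 95% confidence interval and two-sided p-value for the pooled effect.
z = pooled_effect / pooled_se
ci_low, ci_high = pooled_effect - 1.96 * pooled_se, pooled_effect + 1.96 * pooled_se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

print(f"pooled effect = {pooled_effect:.3f} "
      f"(95% CI {ci_low:.3f} to {ci_high:.3f}), p = {p_value:.4f}")
```

Even when no single study clears the significance threshold on its own, the pooled estimate can, because the combined precision is higher than that of any individual study.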
Deirdre McCloskey (an economist) has an entire book devoted to this[1]. Her article here: http://www.deirdremccloskey.com/docs/jsm.pdf covers the main argument in the book. One important point she makes is that not all fields misuse p-values and statistical significance. In physics significance is almost always used appropriately, while in social sciences (including economics) statistical significance is often conflated with actual significance.
The single paragraph in the postscript of this paper (part 6) is actually really important. It's very common for people who are using statistical testing in applied settings to entirely forget about type II error (and correspondingly, the power of the test), so when they see a p-value that isn't significant at a certain level (say 5%), they simply assume that the null hypothesis is true.
Of course, this is not correct, and all we can really say is that the test did not reject the null, given the size (type I error rate) and power (one minus the type II error rate) of the test. It's entirely possible that the null should be rejected, but the test is just not very good (i.e. it might have the correct size, but very poor power).
So given some complex and eccentric real-world data, how can we figure out what the power of a given test might be in practice? If you have some idea of what the data generating process might look like then one option is to do some simulations. This enables you to see what the size and power properties of your test are by empirically measuring the type I and type II error rates.
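As a concrete illustration of that simulation approach, here is a small Monte Carlo sketch in Python. The data-generating process, effect size, and sample size are all assumptions chosen just to show the mechanics, not anything from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, alpha, n_sims = 30, 0.3, 0.05, 5000  # assumed values for illustration

def rejection_rate(true_diff):
    """Fraction of simulated experiments in which a two-sample t-test rejects H0."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        rejections += (p < alpha)
    return rejections / n_sims

size = rejection_rate(0.0)        # type I error rate: rejecting when H0 is true
power = rejection_rate(effect)    # power: rejecting when the effect is real
print(f"empirical size  ~ {size:.3f}")
print(f"empirical power ~ {power:.3f}")
```

With 30 observations per group and a true difference of 0.3 standard deviations, the empirical power comes out around 20%, which is exactly the situation where "not significant" gets mistaken for "no effect."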
People often have trouble applying statistical methods correctly, and they can often manipulate statistics to tell a given story. And indeed, P=0.03 on its own is meaningless without an understanding of how the study was set up and whether there is a plausible hypothesis.
But inferential statistics is grounded in sound theory, and, used correctly and with appropriate assumptions, it is a powerful tool for reasoning about data. Without it, how are you supposed to reason about (for example) study results? Appeal to intuition?
It seems to me that significance tests are not all-powerful, or foolproof, but are still a very valuable tool.
> Without [inferential statistics], how are you supposed to reason about (for example) study results? Appeal to intuition?
The alternative, which the author of the linked article mentions, but doesn't really emphasize, is to report a best estimate of whatever effect you are trying to measure, along with some measure of uncertainty in that estimate.
For example, instead of "we failed to find significant evidence that right turn on red increases the expected number of fatalities," you say "our best estimate of the expected increase in fatalities due to right turn on red is 200 +/- 210."
This approach puts the most relevant information front and center and, it seems to me, encourages better intuitive reasoning. It's what engineers and most of the hard sciences do most of the time.
You do also need to say something about the meaning of your uncertainty estimate (e.g. it's 1 sigma, or 2 sigma, or 95%), or alternatively, there needs to be an understood convention for your field.
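To show the mechanics with made-up numbers (not the paper's data), here is a minimal Python sketch that reports a best estimate of a change together with a 95% confidence interval, rather than a bare yes/no significance verdict.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after annual fatality counts at eight jurisdictions --
# purely illustrative, not the RTOR data discussed in the paper.
before = np.array([12, 8, 15, 9, 11, 14, 10, 13])
after  = np.array([15, 7, 16, 12, 10, 17,  9, 14])

diff = after - before
estimate = diff.mean()
se = diff.std(ddof=1) / np.sqrt(len(diff))

# 95% confidence interval using the t distribution (small sample).
t_crit = stats.t.ppf(0.975, df=len(diff) - 1)
ci_low, ci_high = estimate - t_crit * se, estimate + t_crit * se

print(f"estimated change = {estimate:.2f} fatalities per year "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f})")
```

With these made-up numbers the interval happens to cross zero, yet the best estimate is still an increase; reporting it this way conveys far more than "not significant."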
Is the method you described not another form of inferential statistics? (Definitely not hypothesis testing, however.)
It seems that this is a combination of several techniques:
1. First, we either explicitly or implicitly choose a model to relate deaths and RTOR laws (perhaps a linear relationship, i.e. [deaths w/ RTOR] = a*[deaths w/o RTOR]).
2. Then we perform point estimation to estimate the parameter "a".
3. Then we compute a confidence interval for that estimate (a minimal sketch of these three steps follows below).
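Here is a rough Python sketch of those three steps, using made-up intersection-level crash counts; the model, the numbers, and the bootstrap confidence interval are all illustrative assumptions, not the analysis in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical annual crash counts at the same intersections without and with
# RTOR -- illustrative numbers only.
x = np.array([20, 35, 14, 50, 27, 41, 18, 33], dtype=float)  # without RTOR
y = np.array([22, 37, 15, 56, 28, 46, 19, 36], dtype=float)  # with RTOR

def estimate_a(x, y):
    """Least-squares fit of the model y = a * x (regression through the origin)."""
    return np.sum(x * y) / np.sum(x * x)

a_hat = estimate_a(x, y)

# Bootstrap over intersections to get a rough 95% confidence interval for a.
boot = []
for _ in range(5000):
    idx = rng.integers(0, len(x), len(x))
    boot.append(estimate_a(x[idx], y[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"a = {a_hat:.3f} (95% bootstrap CI {ci_low:.3f} to {ci_high:.3f})")
```

The interval on a (compared against 1) then plays the role the significance test would have played, while keeping the estimated effect size visible.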
With respect to the RTOR example in the paper, it seems to me that it WOULD be incorrect to reject the null hypothesis that the change in crash numbers arises from random chance for ANY INDIVIDUAL STUDY. In this case it seems that you must figure out a way to transfer information between studies to establish this idea of "statistical significance." Perhaps a survey of studies or the use of Bayesian techniques would have resolved the difficulty.