Scientists Perturbed by Loss of Stat Tool to Sift Research Fudge from Fact (scientificamerican.com)
79 points by jonbaer on April 17, 2015 | 39 comments



Here's the editorial by David Trafimow and Michael Marks explaining the new policy for their journal "Basic and Applied Social Psychology": http://www.tandfonline.com/doi/pdf/10.1080/01973533.2015.101...

And here's their concluding paragraph explaining their rationale and objective:

  We conclude with one last thought. Some might view
  the NHSTP[1] ban as indicating that it will be easier to
  publish in BASP, or that less rigorous manuscripts will
  be acceptable. This is not so. On the contrary, we   
  believe that the p < .05 bar is too easy to pass and sometimes
  serves as an excuse for lower quality research. We hope
  and anticipate that banning the NHSTP will have the
  effect of increasing the quality of submitted manuscripts
  by liberating authors from the stultified structure of
  NHSTP thinking thereby eliminating an important
  obstacle to creative thinking. The NHSTP has dominated
  psychology for decades; we hope that by instituting
  the first NHSTP ban, we demonstrate that
  psychology does not need the crutch of the NHSTP,
  and that other journals follow suit.
[1] NHSTP = null hypothesis significance testing procedure


And what should be used as an alternative? There is no reason to believe that banning the p-value will result in improved research quality. Desperate researchers will just find another technique to game.


My thinking is that there should be no alternative to the NHSTP that functions as a crutch; if something else were to arise in an attempt to take its place, it would not gain as much prominence among researchers. Are there any other scalar quantities in statistics that can function the same way p-values do? That is:

1. Can the general public see it as jargon? The general public sees 'p-value' and thinks, 'oh, those scientists probably know what that really means, just tell me whether that number should be big or small to be meaningful.'

2. Can a headline be made of it? (Hypothesis X shown to be 'statistically significant' because a journalist saw the p-value result.)

3. If a colleague does not understand it, can you make fun of them for not understanding such a 'simple' concept?


Instead of blindly applying a tool you don't understand, you'll need to build a statistical model, explain why it's valid, and then construct a meaningful measurement based on it.

Then the referee and reader will be required to understand it and will have the ability to critique it.
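
To make that concrete, here is a minimal sketch of what such a "meaningful measurement" could look like: an effect estimate with an uncertainty interval from an explicit model, rather than a bare p-value. The normal model, group sizes, and effect are invented purely for illustration.

  import numpy as np

  rng = np.random.default_rng(0)
  treatment = rng.normal(loc=1.0, scale=2.0, size=40)  # hypothetical data
  control = rng.normal(loc=0.0, scale=2.0, size=40)

  # Model: both groups normally distributed; the quantity of interest is the
  # difference in means, reported with a normal-approximation 95% interval.
  diff = treatment.mean() - control.mean()
  se = np.sqrt(treatment.var(ddof=1) / 40 + control.var(ddof=1) / 40)
  print(f"estimated effect: {diff:.2f}, "
        f"95% interval: ({diff - 1.96*se:.2f}, {diff + 1.96*se:.2f})")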


Which is good, as it will discourage people from submitting there, which means more potential papers for those of us who got on this train a while ago.


Yup. There is no way to resolve it. Everything besides numbers can be used to argue against them, because all that numbers can (optimistically) measure is a very precise and pedantic simplification of a carefully constructed argument.

In the case of the hard sciences and engineering, it's direct sensory data that feeds the argument and maintains the rigor. Every person's eyes, more or less, function the same, lest we descend into philosophizing about language, continuity of meaning, and perception.

Science will eventually have to come to terms with the fact that interpretation and truth are far more variable than the ideals of science reach for. A lot of the social sciences have accepted this, and even teach it. But the expectation is then that people should be more intelligent, hard-working, and generally more competent than the norm demands. Unfortunately, this is part of what creates the problem - the ability to bring a pattern into existence by observing it before it actually exists.

I don't like the idea of survival of the fittest in research and ideas, though. I think it starts with simple questions like 'what are we looking to prove/demonstrate/etc.?' and 'why/how?', processed concurrently with ethical and moral introspection and discussion.

But the big point of science half the time is that no one knows what they are doing. The prestige and authority surrounding academia doesn't really place as much emphasis on this as it should.



I think statistics is one of the most misunderstood fields of mathematics relative to how much the average person believes he or she knows about it. I know this article was about researchers and not the general public, but I can definitely sympathize with them; I have an MSc in engineering physics and I still have to think three times about a number before I know that all the assumptions I made while calculating it were correct and unbiased, and three more times about what that number means and what conclusions I can actually draw from it.


Absolutely. If I were in charge of curriculum development for engineers, I'd swap out a couple of semesters of math that the student will never see again unless they end up working on EM simulators, and replace it with statistics course(s) that will be helpful throughout their careers.

Unfortunately there's a lot of coursework in engineering that amounts to institutionalized hazing. The professor had to do it, so by the hallowed beard of Frobenius, the students have to do it too...


They combined the probability and statistics courses in my undergrad engineering program. When I went through, they were separate, and I found the statistics course excellent. Unfortunately, it had already been made less valuable before the merger because other components of the course had been dropped. They combined the courses so they could spend more time on the introductory math courses.

Before, there was a grade 13 that provided a better math foundation for university, but since that was eliminated, they needed to spend more time on the math because students were consistently doing poorly. Things have improved in the math courses since the change, though.

The other reason they combined the courses was that ours was the only engineering department that had separate courses for them. But we were also the only engineering department with machine learning and pattern recognition courses. Now those courses need to spend more time at the beginning covering the material lost in the merger.


At one top British university, the first year Physics practicals are universally reviled, everyone gets within a few marks of each other, the experiments are trivial (roll a ball down a slope!), and at every meeting the academics all agree that they'd prefer that they were dropped. Unfortunately there's some kind of government requirement for practical work, so they stay.


>the experiments are trivial (roll a ball down a slope!)

They should replace them with the trivial experiments of the 21st century - photon counting for Bell inequality tests.

Wrt. the original article - good riddance, one less orthodoxy in science. It isn't really about the tool itself - p-values in this case - it's about orthodoxy, which is the main enemy of science.


Oh exactly, it sounds like there are ways of getting around the government mandate.


Do they not take the chance to teach error propagation and some related statistics in this course, as well as the basics of scientific writing, and to encourage students to invest in learning LaTeX already?


Oh, absolutely! However, you don't need 32 hours in a lab and 4000 words of formal reports and 100-odd pages of handwritten notes in order to prove that.


Yeah point taken. However, I suspect that the lab sessions are tedious enough to help put a lot of students off studying physics (worked for me) and this may be part of the reason that they are so boring...


I find this whole backlash against p-values pretty confusing. That is probably because I come from particle physics, where we also use a lot of statistics, but in subtly different ways.

Hypothesis testing is not too hard [1]. You pick a cutoff, say p < 0.003 ("3 sigma"), and then if your p-value is below that, you call it evidence for your signal - otherwise you simply don't have evidence. Under this rule, the probability of getting data this signal-like or more extreme, assuming there is no signal, is 0.3%. In other words, if you follow this prescription while looking for something that isn't there, in 0.3% of cases you will (wrongly) claim evidence (an error of the first kind).
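
As a sanity check on that 0.3% figure, here is a toy simulation (mine, with everything about the experiment idealized away): under the null, the one-sided p-value of a Gaussian test statistic is uniform, so a p < 0.003 cut wrongly claims evidence in about 0.3% of experiments.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  z = rng.standard_normal(1_000_000)  # test statistic ("number of sigmas") with no signal present
  p = stats.norm.sf(z)                # one-sided p-values
  print(f"fraction of null experiments claiming evidence: {np.mean(p < 0.003):.3%}")
  # prints roughly 0.300% - the error of the first kind described above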

Since we are a cautious bunch, we actually put the threshold for discovery at 5 sigma - p<0.0000003 - which sometimes gets us ridiculed by statisticians. This hyper-strict standard shouldn't be necessary, but in part it's a hedge against the case where you get your systematic errors wrong (you believe your prediction for the null hypothesis is more accurate than it is - so if you see a slight fluctuation, it seems to be many (wrong) standard deviations away).

One other thing that we have to take into account - and many people forget this - is the look-elsewhere effect. If you perform one search, looking for e.g. a Higgs Boson with a mass of 126 GeV, you expect N events in your experiment if it is not there, and N+X if it is there. You know how N is distributed, and the interpretation is straightforward. However if you perform a scan, looking at 120, 121, 122, 123... GeV, then you have to adjust your p-value, since you are basically performing a bunch of different experiments, and by chance alone some of them are bound to turn up "significant".
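
A toy version of the look-elsewhere effect (assuming, unrealistically, that the scan points are independent): scanning 50 mass points pushes the chance of at least one local 3-sigma fluctuation from 0.3% to roughly 14%.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(2)
  n_toys, n_points = 100_000, 50                     # hypothetical scan over 50 mass points
  z = rng.standard_normal((n_toys, n_points))        # no signal anywhere
  local_p = stats.norm.sf(z)                         # one-sided local p-values
  hit_rate = np.mean((local_p < 0.003).any(axis=1))  # any point past the local cut?
  print(f"chance that some scan point looks 'significant': {hit_rate:.1%}")
  # roughly 1 - (1 - 0.003)**50, about 14%, versus the nominal 0.3%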

The same thing applies when hundreds or thousands of Master and PhD students and postdocs do their analyses - even if no one makes a mistake, some of them will "find" a 3 sigma or larger effect that isn't there, just due to the sheer number of independent statistical tests performed. I've "found" new particles myself this way, but when you keep calm, put it into context by looking at other analyses, and try to add more data, you'll often find that your result melts away.

------------

[1] explaining it is hard, and I will undoubtedly have messed up, especially since I'm tired.


The interesting problem is that if you set a strict significance threshold but have a sample size that is too small, you will still sometimes correctly get significant results, but the effect sizes will all be exaggerated.

If the sample size is too low, the only effect size big enough to be significant may be one that is much larger than the truth. So you only claim significance when you get lucky and exaggerate.

This is actually an enormous problem, and it probably affects physics too. Many medical and biological papers tout huge effects which turn out to be completely wrong -- not because their result was a false positive but because it's an exaggerated true positive. Sample sizes tend to be hilariously inadequate in soft sciences where each data point costs thousands of dollars.

I call this "truth inflation," although I don't know if it's been discussed enough to have a common name. It's heavily discussed in my book, if you're interested in seeing why the backlash against p values is so widespread: http://www.statisticsdonewrong.com/regression.html#truth-inf...
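
The exaggeration is easy to reproduce in a toy simulation (all numbers invented): a modest true effect, a small sample, and a p < 0.05 filter together guarantee that the "significant" estimates overshoot the truth.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(3)
  true_effect, n, n_sims = 0.3, 20, 50_000       # hypothetical underpowered study
  samples = rng.normal(true_effect, 1.0, size=(n_sims, n))
  means = samples.mean(axis=1)
  t = means / (samples.std(axis=1, ddof=1) / np.sqrt(n))
  p = stats.t.sf(t, df=n - 1)                    # one-sided one-sample t-test
  sig = p < 0.05
  print(f"power: {sig.mean():.1%}")
  print(f"true effect: {true_effect}, "
        f"mean estimate among significant results: {means[sig].mean():.2f}")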


> I find this whole backlash against p-values pretty confusing.

Plenty of research (not in particle physics) uses the cut-off of 0.05 instead, misunderstands the p-value as "probability that the result is not real", and also ignores the fact that a p-value of 0.05 can too often be reached by making a variety of choices in experimental setup and data analysis. When the headline calls it a "Tool to Sift Research Fudge from Fact", NHST wasn't doing this job well at all, and wasn't really designed for this job in the first place. There are far, far too many people who seem to think that p<0.05 means something is a "research fact", which is a failure of statistics education as well.
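
To illustrate that last point, a rough simulation (treating the analysis choices as independent, which overstates the effect somewhat): a researcher with five analysis variants to pick from and no real effect "finds" p < 0.05 far more often than 5% of the time.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(4)
  n_studies, n_choices, n = 20_000, 5, 30                # 5 hypothetical analysis variants per study
  data = rng.standard_normal((n_studies, n_choices, n))  # no real effect anywhere
  t = data.mean(axis=2) / (data.std(axis=2, ddof=1) / np.sqrt(n))
  p = 2 * stats.t.sf(np.abs(t), df=n - 1)                # two-sided p-value per variant
  print(f"one pre-specified analysis: {np.mean(p[:, 0] < 0.05):.1%} significant")
  print(f"best of {n_choices} analyses: {np.mean((p < 0.05).any(axis=1)):.1%} significant")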


I studied physics too, and am not a statistician.

My way of interpreting an "absurd" threshold such as 5 sigma is that you can stop using hypothesis testing and start using other tools for interpreting data.

Another interpretation is that we set these huge limits because, deep in our hearts, we aren't really comfortable with hypothesis testing. There's Rutherford's famous quip: If your experiment needs statistics, then you should have done a better experiment.

I think statistics are actually used pretty sparingly and with caution in physics. There are lots of experiments where the presence or absence of a phenomenon is not the primary question. I didn't report a p-value at all in my thesis project.


> which sometimes gets us ridiculed by statisticians

You get laughed at because you have the huge luxury of dissecting huge amounts of data, running the experiment billions of times.

Do you understand that in most disciplines, you don't have that luxury?


To begin with, you'll have to explain to the physicists that there are other disciplines besides physics.


Sample size. Also particles are "the same" and much simpler.

Put a particle in a magnetic field and it will behave the same every time (and you can usually do this 1e3, 1e4, 1e5 times as needed). Give 100 people the same drug (ok, 50 of them get the drug, 50 get the placebo) and see each one have a different reaction.


Actually the whole point of particle physics is that the same thing _doesn't_ happen every time. Quantum field theory allows us to predict probabilities - e.g. of a particle decaying into another particle - for a given model. As for doing something 1e4 or 1e5 times: at the LHC we collided bunches of 10e10 particles continuously every 50 ns for months, and still only produced a few hundred Higgs particles. So these studies of ultra-rare processes also tend to have limited statistical power.


It's still considerably less than the variability in sciences based in biology, though.


Yes, in quantum physics you have a set of possibilities and their probabilities, and with more experiments you'll produce more Higgs particles, for example.


> we actually put the threshold for discovery at 5 sigma - p<0.0000003 - which sometimes gets us ridiculed by statisticians

This says more about the silliness of statisticians than that of particle physicists.


Your explanation worked really well for me. I now understand what it is Bayesians don't like about p-value reliance.


There has been plenty of debate about this in other fields as well. Deirdre McCloskey and Stephen Ziliak have a particularly well-written paper titled "The Cult of Statistical Significance" on this very topic. Their main point is that statistical significance is meaningless without a discussion of magnitudes.

[1] http://www.deirdremccloskey.com/docs/jsm.pdf
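
Their point is easy to demonstrate with a toy example (numbers invented): with a large enough sample, a practically negligible effect still comes out wildly "significant", so the p-value alone says nothing about whether the effect matters.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(6)
  tiny_effect, n = 0.02, 100_000               # negligible effect, huge hypothetical sample
  x = rng.normal(tiny_effect, 1.0, size=n)
  t_stat, p = stats.ttest_1samp(x, popmean=0.0)
  print(f"estimated effect: {x.mean():.3f} (about 2% of a standard deviation)")
  print(f"p-value: {p:.2e} (highly 'significant', practically meaningless)")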


Looks like this may be Prof. Leek's course mentioned at the end of the article: https://www.coursera.org/course/statinference

Also there was previous HN post on p-values, which I found really interesting: https://news.ycombinator.com/item?id=9330076

Also this Nautilus article: http://nautil.us/issue/4/the-unlikely/sciences-significant-s...


>>Several journals are trying a new approach...in which researchers publicly “preregister” all their study analysis plans in advance. This gives them less wiggle room to engage in the sort of unconscious—or even deliberate—p-hacking that happens when researchers change their analyses in midstream to yield results that are more statistically significant than they would be otherwise. In exchange, researchers get priority for publishing the results of these preregistered studies—even if they end up with a p-value that falls short of the normal publishable standard.

It's not exactly the same issue as the one addressed by banning p-values, but this would help a lot.


This is a bad idea. Sure, the p-value test is pretty flawed, but this is like going without antivirus because the one you have has bad detection rates.

The researchers are not the only ones who could game the system. A bigger problem is the editorial staff. Replacing an objective test, however bad, with a nonspecific 'case by case' criterion opens the door to nepotism and political agenda pushing. Psychology is an especially dangerous field for this, with the potential to label entire groups of people with opposing views as mentally ill.

The cynic in me sees this as a power-grab.

What they should have done is specify Bayesianism as the new test, period. None of this case-by-case BS.


Interesting reading:

54% of findings with p < 0.05 are not statistically significant: http://www.dcscience.net/Schuemie-Madigan-2012.pdf

An easy stats paper explaining why: http://www.stats.org.uk/statistical-inference/Lenhard2006.pd...


I think it's easier to explain this in terms of likelihood theory. Likelihood is the probability of the observed data GIVEN A SPECIFIC MODEL. This is NOT to be confused with the probability of a specific model being correct GIVEN THE OBSERVED DATA. It is the latter probability that people really want to know, but confusing it with the former can have catastrophic consequences in fields like medicine, engineering, jurisprudence, finance, and insurance.

The problem is related to the Prosecutor's Fallacy (https://en.wikipedia.org/wiki/Prosecutor%27s_fallacy): "Consider this case: a lottery winner is accused of cheating, based on the improbability of winning. At the trial, the prosecutor calculates the (very small) probability of winning the lottery without cheating and argues that this is the chance of innocence. The logical flaw is that the prosecutor has failed to account for the large number of people who play the lottery."
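
Here is a small numerical version of that lottery example, with all the numbers invented, just to show the two conditional probabilities pointing in opposite directions:

  # P(win | honest) is tiny, but P(honest | win) is what the court needs,
  # and Bayes' rule requires the number of honest players to compute it.
  p_win_given_honest = 1e-7                  # assumed chance an honest player wins
  p_win_given_cheat = 1e-3                   # assumed chance a cheater wins
  n_honest, n_cheaters = 10_000_000, 10      # hypothetical population

  expected_honest_winners = p_win_given_honest * n_honest     # 1.0
  expected_cheating_winners = p_win_given_cheat * n_cheaters  # 0.01
  p_honest_given_win = expected_honest_winners / (expected_honest_winners + expected_cheating_winners)
  print(f"P(win | honest) = {p_win_given_honest:.0e}")
  print(f"P(honest | win) = {p_honest_given_win:.1%}")        # about 99%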

There is a mathematical statistics professor at the University of Toronto named D.A.S. Fraser (http://www.utstat.utoronto.ca/dfraser/) who is an expert in likelihood theory and has commented on this issue: "...statistics does have the answer! The answer is contained in the p-value function from likelihood theory: p(delta). Here delta is the relevant parameter with delta_0 as the null value and delta_1 as the alternative needing detection. Then p(delta_0) is the observed p-value, p(delta_1) is the detection probability, and the rest is judgement: the route to the Higgs boson."


A few obligatory references:

https://xkcd.com/1478/, along with the awesome explain-xkcd: http://www.explainxkcd.com/wiki/index.php/1478:_P-Values

https://xkcd.com/892/

And my favorite:

https://xkcd.com/882/


The video that made it all come together for me was this "Dance of the P-Values" video:

https://www.youtube.com/watch?v=5OL1RqHrZQ8

It does a great job of showing how the same experiment can yield vastly different p-values, and why the p-value is poorly suited to the task it's been given.
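
If you'd rather not watch the whole thing, here is a rough sketch of the same demonstration (effect size, sample size, and seed are arbitrary): the identical experiment, repeated with fresh samples, dances all over the p-value scale.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(5)
  true_effect, n = 0.5, 30                   # hypothetical modestly powered study
  for i in range(10):
      x = rng.normal(true_effect, 1.0, size=n)
      t_stat, p = stats.ttest_1samp(x, popmean=0.0)
      print(f"replication {i + 1:2d}: p = {p:.3f}")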


Their issue is basically with the lack of negative-result reporting, which is what makes p-values useless; it seems very odd to vilify a valuable tool when banning it is never going to solve the real problem: people generally re-run experiments until they stumble upon a publishable metric.


Despite the present-day predominance of the life sciences (including psychology), I wish articles would not describe the p-value crisis as a crisis of "science". Physical scientists rarely use p-values. There are lots of other ways to establish the robustness of a result; most importantly, not relying on just one tool.


A very well written article.



