On point 7 (Testing an unclear hypothesis): while I agree with the overall point, I strongly disagree with the examples.
> Bad Hypothesis: Changing the color of the "Proceed to checkout" button will increase purchases.
This is succinct and clear, and it is obvious what the variable/measure will be.
> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page. This will then lead to more purchases.
> User research showed that users are unsure of how to proceed to the checkout page.
Not a hypothesis, but a problem statement. Cut the fluff.
> Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page.
* Turns out, folks are seeing the "buy" button just fine. They just aren't smitten with the product. Making "buy" more attention-grabbing gets them to the decision point sooner, so they close the window.
* Turns out, folks see the "buy" button. Many don't understand why they would want it. Some of those are converted after noticing and reading an explanatory blurb in the lower right. A more prominent "buy" button distracts from that, leading to more "no".
* For some reason, a flashing puke-green "buy" button is less noticeable, as evidenced by users closing the window at a much higher rate.
Including untestable reasoning in a chain of hypotheses leads to false confirmation of your clever hunches.
The biggest issue with those three hypotheses is that one of them, noticing the button, almost certainly isn't being tested. But how the test goes will inform how people think about that hypothesis.
That doesn't test noticing the button; it tests clicking the button. If the color changes, it is possible that fewer people notice it but those who do are more likely to click, in a way that increases total traffic. Or more people notice it but are less likely to click, in a way that reduces traffic.
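A rough arithmetic sketch of that decomposition; the notice and click-through rates below are made-up numbers, purely to show how the two effects can trade off:

```python
# Made-up funnel numbers: checkout clicks decompose into
# "notice the button" and "click, given that it was noticed".
variants = {
    # variant: (notice_rate, click_rate_given_noticed) -- hypothetical
    "A (old colour)": (0.40, 0.10),   # more people notice, fewer of them click
    "B (new colour)": (0.30, 0.15),   # fewer people notice, more of them click
}

for name, (notice, click_given_notice) in variants.items():
    checkout_rate = notice * click_given_notice
    print(f"{name}: {checkout_rate:.3f} of visitors proceed to checkout")

# A: 0.040 vs B: 0.045 -- B "wins" the test even though noticing went
# down, so the test result alone says nothing about the "noticing" step.
```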
This is what I was driving at in my original comment: the intermediary steps are not of interest (from the POV of the hypothesis/overall experiment), so why mention them at all?
It is surely helpful to have a "mechanism of action" so that you're not just blindly A/B testing and falling victim to coincidences like in https://xkcd.com/882/.
Not sure if people do this, but with a mechanism of action in place you can state a prior belief and turn your A/B testing results into actual posteriors instead of frequentist metrics like p-values, which are kind of useless.
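As a hedged illustration of what that could look like: a minimal Beta-Binomial sketch, where the prior, the conversion counts, and the variant labels are all invented for the example.

```python
# Minimal Beta-Binomial sketch: turn A/B counts into posteriors.
# The prior and all counts below are invented for illustration.
import random

# Prior belief about the conversion rate, encoded as Beta(alpha, beta).
# Mean 4/(4+96) = 4%, loosely motivated by the assumed mechanism of action.
prior_alpha, prior_beta = 4, 96

# Hypothetical observed results: conversions out of visitors.
a_conv, a_n = 190, 5000   # control
b_conv, b_n = 230, 5000   # new button colour

# Conjugate update: the posterior is also a Beta distribution.
post_a = (prior_alpha + a_conv, prior_beta + a_n - a_conv)
post_b = (prior_alpha + b_conv, prior_beta + b_n - b_conv)

# Monte Carlo estimate of P(conversion rate of B > conversion rate of A).
samples = 100_000
wins = sum(
    random.betavariate(*post_b) > random.betavariate(*post_a)
    for _ in range(samples)
)
print(f"Estimated P(B beats A): {wins / samples:.3f}")
```

The point is only that a stated prior plus the observed counts gives you P(B beats A) directly, rather than a p-value.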
That xkcd comic highlights the problem with observational (as opposed to controlled) studies. TFA is about A/B testing, i.e. controlled studies. It's the fact that you (the investigator) are controlling the treatment assignment that allows you to draw causal conclusions. What you happen to believe about the mechanism of action doesn't matter, at least as far as the outcome of this particular experiment is concerned. Of course, your conjectured mechanism of action is likely to matter for what you decide to investigate next.
Also, frequentism / Bayesianism is orthogonal to causal / correlational interpretations.
I think what kevinwang is getting at is that if you A/B test a static version A against enough versions of B, at some point you will get statistically significant results purely by chance.
Having a control doesn't mean you can't fall victim to this.
A/B tests are still vulnerable to p-hacking-esque things (though usually unintentional). Run enough of them and a p-value is going to come up significant by chance sometimes.
Observational studies are particularly prone because you can slice and dice the world into near-infinite observation combinations, but people often do that with A/B tests too: a shotgun approach, testing a bunch of variants until something works. But if you'd run each of those tests at different significance levels, or for twice as long, or half as long, you could very well see the "working" one fail and a "failing" one work.
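A small simulation of that failure mode, assuming a plain two-proportion z-test and no real difference between A and B (all numbers are made up):

```python
# Run many A/B tests in which B is identical to A (no real effect)
# and count how often a two-sided two-proportion z-test comes out
# "significant". All numbers are made up.
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
true_rate, n, tests = 0.04, 5000, 100
false_positives = 0
for _ in range(tests):
    conv_a = sum(random.random() < true_rate for _ in range(n))
    conv_b = sum(random.random() < true_rate for _ in range(n))
    if p_value(conv_a, n, conv_b, n) < 0.05:
        false_positives += 1

print(f"{false_positives} of {tests} no-effect tests came out 'significant'")
# Roughly 5 out of 100 is expected at the 0.05 threshold.
```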
I don't think these examples are bad. From a clarity standpoint, where you have multiple people looking at your experiments, the first one is quite bad and the second one is much more informative.
Requiring a user problem, proposed solution, and expected outcome for any test is also good discipline.
Maybe it's just getting into pedantry over the word "hypothesis", and you would expect the other information elsewhere in the test plan?
If you have done that properly, why A/B test? If you did it improperly, why bother?
A/B testing starts from a hypothesis, because A/B testing is done to inform a Bayesian analysis to identify causes.
If one already knows that the reason is 'button not visible enough', A/B testing is almost pointless.
Not entirely pointless, because you can still do A/B testing to validate that the change is in the right direction, but investing developer time in production-quality code and risking the business just to validate something one already knows seems crazy compared to just asking a focus group.
When you are unsure about the answer, that's when investing in A/B testing for discovery makes the most sense.
Except you can never be certain that the changes made had an impact in the direction you're hoping for unless you measure it. Otherwise it's just wishful thinking.
I didn't say anything to the contrary; the quotation loses all the context.
But if you want to verify hypotheses and control for confounding factors, the A/B test needs to be part of a Bayesian analysis, and if you're doing that, why also pay for the prior research?
By going down the path of user research > production-quality release > validation of the hypothesis, you are basically paying for research twice and paying for development once, regardless of whether the testing is successful or not.
It's more efficient to either use Bayesian hypotheses + A/B testing for research (pay for development once per hypothesis, collect evidence, and steer in the direction the evidence points), or use user research over a set of POCs (pay for research once per hypothesis, develop in the direction the research points).
If your research needs validation, you paid for research you might not need. If you start research already knowing the prior (the user doesn't see the button), you're not actually doing research, you're just gold-plating a hunch; in that case, why pay for research at all, just skip to the testing phase. If you want to learn from the users, you do A/B testing, but again, not against a hunch, but against a set of hypotheses, so you can eliminate confounding factors and narrow down the confidence interval.
Having a clearly stated hypothesis and supplying appropriate context separately isn't pedantry. It is semantics, but words result in actions that matter.
As kevinwang has pointed out in slightly different terms: the hypothesis that seems woolly to you seems sharply pointed to others (and vice versa), because explanationless hypotheses ("changing the colour of the button will help") are easily variable (as is the colour of the xkcd jelly beans), while hypotheses that are tied strongly to an explanation are not. You can test an explanationless hypothesis, but that doesn't get you very far, at least in understanding.
As usual here I'm channeling David Deutsch's language and ideas on this, I think mostly from The Beginning of Infinity, which he delightfully and memorably explains using a different context here: https://vid.puffyan.us/watch?v=folTvNDL08A (the yt link if you're impatient: https://youtu.be/watch?v=folTvNDL08A - the part I'm talking about starts at about 9:36, but it's a very tight talk and you should start from the beginning).
Incidentally, TED-head Chris Anderson said that one of these Deutsch TED talks (not sure if this or the earlier one) was his all-time favourite.
plagiarist:
> That doesn't test noticing the button; it tests clicking the button. If the color changes, it is possible that fewer people notice it but those who do are more likely to click, in a way that increases total traffic.
"Critical rationalists" would first of all say: it does test noticing the button, but tests are a shot at refuting the theory, here by showing no effect. But also, and less commonly understood: even if there is no change in your A/B - an apparently successful refutation of the "people will click more because they'll notice the colour" theory - experimental tests are also fallible, just as everything else.
Will watch the TED talk, thanks for sharing. I come at this from a medical/epidemiological background prior to building software, and no doubt this shapes my view on the language we use around experimentation, so it is interesting to hear different reasoning.
Good to see an open mind! I think most critical rationalists would say that epidemiology is a den of weakly explanatory theories.
Even though I agree, I'm not sure that's 100% epidemiology's fault by any means: it's just a very difficult subject, at least without measurement technology, computational power, and probably (machine or human) learning and theory-building that even now we don't have. But, there must be opportunities here for people making better theories.
> Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page.
This is now two hypotheses.
> This will then lead to more purchases.
Sorry, I meant three hypotheses.