Summary: if you run an experiment where you try to rush users to convert, and you only run it for a short time, it will look great even though it might be a net loss overall, because you're capturing a larger proportion of conversions within the experiment window.
You can also run into this sort of problem with user learning effects, where initially a large change in the UI can give a large change in behavior due to novelty, but then it wears off over time. Running experiments longer helps a lot in both cases.
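To make that concrete, here's a minimal sketch with entirely made-up numbers: a variant that rushes users (fewer conversions overall, but a much shorter lag) beats the control at a one-week readout and loses at a two-month one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # visitors per arm (made-up traffic)

# Hypothetical ground truth: control converts 10% of users eventually, spread
# over ~10 days on average; the "urgency" variant converts only 9% eventually,
# but almost all of them within a day or two.
control_converts = rng.random(n) < 0.10
variant_converts = rng.random(n) < 0.09
control_lag = rng.exponential(scale=10.0, size=n)  # days from visit to purchase
variant_lag = rng.exponential(scale=1.0, size=n)

for horizon in (7, 60):
    c = np.sum(control_converts & (control_lag <= horizon))
    v = np.sum(variant_converts & (variant_lag <= horizon))
    print(f"by day {horizon:>2}: control={c}, variant={v}")
# By day 7 the variant looks like a clear win; by day 60 it is a net loss.
```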
This is a good summary because the math is hugely distracting from the basic realities.
You need basic intuition for what might be happening; the math is just a formality and, frankly, unnecessary beyond a very, very simple calculation.
You have to actually 'think' about behaviour a bit if you want to get it right; that's the hard part.
If you have something reasonable, the conversions/control numbers can be worked out into a probability of success very quickly, and often just looking at them will give you a good idea of whether it worked.
The maths is a shiny lure for technical people, it gets us all excited as though there is some kind of truth behind it.
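For what it's worth, the "very, very simple calculation" can be as small as this: a Monte Carlo estimate of the chance that the variant's true rate beats the control's, assuming flat Beta(1, 1) priors and invented counts.

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=200_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under flat Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    p_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    p_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    return (p_b > p_a).mean()

# Invented counts: 120/5000 conversions in control, 145/5000 in the variant.
print(prob_b_beats_a(120, 5000, 145, 5000))  # about 0.94
```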
Rather, these are simulated data for a fictitious company. The author is demonstrating a scenario in which a purely frequentist approach to A/B testing can result in erroneous conclusions, whereas a Bayesian approach will avoid that error. The broad conclusions are (as noted explicitly at the end of the article):
- The data generating process should dictate the analysis technique(s)
- Lagged response variables require special handling
- Stan propaganda ;) but also :(
It would be cool to understand the weaknesses, or risks of erroneous conclusions, of the Bayesian approach in this or similar scenarios. In other words, is it truly a risk-free trade-off to switch from a frequentist technique to a Bayesian technique, or are we simply swapping one set of risks for another?
tl;dr
The author's point is not to make a general claim about the aggressiveness of CTAs.
While I am generally in favor of applying Bayesian approaches, that's overkill for this problem. In their (fictitious) example, the key problem is that they ran their test for too short a time. They already know that the typical lag from visit to conversion on their site is longer than a week, which means that if they want to learn the effect on conversions a week isn't enough data.
While it is possible to make some progress on this issue with careful math, simply running the test longer is a far more effective and robust approach.
> - The data generating process should dictate the analysis technique(s)
And to expand on this, the data generating process is not about a statistical distribution or any other theoretical construct. Only in the frequentist world do you start with assuming a generating process (for the null hypothesis, specifically).
The data generating process in this case is living, breathing humans doing the things humans do.
The data generating process is the random assignment of people to experiment groups.
The potential outcomes are fixed: if a person is assigned to one group the outcome is x1; if another, x2. No assumption is made about these potential outcomes. They are not considered random, unless the Population Average Treatment Effect is being estimated. And even in that case, no distribution is assumed. It certainly is not Gaussian for example.
Under random assignment, the observed treatment effect is unbiased for the Sample Average Treatment Effect. So again, the data generating process of interest to the analyst is random assignment.
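A tiny simulation of that framing (the potential outcomes are drawn once here just to have numbers, then held fixed; the only randomness across repetitions is the assignment):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Fixed potential outcomes per person: y0 if assigned to control, y1 if assigned
# to treatment. Generated once for illustration, then treated as fixed.
y0 = rng.random(n) < 0.10
y1 = rng.random(n) < 0.12
sate = y1.mean() - y0.mean()  # Sample Average Treatment Effect (a fixed number)

# The only randomness of interest to the analyst: which arm each person gets.
estimates = []
for _ in range(5_000):
    assign = rng.random(n) < 0.5                      # random assignment
    est = y1[assign].mean() - y0[~assign].mean()      # observed difference in means
    estimates.append(est)

print(f"SATE = {sate:.4f}, average of estimates = {np.mean(estimates):.4f}")
# The two numbers agree closely: the observed treatment effect is unbiased for the SATE.
```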
Assuming you're able to actually achieve truly random participation in the various arms you're trialing, you're right.
And it's my fault for not thinking of that as a possibility. Colour me jaded after experiencing very many bad attempts at randomization that actually suffer from Simpson's paradox in various ways!
This could make the entire org/company run and innovate much slower. Ideally you can build better models that predict long term conversion from short term data. These models can be refined with long term experiments.
The current state of browser tracking prevention also means that you’re unlikely to identify conversions from the same user who saw your experiment after a week, or sometimes even 24 hours.
The popups are a result of tracking, not GDPR. Websites without tracking don't need to have them.
It's somewhat amusing that the overlap of garbage content farms and sites with annoying consent popups is almost perfect. I wonder if it could be used for search engine ranking.
I've dealt with this enough that at this point I'm convinced all companies that do this fail to see the users through the metrics. A/B testing is overvalued.
We stopped doing A/B tests after I insisted that they all be done as A/A/B tests. Suddenly the "clear winners" weren't so clear after all. It confused and frustrated the marketing department so much that it was decided just to stop doing them altogether.
The reason I wanted this type of test was because it was a waste of time testing shades of blue or two headlines that only differed by 2 words. The test variants were never radical enough to see any kind of significant uplift. Then after 5-10 tests the design starts to suffer by wandering down some weird path that nobody would consciously design from the outset. But the series of test "winners" made things go off in wild directions.
I still think there is some value in A/B testing (A/A/B only, if I'm honest). But in a small team, it's a waste of time.
For an A/A/B test are you taking three samples (instead of two), and two of the three get shown the same thing (A)? Then you only consider the results for the B group if the two A groups show the same behavior?
Not the person you're responding to, but yes, that's the idea. It's a control not for the B but for the unknown unknowns that may or may not be there.
If A' and B both statistically differ from A, then you have a problem because you're not testing what you think you are testing, regardless of what your naive A/B test's p-value would have indicated.
You take three samples and two of three get shown the same thing. What happens here is both A groups will show different results until your sample grows enough that the CI window becomes small.
This helps to show the effect of a low sample from a non-uniform distribution.
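A quick illustration of that, with a made-up 5% conversion rate shared by all three arms:

```python
import numpy as np

rng = np.random.default_rng(2)
true_rate = 0.05  # identical underlying conversion rate for A, A', and B here

for n in (500, 5_000, 50_000):
    a, a2, b = (rng.binomial(n, true_rate) / n for _ in range(3))
    print(f"n={n:>6}: A={a:.2%}  A'={a2:.2%}  B={b:.2%}")
# At n=500 the two identical A arms can differ from each other by about as much
# as either differs from B; only at large n do all three settle near 5%.
```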
A lot of people (me included) think they know statistics, but they don't.
The blogpost in OP also tries to explain the same thing - you shouldn't do statistics without understanding what it is you're doing.
Don't you then need to run the experiment for a long, long time to reach significance? Plus the site needs enough users viewing the page to run it in any reasonable amount of time. I would think most sites don't get enough traffic to run A/A, let alone A/B?
My experience is similar. Even if and when the metrics are calculated properly, there's often some design or business reason put forth as an excuse to ignore them.
Before any effort goes into something like this I always raise my hand and ask, "If the data shows us something we don't want to see, will we change our strategy? If not, I'd rather put time/effort into other projects." It works about 80% of the time.
Metrics are only useful if the organization is actually willing to learn lessons from them.
Oh yeah, defining the thresholds and their associated courses of action ahead of time is important to make a good decision.
Any time someone wants to measure something, the first question should be: what lower and upper bounds does this value have to cross for us to do something different?
Very often, it turns out these thresholds for change are so astronomical that nobody thinks we have even the slightest chance of exceeding them. That means the measurement is completely useless. Whatever result we plausibly get, it won't change anything.
Not to mention the decision paralysis and change aversion it often introduces into company culture, where every change, however trivial or however obviously beneficial, has to first go through a two-week A/B test that often turns out to be inconclusive anyway, and that sometimes takes more engineering resources to set up and run than the change itself.
Previous testing should give the company at least some baseline understanding of what is trivial and what isn't. The correct way to experiment is certainly not "let's experiment on every idea!"
> however obviously beneficial
If you've been around long enough, you've almost certainly run into dozens of "obviously beneficial" changes that led to poorer performance.
Most of what you're describing is issues with poor prioritization, a lack of understanding about your audience, and a culture that has a difficult time making decisions.
I've had a similar experience. Some companies will do things like A/B test fonts and button colors, yet ignore bigger things like content. It's absurd.
This whole example seems like it boils down to a poor test/analysis plan more than anything that truly speaks to the value of Bayesian approaches:
1) It's almost always a bad idea to decide a test based on one-week's worth of data, regardless of what statistical approach you take
2) There's not really any info on why Fisher's exact test is used. It seems like most A/B software has adopted Bayesian methods, but the ones that haven't, I believe, use Student's t-test and require up-front sample sizing (a minimal example of applying Fisher's exact test is sketched after this list)
3) The conversion delay issue was not addressed in the measurement plan. There are clear ways to address this issue, both tactically and mathematically. From a tactical standpoint, on most testing platforms you'd be able to change the test allocation to 0%, which would allow previously bucketed users to continue to be measured on subsequent visits while not letting any new users in. You could also just run the test long enough that the conversion lag no longer has a major impact on results (this may or may not be possible, depending on how long and fat the lag tail is).
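For reference, applying Fisher's exact test to a one-week 2x2 conversion table is a one-liner with scipy (the counts here are invented):

```python
from scipy.stats import fisher_exact

# Invented one-week counts: [converted, not converted] per arm.
table = [[180, 4820],   # variant
         [140, 4860]]   # control
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)
```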
The author discusses "pull forward", where the real impact of a change is to make people purchase earlier, but we don't necessarily observe incremental purchases. This isn't necessarily bad; I'd rather have a dollar today than a dollar next week.
This can be quantified by plotting the incremental conversions observed by day x. We might see a big initial lift that degrades over time. If it eventually degrades to zero, there are no truly incremental conversions, just pull-forward. But if we end up pulling forward a meaningful number of purchases by a month or more, that can be valuable to the business!
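A minimal sketch of that curve, assuming you have per-user lags from first exposure to purchase (the function name and inputs are invented for illustration):

```python
import numpy as np

def incremental_by_day(variant_lags, control_lags, max_day=60):
    """Cumulative (variant - control) conversions observed by each day.
    Lags are days from first exposure to purchase; np.inf = never converted."""
    days = np.arange(1, max_day + 1)
    variant_cum = np.array([(variant_lags <= d).sum() for d in days])
    control_cum = np.array([(control_lags <= d).sum() for d in days])
    return days, variant_cum - control_cum

# If the curve rises early and then decays back toward zero, the lift was pure
# pull-forward; if it plateaus above zero, some conversions are truly incremental.
```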
I wouldn't immediately jump to a complicated mathematical model to handle this situation, I would consider the business implications first and foremost.
I also urge anyone considering Bayesian methods for A/B testing to read up on the likelihood principle vs the strong repeated sampling principle (I documented my thoughts here [0]). Bayesian methods always satisfy the likelihood principle; frequentist methods always satisfy repeated sampling. In many situations both methods satisfy both principles, and then the two approaches will give similar answers. But based on many years doing A/B testing, I wouldn't give up repeated sampling lightly. Bayesian and frequentist methods are not blindly interchangeable.
On the other hand, if repeated sampling is not important in your use case, then by all means prefer the Bayesian approach! I just want people to consider the trade offs.
>the new website version implemented urgency features that gave the users the impression that the product they were considering for purchase would soon be unavailable or would drastically increase in price. This lead to the fact that some users were annoyed by this alarmist messaging and design, and now didn't convert anymore even though they might have under the old version
This basic principle has far broader implications than website design and A/B testing. Managers at large corporations have learned to pull all sorts of levers to optimize the short term value of some metric (typically the one upon which their compensation depends) often in direct opposition to the long term interests of the corporation (and even the value of that metric beyond the next few quarters).
This article reminded me of an experience I had as a developer for an online retailer. The product browse team, the one responsible for showing lists of products from searches, categories, brands, etc., had a slew of A/B tests to measure product detail page viewing conversion rate. One of these tests included the bright idea of removing the product name from the individual product cards. Trouble is, the names often contained differentiating descriptors. While dogfooding the app, I was thoroughly frustrated with the need to constantly go back and forth between browse and detail pages. When I spoke to the browse team about it, they were patting themselves on the back over how amazing their detail page viewing conversion rates were. It made me a skeptic of A/B tests ever since.
A/B tests work fine if the signal you are measuring is strong. This is not the case here.
Is it even fine to use the distribution assumptions in the later analysis?
Looks like these assumptions, combined with a higher conversion rate on day 2 for the control, are the main reason for the surprising result (the control's conversions are obviously more spread out over time).
> A/B tests work fine if the signal you are measuring is strong. This is not the case here.
The (fictitious) signal they are discussing here is very strong. Scroll down to the figure labeled "posterior distribution of p" and you can see that the two distributions barely overlap.
Yes, I saw the figure and that's why I commented that the day 2 conversions for the control are basically giving all of the information in the assumed model.
To me it just looks like a whole new batch of assumptions. Might be fictitiously valid or not.
Well, this is where it comes down to understanding statistics. Yes, that subject we all hated in college and could not wait to pass and forget about. I think I can say that most A/B testing is, to be kind, statistically flawed. At the same time, arguably only some of the largest websites have enough traffic to do it right (whatever that means).
And then there's the big question: how much business did you lose in the process of arriving at what seems like an optimal solution (which might just be a local peak rather than a global optimum)?
That said, what's the alternative? To optimize, or not, that is the question.
A/B testing can be useful at times but it's largely overrated because you're only discovering the best design out of the ones you test. That means there could be a far better design that you failed to include in the experiment.
Just because one design converts more than the other doesn't mean it's the design with optimal UX. I've seen many tests where the designs included already had faulty UX. This is why it's better to have a trained UX designer on your team who can fix basic flaws and present the best version of various designs for testing.
I agree, but that's not the whole story. A/B testing isn't just for identifying the best option, it's for figuring out how much opportunity there is in a particular facet of your business. If you try a few reasonable designs, and some designs have much better performance than others, then it makes sense to continue investing time to further improve. You can take what you've learned and try to come up with even better designs.
On the other hand, if the variants mostly perform the same, why spend more time on it? Go focus elsewhere.
It certainly is a logical possibility that the next design you try will be much more impactful, but after trying several variants unsuccessfully your time is probably better spent elsewhere.
That's my point with having a trained UX designer who knows what they're doing. They can improve a design much better than the average developer or designer because they have more training. An average designer or developer is going to hit a ceiling and be spinning their wheels creating variants with minimal improvement. This is why a/b testing is overrated. In other words, too much emphasis placed on testing, testing, testing and not enough on UX design.
I'm not entirely sure what the point here is. You're calling A/B testing overrated because it doesn't involve UX designers? I'd agree that having some UX resources in test design is critical, so if that's being done up to standard, is testing still overrated?
Correlation does not imply causation. Why do people believe A/B-test-based decisions actually improve conversion rates in the long run? These tests could be eroding foundations like usability and slowly pushing your followers to other sites.
I'm not sure why you are being downvoted. I completely agree with you: it sucks to suck at maths, because I believe this article contains very interesting information that could be very useful to me, and I simply cannot understand it because I suck at maths.
Just skip over the meaty statistics in the middle of the article. The conclusion is that one method of analysis may show positive results even when another, more appropriate method would show negative results.
Anti-flicker snippet is definitely the first step. Since this is a SPA and the flicker may be caused well past the point of the initial pageview, there may also be an issue with how the code is written and hooked into the SPA framework, in this case React.