
Summary: if you run an experiment where you try to rush users to convert, and you only run the experiment for a short time, it will look great even though it might be lossy overall, because you're capturing a larger proportion of conversions in the experiment group.

You can also run into this sort of problem with user learning effects, where initially a large change in the UI can give a large change in behavior due to novelty, but then it wears off over time. Running experiments longer helps a lot in both cases.
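Here's a toy sketch of the rush-to-convert effect (the numbers are made up, not from the article): a "pushy" variant that converts fewer people overall but converts them faster looks like a winner in a one-week window and a loser over two months.

    import random

    random.seed(0)

    def simulate(n, p_convert, mean_lag_days):
        """Per-user lag from visit to conversion, or None if the user never converts."""
        return [random.expovariate(1.0 / mean_lag_days) if random.random() < p_convert else None
                for _ in range(n)]

    def observed_rate(lags, window_days):
        """Conversion rate as seen by an experiment that only runs for window_days."""
        return sum(lag is not None and lag <= window_days for lag in lags) / len(lags)

    n = 100_000
    control = simulate(n, p_convert=0.10, mean_lag_days=10)  # converts more people, slowly
    variant = simulate(n, p_convert=0.08, mean_lag_days=2)   # "rushed": fewer people, faster

    for window in (7, 60):
        print(f"{window:>2}-day window: control={observed_rate(control, window):.3f}  "
              f"variant={observed_rate(variant, window):.3f}")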




This is a good summary because the math is hugely distracting from the basic realities.

You need to have basic intuition for what might be happening; the math is just a formality and, frankly, is unnecessary beyond a very, very simple calculation.

You have to actually 'think' about behaviour a bit if you want to get it right, that's the hard part.

If you have something reasonable, then the conversions/control numbers can be worked out into a probability of success very quickly, and even without that, just looking at them will give you a good idea of whether it worked or not.

The maths is a shiny lure for technical people; it gets us all excited as though there is some kind of truth behind it.


Your summary is incorrect.

Rather, these are simulated data for a fictitious company. The author is demonstrating a scenario in which a purely frequentist approach to A/B testing can result in erroneous conclusions, whereas a Bayesian approach will avoid that error. The broad conclusions are (as noted explicitly at the end of the article):

- The data generating process should dictate the analysis technique(s)

- lagged response variables require special handling

- Stan propaganda ;) but also :(

It would be cool to understand the weaknesses or risks of erroneous conclusions for the Bayesian approach in this or similar scenarios. In other words, is it truly a risk-free trade-off to switch from a frequentist technique to a Bayesian technique, or are we simply swapping one set of risks for another?

tl;dr The author's point is not to make a general claim about the aggressiveness of CTAs.


While I am generally in favor of applying Bayesian approaches, that's overkill for this problem. In their (fictitious) example, the key problem is that they ran their test for too short a time. They already know that the typical lag from visit to conversion on their site is longer than a week, which means that if they want to learn the effect on conversions, a week isn't enough data.

While it is possible to make some progress on this issue with careful math, simply running the test longer is a far more effective and robust approach.
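To put a rough number on "longer" (as a sketch only: I'm assuming an exponential lag with a mean the analyst would already know from historical data, not a figure from the article):

    import math

    mean_lag_days = 10  # assumed mean visit-to-conversion lag
    for target in (0.50, 0.90, 0.95, 0.99):
        # With an exponential lag, the fraction of eventual conversions seen after t days is 1 - exp(-t/mean).
        days = -mean_lag_days * math.log(1 - target)
        print(f"to observe {target:.0%} of eventual conversions, run ~{days:.0f} days")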


I'm no statistician, but don't you have the same problem however long you run it? Giving even more time for slow conversions to amass?

Also, you and the GP are calling the example fictitious, but it seems to be based on 'real traffic logs' via https://dl.acm.org/doi/10.1145/2623330.2623634


We're taking the author at his word:

> "Let us consider the following fictitious example in which Larry the analyst of the internet company Nozama"

Nozama is Amazon backwards.


> - The data generating process should dictate the analysis technique(s)

And to expand on this, the data generating process is not about a statistical distribution or any other theoretical construct. Only in the frequentist world do you start by assuming a generating process (for the null hypothesis, specifically).

The data generating process in this case is living, breathing humans doing things humans do.


The data generating process is the random assignment of people to experiment groups.

The potential outcomes are fixed: if a person is assigned to one group the outcome is x1; if another, x2. No assumption is made about these potential outcomes. They are not considered random, unless the Population Average Treatment Effect is being estimated. And even in that case, no distribution is assumed. It certainly is not Gaussian for example.

Under random assignment, the observed treatment effect is unbiased for the Sample Average Treatment Effect. So again, the data generating process of interest to the analyst is random assignment.
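A minimal sketch of that framing (my own toy numbers, not the article's): the potential outcomes are fixed, the only randomness is the assignment of people to arms, and the difference-in-means averaged over many re-randomizations recovers the Sample Average Treatment Effect.

    import random

    random.seed(1)
    n = 1000
    # Fixed potential outcomes, no distributional assumption: y0 if in control, y1 if treated.
    y0 = [random.random() < 0.10 for _ in range(n)]
    y1 = [y or random.random() < 0.02 for y in y0]
    sate = (sum(y1) - sum(y0)) / n  # Sample Average Treatment Effect (unknowable in practice)

    def one_experiment():
        """The only randomness the analyst relies on: who gets assigned to treatment."""
        idx = list(range(n))
        random.shuffle(idx)
        treated = set(idx[: n // 2])
        t_mean = sum(y1[i] for i in treated) / (n // 2)
        c_mean = sum(y0[i] for i in range(n) if i not in treated) / (n - n // 2)
        return t_mean - c_mean

    estimates = [one_experiment() for _ in range(2000)]
    print(f"SATE = {sate:.4f}")
    print(f"mean difference-in-means over re-randomizations = {sum(estimates) / len(estimates):.4f}")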


Assuming you're able to actually achieve truly random participation in the various arms you're trialing, you're right.

And it's my fault for not thinking of that as a possibility. Colour me jaded after experiencing very many bad attempts at randomization that actually suffer from Simpson's paradox in various ways!


You're absolutely correct, proper A/B testing has many engineering challenges!


Wouldn't it be better to run the experiment longer _and_ discard the data from the initial few weeks?


This could make the entire org/company run and innovate much slower. Ideally you can build better models that predict long term conversion from short term data. These models can be refined with long term experiments.
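As a hypothetical sketch of what such a model could look like (the feature names, data, and model choice are all illustrative assumptions, not from the thread): fit a classifier on past long-running experiments that maps early signals to eventual conversion, then use it to score new short experiments.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Historical users from past long-running tests.
    # Features visible in the first days: [added_to_cart_24h, returned_within_3d, pages_viewed_day1]
    X_hist = np.array([[1, 1, 8], [0, 0, 2], [1, 0, 5], [0, 1, 3], [1, 1, 12], [0, 0, 1]])
    y_hist = np.array([1, 0, 1, 0, 1, 0])  # converted within 60 days

    surrogate = LogisticRegression().fit(X_hist, y_hist)

    # Score users from a new one-week experiment to estimate long-horizon conversion per arm.
    X_new = np.array([[1, 0, 4], [0, 1, 6]])
    print(surrogate.predict_proba(X_new)[:, 1])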


The current state of browser tracking prevention also means that you're unlikely to identify conversions from the same user who saw your experiment after a week, or sometimes even 24 hours.


Yes, browser tracking prevention is one of those things that seems like a good idea at first but likely makes the internet slightly worse overall.

Sites can only optimize for what they can see and we've made it so they can only see short-term engagement.

Another is all the annoying cookie popups as a result of GDPR.


You haven't convinced me that preventing browser tracking is making the internet "slightly worse overall".

If sites are having trouble converting me, perhaps it's not me that's the problem.


The issue is that most sites can no longer tell if they are converting you.


It's not obvious to me that that is a problem for me, or that it makes the internet worse.


The popups are a result of tracking, not GDPR. Websites without tracking don't need to have them.

It's somewhat amusing that the overlap of garbage content farms and sites with annoying consent popups is almost perfect. I wonder if it could be used for search engine ranking.


I don't get this summary. How are you capturing a larger part of conversations?


Conversions, not conversations, if that helps?


Also it might be hard to ensure you aren't externalising "cost"



