I've seen one A/B test in the wild run for a full year.
It was a small change (a test of a product name + description across the site), and the most interesting aspect was that it made only a small but measurable difference. Because of that, there was no strong incentive to delete the A/B test (not much harm to the user) nor to make the B side permanent (too small an effect).
In that respect, A/B tests that end early aren't a bad thing IMO: either there's a clear improvement or it's really bad, and the choice is obvious enough that you don't have to wait much longer.
You can measure the direct effect of a change now on something like conversions. But you can't measure the second-order effects: things like trust from your users, or effects on community quality and composition, etc.
This is a good part of why enshittification happens: lots of changes with immediate "good" impact that can be measured quantitatively, but there are also readily foreseeable negative consequences to them.
Of course, just running the test longer doesn't really address this for most possible changes.
But even that doesn't work: if you are continually making choices that erode your users' trust in you, there will eventually be an impact. It happens outside the experiment (e.g. communication between users, a general slide in sentiment, etc.), and you can't simply spot in the time series whether you've gone too far.
Late to this comment thread, but Amazon actually excels at this type of long term measurement, through methodologies internally called HVA/DSI and DSE (to name just a couple).
- High Value Action / Downstream Impact == using a "twins" comparison, estimate the 12-month impact of a customer taking a particular action (e.g. signing up for Prime, watching their first Prime Video, etc.), compared to a similar customer who doesn't. HVAs are basically those "A's" which turn out to have a high numeric DSI value.
- Downstream Expectation == similar in spirit but quite different - instead of quantifying the impact of a single action, DSE tries to estimate the combined downstream causal impact of a user taking an initial action. There's a sophisticated methodology there that tries to strip away confounding factors like "rich people who would've shopped more anyway are also naturally more likely to sign up for Prime", because they truly want to measure the causal benefit of Prime itself, separate from the fact that richer customers generally spend more no matter what.
These are both long-term methodologies that were explicitly designed in response to two problems: short-term experiments that didn't capture long-term negative effects, and different parts of Amazon having vastly different methodologies for measuring business impact (e.g. page views vs search impressions vs downloads vs orders vs whatever... no, everyone should optimize for the same customer-level financial metric, which is a flavor of growth-adjusted composite contribution profit (GCCP) that's partly derived from DSE).
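For anyone curious what the "twins" idea might look like mechanically, here's a toy sketch -- my own illustration in Python with made-up column names, not Amazon's actual HVA/DSI/DSE implementation. The idea: match each customer who took the action to the most similar customer who didn't (a propensity score as a crude stand-in for the confounder stripping described above), then compare their 12-month outcomes.

    # Toy illustration only -- not Amazon's actual HVA/DSI/DSE machinery.
    # All column names below are hypothetical placeholders.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    def naive_dsi(df: pd.DataFrame, action: str, outcome: str, covariates: list) -> float:
        """Estimate the downstream impact of `action` on `outcome` by matching
        each customer who took the action to their nearest "twin" (by propensity
        score) among customers who didn't, then averaging the difference."""
        treated = df[df[action] == 1]
        control = df[df[action] == 0]

        # Propensity score: P(took the action | pre-action characteristics).
        # A crude stand-in for stripping confounders like
        # "rich people would have shopped more anyway".
        model = LogisticRegression(max_iter=1000).fit(df[covariates], df[action])
        t_ps = model.predict_proba(treated[covariates])[:, 1].reshape(-1, 1)
        c_ps = model.predict_proba(control[covariates])[:, 1].reshape(-1, 1)

        # Pair each treated customer with the closest-scoring control customer.
        nn = NearestNeighbors(n_neighbors=1).fit(c_ps)
        _, idx = nn.kneighbors(t_ps)
        twins = control.iloc[idx.ravel()]

        # Average gap in the 12-month outcome between actors and their twins.
        return float((treated[outcome].values - twins[outcome].values).mean())

    # Hypothetical usage (columns invented for illustration):
    # impact = naive_dsi(customers, action="signed_up_for_prime",
    #                    outcome="spend_next_12m",
    #                    covariates=["spend_prev_12m", "orders_prev_12m", "tenure_months"])

The real versions presumably do far more (better matching, many outcomes, longer horizons), but the skeleton is the same: pick an action, find credible counterfactual twins, and difference the downstream financial metric.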
> Late to this comment thread, but Amazon actually excels at this type of long term measurement
It's funny-- Amazon is the exact case I'm thinking of. I went from spending high five figures to a few hundred, and I'm seeking to eliminate that. The impact of these kinds of data-driven management practices has led me to expend a whole lot of effort figuring out how to give Costco, Target, and Walmart my business instead.
My complaints:
- Cutting back customer service. I didn't use customer service often, but it was always exemplary. There's something wrong with my account where, if I pay for a book with points from my Amazon Visa, the book gets yoinked out of my account a couple of days later with a "payment failed" message. The points are still deducted. I spent a few hours with customer support twice on this issue, and each time the specific book it happened to was fixed, but the underlying problem remains. It's clearly a backend problem, but Amazon thinks it's a better move to keep a high-value customer on hold while people not empowered to fix anything futz around.
- I (believe that I) was briefly in an experiment with an alternate "buy it now" order flow that would pretty reliably charge me for 2 of whatever item I was seeking to buy. Support wasn't helpful. I have video.
- Overall devolution of the retail marketplace into a flea market full of counterfeit, dubious goods.
- Aggressive attempts to upsell me back into the Prime ecosystem (e.g. the whole "Iliad flow" thing).
I'm sure all of these business decisions and changes looked great on initial measurements, but they're traps later. Worse, they turn people like me, formerly Amazon evangelists, into people who work to help friends use other marketplaces.
Even a year isn't sufficient time: none of these things pissed me off within a year of the change. And they're pretty difficult to capture, because they're hopelessly confounded with other changes in the market and in consumer sentiment, and Amazon doesn't roll things out slowly enough to have a truly distinct cohort experiencing a different business.
Heck, maybe even all the interactions with me look positive on your metrics, depending on how you weigh downstream effects in your model: a previously valuable customer has "payment problems", begins consuming excessive support resources, then leaves.
There's a lot of good feedback here to chew through, but I'll refrain from diving in too deep and just mention that, as important as the HVA/DSI methodology is, there's been comparatively little research done on "negative HVAs". In theory, one can do the same type of analysis to compare "twins" and pick out the NEGATIVE value of having repeat payment problems or repeat unsuccessful customer service interactions. Optimizing for growing the positive HVAs is fundamentally different from optimizing to reduce the negative ones, but Amazon has the tools to get there, or to do both, if it wants/needs to.
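To make the "negative HVA" idea concrete, a crude stratified version might look something like this (again purely illustrative, with hypothetical column names): bucket customers by prior spend, then within each bucket compare the later spend of customers who hit the bad experience against those who didn't.

    # Toy sketch of a "negative HVA" estimate -- illustrative only, hypothetical columns.
    import pandas as pd

    def negative_hva(df: pd.DataFrame, bad_event: str = "repeat_payment_failures") -> float:
        """Within each prior-spend decile, compare 12-month spend of customers
        who hit the bad event against those who didn't, then average the gaps.
        A negative result approximates the downstream cost of the bad experience."""
        df = df.copy()
        df["stratum"] = pd.qcut(df["spend_prev_12m"], 10, labels=False, duplicates="drop")
        gaps = df.groupby("stratum").apply(
            lambda g: g.loc[g[bad_event] == 1, "spend_next_12m"].mean()
                      - g.loc[g[bad_event] == 0, "spend_next_12m"].mean()
        )
        return float(gaps.mean())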
And yes, 12 months is arbitrary and doesn't capture everything, and longer windows of analysis are possible, but waiting even longer just throws the signal-to-noise ratio too far in the direction of noise.
FWIW, I'm no longer at Amazon, but I've yet to see a company of significant scale apply this level of econometrics so rigorously in day-to-day business decisions, or use 12 months as a baseline evaluation window (most companies and most A/B tests use much shorter ones, obviously). I'm sorry you've had bad experiences, and anyway I think it's good for society overall to cultivate strong alternatives to Amazon, but as invisible as it may be to you as a consumer, your data and your lost value as a customer are definitely accounted for within these methodologies, even if no visible changes are happening or they're not winning you back.
I appreciate your comment. I guess what I'm saying is:
I love statistics and econometrics and testing beliefs with data.
But at some point, you do need to think about how to relate to human beings and what is, overall, "good business." That is, data are not replacements for clinical judgment about what is reasonable.
Coming up with ever-more-sophisticated ways to measure what is revenue-maximizing but "not quite too abusive" isn't how we keep a good reputation or create a good world to live in.
Of course, completely ignoring indicators and making choices purely based on intuition and values isn't great, either.