I'd love to see someone who knows more explore this, but: statistical tests and procedures cannot be applied blindly. They all have prerequisites before their results mean anything. A common one is the assumption that the underlying distribution is Gaussian. In that case, the average plus the standard deviation tells you something very useful. On the other hand, nothing stops you from taking the average and standard deviation of a Poisson process, but the numbers will be much less meaningful, and often quite deceptive.
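To make that concrete, here's a minimal Python sketch (the parameters are invented): the same mean-plus-two-standard-deviations summary that works nicely for a Gaussian produces an interval with poor coverage, and even a negative lower bound, for a skewed Poisson:

    import numpy as np

    rng = np.random.default_rng(0)
    gaussian = rng.normal(loc=3.0, scale=1.0, size=100_000)
    poisson = rng.poisson(lam=0.5, size=100_000)   # skewed count data

    for name, x in [("gaussian", gaussian), ("poisson", poisson)]:
        lo, hi = x.mean() - 2 * x.std(), x.mean() + 2 * x.std()
        coverage = np.mean((x >= lo) & (x <= hi))
        print(f"{name}: mean={x.mean():.2f}, std={x.std():.2f}, "
              f"2-sigma interval=({lo:.2f}, {hi:.2f}), coverage={coverage:.3f}")
    # The Poisson interval dips below zero (impossible for counts) and its
    # coverage is off from the ~95% a Gaussian reader would expect.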
I'm concerned that the assumptions those statistical tests require are not being met. As far as I know, significance tests are generally based on the assumption that all samples are independent, but MAB grossly violates that assumption, and I do mean grossly. MAB doesn't pass those tests because it is, in some sense, itself a statistical significance test, and it is "deliberately" not feeding the independence-based significance algorithms the data they expect to be getting. This is not a bug, it's a feature.
(Pardon the anthropomorphizing there. It's still sometimes the best way to say something quickly in English.)
In fact, the observation that you agree MAB has a higher conversion rate (which in this context means nothing more and nothing less than "works better"), but that there's some measure Q on which it does worse, is probably better interpreted as evidence that Q is not a useful measurement, rather than that the thing that works better shouldn't be used due to its lack of Q-ness.
In this case, because you are measuring conversion vs. no-conversion, the underlying distribution is actually binomial. With a large enough sample size it is well approximated by a Gaussian, but it is still slightly different.
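A quick sketch of that, assuming scipy is available (n and p are made up): the exact binomial CDF against its Gaussian approximation for something like a 5% conversion rate is close in the middle but drifts in the tails:

    from scipy import stats

    n, p = 1000, 0.05                   # e.g. 1,000 visitors, 5% conversion
    binom = stats.binom(n, p)
    gauss = stats.norm(loc=n * p, scale=(n * p * (1 - p)) ** 0.5)

    for k in (35, 40, 50, 60, 65):      # close near the mean, off in the tails
        print(f"P(X <= {k}): binomial={binom.cdf(k):.4f}, "
              f"normal approx={gauss.cdf(k):.4f}")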
Also, I don't see the independence problem. People are still randomly assigned to one of the conditions in the MAB scheme, people in one sample don't affect people in the other sample, and each person contributes one independent data point. The problem is just one of unequal sample sizes in your two conditions: when you make the comparison (in a one-factor ANOVA, say), you lose a lot of statistical power, because your effective sample size is essentially the harmonic mean of the two sample sizes (which is heavily biased towards the smaller of the two).
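A sketch of that power loss (the 5% baseline rate is invented): three ways of splitting the same 10,000 visitors, with the standard error of the difference growing as the split becomes more lopsided:

    def harmonic_mean(n1, n2):
        return 2 / (1 / n1 + 1 / n2)

    var = 0.05 * (1 - 0.05)             # Bernoulli variance at a 5% rate

    for n1, n2 in [(5000, 5000), (9000, 1000), (9900, 100)]:
        se = (var * (1 / n1 + 1 / n2)) ** 0.5
        print(f"n1={n1}, n2={n2}: SE of difference={se:.4f}, "
              f"effective per-group n={harmonic_mean(n1, n2):.0f}")
    # Same total traffic each time, but the effective sample size collapses
    # towards the smaller cohort as the allocation becomes unbalanced.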
"People are still randomly assigned to one of the conditions in the MAB scheme"
No, they're only partially randomly assigned to one of the conditions. MAB has a memory of the results of previous runs, which is just another way of saying that there's a dependence.
I'm not so sure. By that logic, no experiment has random assignment. I know when I place subject X in condition Y that I do so because I need Z people in condition Y. Thus, I have some information about where I've placed other people. How is this any different?
The point is that with bandit models, you're placing people in a cohort as a function of the previous conversion rates of your cohorts. That's the dependence.
By contrast, randomly allocating X% of your population to cohort C_i is not a dependence on anything but the free parameter (X).
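To illustrate the contrast (epsilon-greedy is just one bandit policy, and the rates here are invented): the bandit's assignment of the next visitor reads the outcomes of earlier visitors, while a fixed split reads nothing but its parameter:

    import random

    random.seed(0)
    arms = {"A": [1, 0], "B": [1, 0]}        # [shows, conversions]
    true_rates = {"A": 0.05, "B": 0.04}      # hypothetical
    epsilon = 0.1

    for visitor in range(10_000):
        if random.random() < epsilon:
            arm = random.choice(list(arms))  # explore: genuinely random
        else:                                # exploit: depends on past results
            arm = max(arms, key=lambda a: arms[a][1] / arms[a][0])
        arms[arm][0] += 1
        arms[arm][1] += random.random() < true_rates[arm]

    print({a: shows for a, (shows, _) in arms.items()})
    # A fixed split would instead be: arm = "A" if random.random() < 0.5 else "B"
    # -- no term in that rule depends on earlier outcomes.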
"The results will still be conditionally independent given the assignments"
If the assignments are the source of the problem, you can't just take them as "given" and assume that things are OK.
A simpler example: if I assign people to cohorts based on their age, then the results of my experiment may be "conditionally independent" with respect to (say) eye color, but it probably won't be conditionally independent with respect to income or reading level or pant size. In other words, the assumption of independence is violated with respect to all variables correlated with age.
With bandit optimization, our bucket-allocation scheme is correlated with time, which makes it nearly impossible to apply conventional statistical tests to any metric that may also be correlated with time. And in web testing, that's nearly everything.
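A toy simulation of that trap (all numbers invented): both arms convert identically, but because the allocation drifts over time and the base rate also drifts over time, the measured per-arm rates come out different:

    import random

    random.seed(1)
    tallies = {"A": [0, 0], "B": [0, 0]}          # [shows, conversions]

    for t in range(20_000):
        base_rate = 0.03 if t < 10_000 else 0.06  # conversion drifts upward
        share_a = 0.9 if t < 10_000 else 0.1      # allocation drifts too
        arm = "A" if random.random() < share_a else "B"
        tallies[arm][0] += 1
        tallies[arm][1] += random.random() < base_rate

    for arm, (shows, conv) in tallies.items():
        print(f"{arm}: measured rate={conv / shows:.4f}  (true rates identical)")
    # Arm A saw mostly early (low-rate) traffic, arm B mostly late traffic,
    # so time confounds the comparison even with no real difference.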