
The OP's comment is probably a testament to that. With such a poorly designed A/B test, I doubt this has a p-value of < 0.10.



Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5 (an exact binomial test puts the p-value on the order of 10^-4), well beyond any standard statistical significance threshold I've ever heard of. I don't know how many ratings they collected, but 1000 doesn't seem crazy at all. That assumes, of course, that raters are blind to which model is which and that the order of the two responses is randomized with every rating -- or is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize or blind the raters?
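
For anyone who wants to check the arithmetic, here's a minimal sketch using SciPy's exact binomial test; n=1000 and wins=560 are assumptions standing in for the reported 0.56 win rate, since the actual sample size isn't published:

  # Sketch: significance of a 0.56 win rate against the null of 0.5.
  # n=1000 and wins=560 are hypothetical; the real counts aren't published.
  from scipy.stats import binomtest

  n = 1000
  wins = 560
  result = binomtest(wins, n, p=0.5, alternative="greater")
  print(f"one-sided exact binomial p-value: {result.pvalue:.2e}")
  # Prints a value on the order of 1e-4 -- far below 0.05, *if* the ratings
  # are independent, blinded, and order-randomized.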


  > If so, where do they indicate they failed to randomize/blind the raters?

  Win rate if user is under time constraint

This is hard to read, tbh. Is it STEM? Non-STEM? If it's STEM, this shows a bias; if it's non-STEM, this also shows a bias. If it's a mix, we can't know anything without understanding the split.

Note that non-STEM is still within the error bars, and STEM is less than a 2-sigma difference, so our confidence still shouldn't be that high.
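
A rough sketch of that 2-sigma check, using the normal approximation for a binomial proportion; the win rates and per-category sample sizes below are made up for illustration, since the post doesn't publish them:

  # Sketch of the "within 2 sigma" check, via the normal approximation
  # for a binomial proportion. Win rates and per-category sample sizes
  # are hypothetical; the post doesn't publish them.
  import math

  def sigmas_above_chance(win_rate, n):
      """How many null standard errors the win rate sits above 0.5."""
      se = math.sqrt(0.5 * 0.5 / n)  # standard error under the null p = 0.5
      return (win_rate - 0.5) / se

  for label, win_rate, n in [("STEM (assumed)", 0.54, 400),
                             ("non-STEM (assumed)", 0.51, 400)]:
      z = sigmas_above_chance(win_rate, n)
      verdict = "clears" if z >= 2 else "does not clear"
      print(f"{label}: {z:.1f} sigma -> {verdict} the 2-sigma bar")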


Because you're not testing "will a user click the left or right button" (for which asking a thousand users to click a button would be a pretty good estimation), you're testing "which response is preferred".

If 10% of people just click based on how fast the response was because they don't want to read both outputs, your p-value for the latter hypothesis will be atrocious, no matter how large the sample is.
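
To make that concrete, here's a toy simulation (every parameter below is a made-up illustration, not a figure from the study): even with a huge sample, latency-driven clicks pull the measured win rate away from the true quality preference.

  # Toy simulation: a fraction of latency-driven clicks biases the
  # measured "preference" no matter how large the sample is.
  # All parameters are hypothetical illustrations.
  import random

  random.seed(0)

  true_quality_pref = 0.50    # assume the two models are equally good
  speed_clicker_rate = 0.10   # 10% of raters just pick the faster response

  n = 100_000
  wins = 0
  for _ in range(n):
      if random.random() < speed_clicker_rate:
          wins += 1                                     # faster model wins the click
      else:
          wins += random.random() < true_quality_pref   # click driven by quality

  print(f"measured win rate: {wins / n:.3f}")  # ~0.55 despite equal quality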


Yes, I am assuming they evaluated the models in good faith and understand how to design a basic user study: when they ran a study intended to compare response quality between two models, they showed raters both fully formed responses at the same time, regardless of each model's actual latency.


I would recommend you read the comment that started this thread, then, because that's the context we're talking about: https://news.ycombinator.com/item?id=42891294


I did read that comment. I don't think that person is saying they were part of the study that OpenAI used to evaluate the models. They would probably know if they had gotten paid to evaluate LLM responses.

But I'm glad you pointed that out; I now suspect a large part of the disagreement between the "huh? a statistically significant blind evaluation is a statistically significant blind evaluation" repliers and the "oh, this was obviously a terrible study" repliers comes down to different interpretations of that post. Thanks. I genuinely didn't consider the alternative interpretation before.


> If 10% of people just click based on how fast the response was

Couldn't this be considered a form of preference?

Whether it's the type of preference OpenAI was testing for, or the type of preference you care about, is another matter.


Sure, it could be: you can define "preference" as basically anything, but it loses its meaning if you do. I think most people would read "56% prefer this product" as "when well-informed, 56% of users would rather have this product than the other".



