Evaluators with CriticGPT outperform those without 60% of the time.
So, slightly better than random chance. I guess a win is a win, but I would have thought this would be higher. I'd have kind of assumed that just asking GPT itself if it's sure would give this kind of lift.
I'm not sure why 60 vs. 40 counts as slightly better than random chance. A person using this system has a 50% higher success rate than someone not using it; I wouldn't call that a slightly better result.
You can see the plots if you prefer, or think of it this way: out of a total of 100 trials, one team wins 40 and the other wins 60, and 60 = 40 + 40 × 50%, i.e. the second team wins 50% more often.
Or take a 75% win rate as a more extreme example: you could say it's 25% above random, or you could say one team wins 3 times as many cases as the other. Both statements are equivalent, but I think the second conveys the strength of the difference much better.
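To make the arithmetic concrete, here's a minimal sketch in Python (the relative_lift helper is just an illustrative name, not anything from the paper) that turns a head-to-head win rate into the relative lift over the losing side:

    # Given a head-to-head win rate p for team A (so team B wins 1 - p),
    # compute how much more often A wins relative to B.
    def relative_lift(win_rate):
        lose_rate = 1.0 - win_rate
        return win_rate / lose_rate - 1.0  # e.g. 0.6 / 0.4 - 1 = 0.5

    for p in (0.60, 0.75):
        print(f"{p:.0%} win rate -> {relative_lift(p):+.0%} more wins than the other team")

    # 60% win rate -> +50% more wins than the other team
    # 75% win rate -> +200% more wins than the other team

Same numbers as above: 60/40 is a 50% lift, and 75/25 means one side wins 3 times as often.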
The results in this work are statistically significant and substantial.
Not sure what this means, but if someone asked me to critique iOS code, for example, I wouldn't be much help, since I don't know the first thing about it other than some generic best practices.
I'm sure ChatGPT would outperform me, and I could only aid it in very limited ways.
That doesn't mean an expert iOS programmer wouldn't run circles around it.