> Chain of thought is like trying to improve JPG quality by re-compressing it several times. If it's not there it's not there.
Empirically speaking: I have a set of evals, each consisting of a prompt and an objective pass/fail check. I'm doing codegen, so success is determined by syntax linting, tests passing, and so on. With chain-of-thought included in the prompting, the evals pass at a significantly higher rate. A lot of published research demonstrates the same effect across various domains.
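For concreteness, here's a minimal sketch of the kind of harness I'm describing. The names here are hypothetical: `generate` stands in for whatever model call you use, and `COT_PREFIX` is just one illustrative wording of a CoT instruction.

```python
import ast
import subprocess
import tempfile
from pathlib import Path

# Illustrative CoT instruction; the exact wording in real prompts varies.
COT_PREFIX = "Think through the problem step by step, then write the final code.\n\n"

def generate(prompt: str) -> str:
    """Placeholder for the actual model call (API, local inference, etc.).
    Assumes the returned string is the code itself, already extracted
    from any surrounding reasoning text."""
    raise NotImplementedError

def passes(code: str, test_source: str) -> bool:
    """Objective pass/fail: the code must parse and its tests must pass."""
    try:
        ast.parse(code)  # syntax check: reject anything that doesn't parse
    except SyntaxError:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(test_source)
        result = subprocess.run(["pytest", "-q"], cwd=tmp, capture_output=True)
        return result.returncode == 0

def pass_rate(evals: list[tuple[str, str]], use_cot: bool) -> float:
    """Fraction of (prompt, test_source) evals passed, with or without CoT."""
    outcomes = [
        passes(generate((COT_PREFIX + prompt) if use_cot else prompt), tests)
        for prompt, tests in evals
    ]
    return sum(outcomes) / len(outcomes)
```

Run `pass_rate(evals, use_cot=True)` and `pass_rate(evals, use_cot=False)` over the same eval set and compare; that's the measurement I'm referring to.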
If chain-of-thought can't improve quality, how do you explain the empirical results that appear to contradict you?
The paper is interesting precisely because CoT has been so widely demonstrated to be effective. The point is that it *can* hurt performance on a subset of tasks, not that CoT doesn't work at all.
It's literally in the second line of the abstract: "While CoT has been shown to improve performance across many tasks..."