Aside: bisecting flakes doesn't have to involve repeated runs. You can reformulate bisection as an information probing operation, expanding the scope to support noisy benchmarks or low-probability flakes. Bayesian inference narrows down the probable range of the failure for each new observation, and you can choose new probes to maximize information gain-- or even run them in parallel to minimize the total time.
You do have to provide flake rate probability to do the probability estimates, but even roughly correct rates work fine. Running bisects assuming a 5% chance of getting a false positive or negative barely adds more steps and greatly improves robustness.
The math is pretty simple too-- my old prototype might still be in Google's monorepo; I should reimplement it for the open source world.
The Probabilistic Bisection Algorithm by Horstein, original paper is "Sequential transmission using noiseless feedback," IEEE Trans. Inform. Theory, 9(3):136–143. https://ieeexplore.ieee.org/document/1057832
Here's a nice presentation on an analysis of the method, references start on slide 47
Paraphrasing the theorem from Jedynak, Frazier, Sznitman (2011):
Suppose [flake probability] is constant, known, and bounded away from 1/2, and we use the entropy loss function. ... The policy that chooses [next test point] at the median of [current posterior distribution] is optimal.
And some python code that implements a version of the analysis
You apply the same process, but your hypotheses are "bug is before/after this bisection point". Your "probability of evidence given hypothesis before/after" are where you incorporate your guess about the tests flakiness. Still works even if you don't have "true" numbers for the tests flakiness, just won't converge as quickly
You do have to provide flake rate probability to do the probability estimates, but even roughly correct rates work fine. Running bisects assuming a 5% chance of getting a false positive or negative barely adds more steps and greatly improves robustness.
The math is pretty simple too-- my old prototype might still be in Google's monorepo; I should reimplement it for the open source world.