the algorithms for using binary search to efficiently reduce a set satisfying some predicate to a locally minimal satisfying subset* are new to me (though cox says zeller published a slightly buggy version in 01999! and meta's cinder a correct one in 02021), and seem brilliant; their applications are not limited to debugging. i wonder how they relate to hypothesis's test-case reduction algorithm; can one of them be considered an application of the other?
also, this idea of binary-search debugging over the program's call tree rather than your revision history (or program input, or a set of data instances) is a brilliant one. and although they published it a decade ago, i hadn't heard about it until now
the examples of asynctimerchan=1, changing optimization settings, and changing sort algorithms have in common that in some sense they are behavior-preserving, so you can toggle them on and off at will during execution without breaking anything. i wonder how to apply this call-tree debugging if the change you're trying to narrow down is a change that has to be consistent throughout the program's execution. for example, suppose some code using your hash tables breaks when you switch to a new hash function, maybe because it inadvertently depended on enumeration order. if you change the hash function partway through the program, you won't be able to find things in your hash tables after that. you could change the algorithm per table, of course, and narrow it down to a particular table, but that won't give you the particular line of code
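concretely, the per-table workaround might look something like this sketch (python, made-up names, not the actual go implementation): each table makes its choice once, at creation, by hashing its creation call stack, so it stays internally consistent for its whole lifetime, and bisecting over the set of creation-site hashes narrows the failure to a specific allocation site without ever changing a table's hash function mid-run:

import hashlib
import traceback

def creation_site_hash(limit=16):
    # hash the innermost frames of the creation call stack, so the same
    # allocation site gets the same hash on every (deterministic) run;
    # truncating to a few frames keeps the overhead bounded
    frames = traceback.extract_stack(limit=limit)[:-1]
    key = ';'.join('%s:%d' % (f.filename, f.lineno) for f in frames)
    return hashlib.sha256(key.encode()).hexdigest()[:16]

class Table:
    def __init__(self, forced_new=frozenset()):
        # the bisect harness supplies forced_new: the set of creation-site
        # hashes that must use the new hash function on this run
        self.uses_new_hash = creation_site_hash() in forced_new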
i need to think a bit more about this issue of 'hashing a list of program counters'. you could of course number the sequence of all subroutine invocations during a (deterministic! single-threaded!) execution, as gas does for macro invocations, and binary-search that dense numbering. (this is a variant of the technique carry_bit is calling 'optimization fuel', but one that requires support from a compiler or debugger.) but, since you're toggling options on and off that will change the number of subroutine calls, the numbering won't be stable; so this will tend to only reliably find single-culprit failures
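as a sketch of what i mean (made-up names; run_program would rerun the whole test with a given fuel budget):

calls_so_far = 0
cutoff = 0  # only the first `cutoff` decision points use the new behavior

def use_new_behavior():
    # consulted at every dynamic decision point, in invocation order
    global calls_so_far
    calls_so_far += 1
    return calls_so_far <= cutoff

def find_culprit(run_program, total_calls):
    # run_program(cutoff) reruns the whole program from scratch and returns
    # True iff the failure reproduced; under the single-culprit assumption
    # the failing cutoffs form a suffix of [0, total_calls], so the
    # smallest failing cutoff numbers the culprit invocation
    lo, hi = 0, total_calls
    while lo < hi:
        mid = (lo + hi) // 2
        if run_program(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo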
you could possibly get a stable-enough numbering using pathnames like /3/5/1, meaning the 1st subroutine called from the 5th subroutine called from the 3rd subroutine called from main(). that seems like it might in some sense be stabler than hashing the entire list of return addresses, and it would certainly permit a lower-overhead implementation using a debugger and breakpoints rather than a check in every leaf call. plausibly i'm overlooking a flaw in this form of 'sequential numbering'? does the hashed list get truncated at some point for stability?
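to make the pathname idea concrete, a sketch (single-threaded, made-up names): keep a stack of per-frame child counters, pushing on entry and popping on exit:

_path = []  # current call path, e.g. [3, 5, 1]
_kids = [0]  # _kids[-1] = calls made so far at the current depth

def traced(fn):
    # decorate every subroutine of interest; /3/5/1 then means the 1st call
    # made by the 5th call made by the 3rd call made from the top level
    def wrapper(*args, **kwargs):
        _kids[-1] += 1
        _path.append(_kids[-1])
        _kids.append(0)
        try:
            return fn(*args, **kwargs)
        finally:
            _kids.pop()
            _path.pop()
    return wrapper

def call_path():
    return '/' + '/'.join(map(str, _path))

a debugger could compute the same numbering with breakpoints instead of instrumenting every call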
often when you have a change that is in some sense behavior-preserving, which is to say, you have two ways to do the same thing, you can use generative testing systems like hypothesis to detect bugs in either of them: process the same input through both paths and verify that the results are equivalent in the appropriate sense. this doesn't require the instrumentation infrastructure russ is using here, but it does depend on you being able to identify the relevant 'input', which can be very hard
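for example, with two stand-in sort implementations (the two functions here are hypothetical placeholders; the point is just feeding the same generated input through both paths):

from hypothesis import given, strategies as st

def old_sort(xs):  # hypothetical old code path
    return sorted(xs)

def new_sort(xs):  # hypothetical new code path under suspicion
    out = list(xs)
    out.sort()
    return out

@given(st.lists(st.integers()))
def test_old_and_new_agree(xs):
    # 'equivalent in the appropriate sense' is just equality here
    assert old_sort(xs) == new_sort(xs)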
in itself that doesn't help with the kinds of bugs he's talking about here, though: bugs where both the old and new code are 'equivalent' by your lights, but some other client code that calls them doesn't find them equivalent. this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code. aslr and hash-table seed randomization are doing this for us for some purposes, but unlike generative-testing frameworks, they provoke outages in production, don't do test-case minimization, and don't record failing cases to make bisection easy and prevent later regressions. and they don't do things like shuffling the input to a non-stable sort subroutine
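for the sort case, the perturbation could be as simple as this sketch: a wrapper with the same contract as an unstable sort, which deliberately randomizes the relative order of equal-keyed elements, so client code that silently depends on one particular order fails under test instead of in production:

import random

def perturbed_sorted(xs, key=lambda x: x, rng=random):
    # same contract as sorted(xs, key=key), except that the relative order
    # of equal-key elements is randomized rather than stable
    xs = list(xs)
    rng.shuffle(xs)
    return sorted(xs, key=key)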
binary-search debugging does indeed feel magical. scaevolus seems to be saying there's a bayesian generalization of it for nondeterministic bugs that are effectively random? you can of course run the test 5 (or 1000) times on each revision you're binary-searching over, but it feels like, if the number of revisions you're searching over is several thousand, you ought to be able to get some additional advantage out of running the test once on each of 5 (or 1000) revisions. can you solve this just by maximizing the expected shannon information of each test?
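here's the kind of thing i mean, as a sketch: assume the bug, once introduced, reproduces with some known probability q on each run and never reproduces before it was introduced; keep a posterior over which revision is the culprit; and test whichever revision maximizes the expected entropy reduction:

import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def normalize(ps):
    total = sum(ps)
    return [p / total for p in ps]

def bayes_bisect(run_test, n, q, budget=50):
    # post[c] = P(revision c introduced the bug), uniform prior
    post = [1.0 / n] * n
    for _ in range(budget):
        h, best_i, best_gain = entropy(post), None, 0.0
        for i in range(n):
            p_fail = q * sum(post[:i + 1])  # P(a run at revision i fails)
            if not 0.0 < p_fail < 1.0:
                continue
            # a failure proves the culprit is <= i; a pass merely
            # down-weights those hypotheses by (1 - q)
            after_fail = normalize([p if c <= i else 0.0
                                    for c, p in enumerate(post)])
            after_pass = normalize([p * (1.0 - q if c <= i else 1.0)
                                    for c, p in enumerate(post)])
            gain = (h - p_fail * entropy(after_fail)
                      - (1.0 - p_fail) * entropy(after_pass))
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:
            break  # posterior has converged
        failed = run_test(best_i)  # one run of the flaky test at best_i
        post = normalize([p * ((q if failed else 1.0 - q) if c <= best_i
                               else (0.0 if failed else 1.0))
                          for c, p in enumerate(post)])
    return max(range(n), key=lambda c: post[c])

with q = 1 this degenerates to ordinary bisection, since the most informative test point is the one that splits the posterior mass in half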
on a side note, it's pretty appalling that 30 years ago the plan9 group had `yesterday -d -n 7 anyfilename` to see what changed in the last week, thanks to their optical jukebox, while in the mainstream we still struggle with accidental file deletion and overwrites despite routinely carrying around terabytes in our pockets
on an even more marginally relevant note, earlier this week i was perusing the 7th edition unix kernel, in which the subroutine that switches stacks (the one with the well-known comment in 6th edition) is called swtch(). and tonight i just realized why russ cox uses that domain name
______
* conventionally this is just called a 'minimal satisfying subset', because it's 'minimal' in the partial-order sense, but i think cox's term 'locally minimal' is clearer
One small correction: we were doing bisect over compiler optimizations a decade ago, but bisect over call trees only happened in the past year or so.
For the hash table case, deciding the function per-table as you suggest is probably good enough. In a large program there are going to be tons of hash tables, and if bisect gets you to "it's this specific hash table allocated by this call stack that matters", that's a huge win.
The timer change is much like the hash table one, in that we toggle the "timer kind" at creation time, but it's only later operations that actually change behavior. Still, identifying the specific timer and call stack that created it turned out to be good enough in all cases.
thank you very much for the correction and the clarifications—and for the excellent post
does the list of callsite return addresses being hashed get truncated at some point for stability? i'd think that hashing the call stack f→g→j→k→m→n→o→p to a hash unrelated to f→g→h→j→k→m→n→o→p would tend to interfere with the 'forced' set, but maybe this is only being used for things where the behavior is so closely equivalent that this problem doesn't arise
The assumption is that for a given test program, the call stack used at the specific moment where the failure begins is the same each run. If that's not true, there will be problems.
If there are two different call stacks that get to the function in question, but only one of them is where the failure begins, you absolutely want to distinguish those.
In practice we do cut off the stacks, using just 16 frames, but that's to avoid O(stack depth) overhead, not to coalesce different things.
i was reading zeller's book today and came to his example of delta-debugging minimization, which his ddmin algorithm solves in 48 tests, and it seemed like your 'easy' algorithm would be more efficient, so i hacked together a test program. your algorithm takes only 29 tests on this case. but because i couldn't remember your algorithm clearly, i first accidentally implemented a different variant, which also works and takes only 21 tests here, though it's not more efficient in the general case. still, it seems interesting, because it is arguably simpler than your 'easy' version (at least if we sweep the call to first_satisfying_bisect under the standard library) but still preserves logarithmic performance when the satisfying set it finds is of bounded size:
def first_satisfying_bisect(pred, lo, hi):
    # standard binary search: smallest i in [lo, hi] with pred(i) true,
    # assuming pred is monotone and pred(hi) holds
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

def bisect_dumb(p, s):
    # repeatedly binary-search for the last element still needed, add its
    # index to the forced set, and shrink the prefix being searched
    end, need = len(s), []
    while True:
        need.sort()  # keep forced elements in original sequence order
        needed = [s[j] for j in need]
        last = first_satisfying_bisect(lambda i: p(s[:i] + needed), 0, end)
        if last == 0:
            return need  # the forced elements alone satisfy p
        end = last - 1
        need.append(end)  # s[end] is needed, given the current forced set

def bisect_easy(p, s, targets=None, need=[]):
    # reduce targets to a locally minimal set of indices whose elements,
    # together with the forced indices in need, still satisfy p
    if targets is None:
        targets = list(range(len(s)))
    if not targets or p([s[i] for i in sorted(need)]):
        return []  # the forced set alone suffices
    if len(targets) == 1:
        return [targets[0]]
    m = len(targets) // 2
    left, right = targets[:m], targets[m:]
    left_reduced = bisect_easy(p, s, left, right + need)
    right_reduced = bisect_easy(p, s, right, left_reduced + need)
    return left_reduced + right_reduced
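for concreteness, here's a quick sanity check of both functions against a hypothetical predicate where the failure reproduces exactly when elements 13, 42, and 77 are all present:

culprits = {13, 42, 77}
p = lambda chosen: culprits <= set(chosen)
s = list(range(100))
assert sorted(s[i] for i in bisect_easy(p, s)) == [13, 42, 77]
assert sorted(s[i] for i in bisect_dumb(p, s)) == [13, 42, 77]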
bisect_easy is much faster if the minimal set it finds, of some size m, is a contiguous run a long way from the beginning of the input; i think it takes time proportional to log m + log n in that case, while bisect_dumb takes m log n
however, in cases where hashing scatters the size-m minimal set all over the search space, instead of compacting it into a little lump, i think it also ends up with m log n performance
i may get to implementing the easy/hard algorithm and zeller's ddmin. also, i have the intuition that, as with quicksort, trisection will have consistently but not overwhelmingly better performance than bisection for this purpose (though not in bisect_dumb)
> this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code
There is a compiler fuzzing method called "equivalence modulo inputs" that does something like this. Take a fuzzed program with fixed inputs and profile it to find parts that are never executed. Change only code in those never-executed parts. If the original and modified programs behave differently, you found a compiler bug: the compiler got confused by code that it can see but that (unknown to the compiler) cannot dynamically influence the program's behavior on the given inputs.
Like so many things in fuzzing, this sounds silly but actually finds bugs in practice.
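A toy version of the idea, with CPython standing in for the compiler and every name invented for illustration: trace which lines a program executes on its fixed input, replace a never-executed statement with `pass`, and check that the output is unchanged:

import ast, contextlib, io, sys

SRC = '''\
def f(x):
    if x > 0:
        return x * 2
    else:
        return -x  # never runs for the input below

print(f(21))
'''

def run_and_trace(src):
    executed, out = set(), io.StringIO()
    def tracer(frame, event, arg):
        if event == 'line' and frame.f_code.co_filename == '<emi>':
            executed.add(frame.f_lineno)
        return tracer
    code = compile(src, '<emi>', 'exec')
    sys.settrace(tracer)
    try:
        with contextlib.redirect_stdout(out):
            exec(code, {})
    finally:
        sys.settrace(None)
    return out.getvalue(), executed

class PruneDead(ast.NodeTransformer):
    # replace statements on never-executed lines with `pass` (an EMI mutation)
    def __init__(self, live):
        self.live = live
    def visit(self, node):
        node = super().visit(node)
        if isinstance(node, ast.stmt) and node.lineno not in self.live:
            return ast.copy_location(ast.Pass(), node)
        return node

orig_out, live = run_and_trace(SRC)
tree = PruneDead(live).visit(ast.parse(SRC))
variant = ast.unparse(ast.fix_missing_locations(tree))
new_out, _ = run_and_trace(variant)
assert new_out == orig_out, 'EMI mismatch: implementation disagrees'

Real EMI work mutates C programs and feeds both versions to the compiler under test, but the shape is the same.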