the algorithms for using binary search to efficiently reduce a set satisfying some predicate to a locally minimal satisfying subset* are new to me (though cox says zeller published a slightly buggy version in 01999! and meta's cinder a correct one in 02021), and seem brilliant; their applications are not limited to debugging. i wonder how they relate to hypothesis's test-case reduction algorithm; can one of them be considered an application of the other?
also, this idea of binary-search debugging over the program's call tree rather than your revision history (or program input, or a set of data instances) is a brilliant one. and although they published it a decade ago, i hadn't heard about it until now
the examples of asynctimerchan=1, changing optimization settings, and changing sort algorithms have in common that in some sense they are behavior-preserving, so you can toggle them on and off at will during execution without breaking anything. i wonder how to apply this call-tree debugging if the change you're trying to narrow down is a change that has to be consistent throughout the program's execution. for example, suppose some code using your hash tables breaks when you switch to a new hash function, maybe because it inadvertently depended on enumeration order. if you change the hash function partway through the program, you won't be able to find things in your hash tables after that. you could change the algorithm per table, of course, and narrow it down to a particular table, but that won't give you the particular line of code
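concretely, the per-table workaround might look something like this sketch (python, made-up names, not the actual go implementation): each table makes its choice once, at creation, by hashing its creation call stack, so it stays internally consistent for its whole lifetime, and bisecting over the set of creation-site hashes narrows the failure to a specific allocation site without ever changing a table's hash function mid-run:

import hashlib
import traceback

def creation_site_hash(limit=16):
    # hash the innermost frames of the creation call stack, so the same
    # allocation site gets the same hash on every (deterministic) run;
    # truncating to a few frames keeps the overhead bounded
    frames = traceback.extract_stack(limit=limit)[:-1]
    key = ';'.join('%s:%d' % (f.filename, f.lineno) for f in frames)
    return hashlib.sha256(key.encode()).hexdigest()[:16]

class Table:
    def __init__(self, forced_new=frozenset()):
        # the bisect harness supplies forced_new: the set of creation-site
        # hashes that must use the new hash function on this run
        self.uses_new_hash = creation_site_hash() in forced_new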
i need to think a bit more about this issue of 'hashing a list of program counters'. you could of course number the sequence of all subroutine invocations during a (deterministic! single-threaded!) execution, as gas does for macro invocations, and binary-search that dense numbering. (this is a variant of the technique carry_bit is calling 'optimization fuel', but one that requires support from a compiler or debugger.) but, since you're toggling options on and off that will change the number of subroutine calls, the numbering won't be stable; so this will tend to only reliably find single-culprit failures
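as a sketch of what i mean (made-up names; run_program would rerun the whole test with a given fuel budget):

calls_so_far = 0
cutoff = 0  # only the first `cutoff` decision points use the new behavior

def use_new_behavior():
    # consulted at every dynamic decision point, in invocation order
    global calls_so_far
    calls_so_far += 1
    return calls_so_far <= cutoff

def find_culprit(run_program, total_calls):
    # run_program(cutoff) reruns the whole program from scratch and returns
    # True iff the failure reproduced; under the single-culprit assumption
    # the failing cutoffs form a suffix of [0, total_calls], so the
    # smallest failing cutoff numbers the culprit invocation
    lo, hi = 0, total_calls
    while lo < hi:
        mid = (lo + hi) // 2
        if run_program(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo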
you could possibly get a stable-enough numbering using pathnames like /3/5/1, meaning the 1st subroutine called from the 5th subroutine called from the 3rd subroutine called from main(). that seems like it might in some sense be stabler than hashing the entire list of return addresses, and it would certainly permit a lower-overhead implementation using a debugger and breakpoints rather than a check in every leaf call. plausibly i'm overlooking a flaw in this form of 'sequential numbering'? does the hashed list get truncated at some point for stability?
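to make the pathname idea concrete, a sketch (single-threaded, made-up names): keep a stack of per-frame child counters, pushing on entry and popping on exit:

_path = []  # current call path, e.g. [3, 5, 1]
_kids = [0]  # _kids[-1] = calls made so far at the current depth

def traced(fn):
    # decorate every subroutine of interest; /3/5/1 then means the 1st call
    # made by the 5th call made by the 3rd call made from the top level
    def wrapper(*args, **kwargs):
        _kids[-1] += 1
        _path.append(_kids[-1])
        _kids.append(0)
        try:
            return fn(*args, **kwargs)
        finally:
            _kids.pop()
            _path.pop()
    return wrapper

def call_path():
    return '/' + '/'.join(map(str, _path))

a debugger could compute the same numbering with breakpoints instead of instrumenting every call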
often when you have a change that is in some sense behavior-preserving, which is to say, you have two ways to do the same thing, you can use generative testing systems like hypothesis to detect bugs in either of them: process the same input through both paths and verify that the results are equivalent in the appropriate sense. this doesn't require the instrumentation infrastructure russ is using here, but it does depend on you being able to identify the relevant 'input', which can be very hard
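for example, with two stand-in sort implementations (the two functions here are hypothetical placeholders; the point is just feeding the same generated input through both paths):

from hypothesis import given, strategies as st

def old_sort(xs):  # hypothetical old code path
    return sorted(xs)

def new_sort(xs):  # hypothetical new code path under suspicion
    out = list(xs)
    out.sort()
    return out

@given(st.lists(st.integers()))
def test_old_and_new_agree(xs):
    # 'equivalent in the appropriate sense' is just equality here
    assert old_sort(xs) == new_sort(xs)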
in itself that doesn't help with the kinds of bugs he's talking about here, though: bugs where both the old and new code are 'equivalent' by your lights, but some other client code that calls them doesn't find them equivalent. this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code. aslr and hash-table seed randomization are doing this for us for some purposes, but unlike generative-testing frameworks, they provoke outages in production, don't do test-case minimization, and don't record failing cases to make bisection easy and prevent later regressions. and they don't do things like shuffling the input to a non-stable sort subroutine
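for the sort case, the perturbation could be as simple as this sketch: a wrapper with the same contract as an unstable sort, which deliberately randomizes the relative order of equal-keyed elements, so client code that silently depends on one particular order fails under test instead of in production:

import random

def perturbed_sorted(xs, key=lambda x: x, rng=random):
    # same contract as sorted(xs, key=key), except that the relative order
    # of equal-key elements is randomized rather than stable
    xs = list(xs)
    rng.shuffle(xs)
    return sorted(xs, key=key)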
binary-search debugging does indeed feel magical. scaevolus seems to be saying there's a bayesian generalization of it for nondeterministic bugs that are effectively random? you can of course run the test 5 (or 1000) times on each revision you're binary-searching over, but it feels like, if the number of revisions you're searching over is several thousand, you ought to be able to get some additional advantage out of running the test once on each of 5 (or 1000) revisions. can you solve this just by maximizing the expected shannon information of each test?
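here's the kind of thing i mean, as a sketch: assume the bug, once introduced, reproduces with some known probability q on each run and never reproduces before it was introduced; keep a posterior over which revision is the culprit; and test whichever revision maximizes the expected entropy reduction:

import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def normalize(ps):
    total = sum(ps)
    return [p / total for p in ps]

def bayes_bisect(run_test, n, q, budget=50):
    # post[c] = P(revision c introduced the bug), uniform prior
    post = [1.0 / n] * n
    for _ in range(budget):
        h, best_i, best_gain = entropy(post), None, 0.0
        for i in range(n):
            p_fail = q * sum(post[:i + 1])  # P(a run at revision i fails)
            if not 0.0 < p_fail < 1.0:
                continue
            # a failure proves the culprit is <= i; a pass merely
            # down-weights those hypotheses by (1 - q)
            after_fail = normalize([p if c <= i else 0.0
                                    for c, p in enumerate(post)])
            after_pass = normalize([p * (1.0 - q if c <= i else 1.0)
                                    for c, p in enumerate(post)])
            gain = (h - p_fail * entropy(after_fail)
                      - (1.0 - p_fail) * entropy(after_pass))
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:
            break  # posterior has converged
        failed = run_test(best_i)  # one run of the flaky test at best_i
        post = normalize([p * ((q if failed else 1.0 - q) if c <= best_i
                               else (0.0 if failed else 1.0))
                          for c, p in enumerate(post)])
    return max(range(n), key=lambda c: post[c])

with q = 1 this degenerates to ordinary bisection, since the most informative test point is the one that splits the posterior mass in half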
on a side note, it's pretty appalling that 30 years ago the plan9 group had `yesterday -d -n 7 anyfilename` to see what changed in the last week, thanks to their optical jukebox, while in the mainstream we still struggle with accidental file deletion and overwrites despite routinely carrying around terabytes in our pockets
on an even more marginally relevant note, earlier this week i was perusing the 7th edition unix kernel, in which the subroutine that switches stacks (the one with the well-known comment in 6th edition) is called swtch(). and tonight i just realized why russ cox uses that domain name
______
* conventionally this is just called a 'minimal satisfying subset', because it's 'minimal' in the partial-order sense, but i think cox's term 'locally minimal' is clearer
One small correction: we were doing bisect over compiler optimizations a decade ago, but bisect over call trees only happened in the past year or so.
For the hash table case, deciding the function per-table as you suggest is probably good enough. In a large program there are going to be tons of hash tables, and if bisect gets you to "it's this specific hash table allocated by this call stack that matters", that's a huge win.
The timer change is much like the hash table one, in that we toggle the "timer kind" at creation time, but it's only later operations that actually change behavior. Still, identifying the specific timer and call stack that created it turned out to be good enough in all cases.
thank you very much for the correction and the clarifications—and for the excellent post
does the list of callsite return addresses being hashed get truncated at some point for stability? i'd think that hashing the call stack f→g→j→k→m→n→o→p to a hash unrelated to f→g→h→j→k→m→n→o→p would tend to interfere with the 'forced' set, but maybe this is only being used for things where the behavior is so closely equivalent that this problem doesn't arise
The assumption is that for a given test program, the call stack used at the specific moment where the failure begins is the same each run. If that's not true, there will be problems.
If there are two different call stacks that get to the function in question, but only one of them is where the failure begins, you absolutely want to distinguish those.
In practice we do cut off the stacks, using just 16 frames, but that's to avoid O(stack depth) overhead, not to coalesce different things.
i was reading zeller's book today and came to his example of delta-debugging minimization, which his ddmin algorithm solves in 48 tests, and it seemed like your 'easy' algorithm would be more efficient, so i hacked together a test program. your algorithm takes only 29 tests on this case. but because i couldn't remember your algorithm clearly, i first accidentally implemented a different variant, which also works and takes only 21 tests here, though it's not more efficient in the general case. still, it seems interesting, because it is arguably simpler than your 'easy' version (at least if we sweep the call to first_satisfying_bisect under the standard library) but still preserves logarithmic performance when the satisfying set it finds is of bounded size:
def first_satisfying_bisect(pred, lo, hi):
    # standard binary search: smallest i in [lo, hi] with pred(i) true,
    # assuming pred is monotone and pred(hi) holds
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

def bisect_dumb(p, s):
    # repeatedly binary-search for the last element still needed, add its
    # index to the forced set, and shrink the prefix being searched
    end, need = len(s), []
    while True:
        need.sort()  # keep forced elements in original sequence order
        needed = [s[j] for j in need]
        last = first_satisfying_bisect(lambda i: p(s[:i] + needed), 0, end)
        if last == 0:
            return need  # the forced elements alone satisfy p
        end = last - 1
        need.append(end)  # s[end] is needed, given the current forced set

def bisect_easy(p, s, targets=None, need=[]):
    # reduce targets to a locally minimal set of indices whose elements,
    # together with the forced indices in need, still satisfy p
    if targets is None:
        targets = list(range(len(s)))
    if not targets or p([s[i] for i in sorted(need)]):
        return []  # the forced set alone suffices
    if len(targets) == 1:
        return [targets[0]]
    m = len(targets) // 2
    left, right = targets[:m], targets[m:]
    left_reduced = bisect_easy(p, s, left, right + need)
    right_reduced = bisect_easy(p, s, right, left_reduced + need)
    return left_reduced + right_reduced
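for concreteness, here's a quick sanity check of both functions against a hypothetical predicate where the failure reproduces exactly when elements 13, 42, and 77 are all present:

culprits = {13, 42, 77}
p = lambda chosen: culprits <= set(chosen)
s = list(range(100))
assert sorted(s[i] for i in bisect_easy(p, s)) == [13, 42, 77]
assert sorted(s[i] for i in bisect_dumb(p, s)) == [13, 42, 77]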
bisect_easy is much faster if the minimal set it finds, of some size m, is a contiguous run a long way from the beginning of the input; i think it takes time proportional to log m + log n in that case, while bisect_dumb takes m log n
however, in cases where hashing scatters the size-m minimal set all over the search space, instead of compacting it into a little lump, i think it also ends up with m log n performance
i may get to implementing the easy/hard algorithm and zeller's ddmin. also, i have the intuition that, as with quicksort, trisection will have consistently but not overwhelmingly better performance than bisection for this purpose (though not in bisect_dumb)
> this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code
There is a compiler fuzzing method called "equivalence modulo inputs" that does something like this. Take a fuzzed program with fixed inputs and profile it to find parts that are never executed. Change only code in those never-executed parts. If the original and modified programs behave differently, you found a compiler bug: the compiler got confused by code that it can see but that (unknown to the compiler) cannot dynamically influence the program's behavior on the given inputs.
Like so many things in fuzzing, this sounds silly but actually finds bugs in practice.
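A toy version of the idea, with CPython standing in for the compiler and every name invented for illustration: trace which lines a program executes on its fixed input, replace a never-executed statement with `pass`, and check that the output is unchanged:

import ast, contextlib, io, sys

SRC = '''\
def f(x):
    if x > 0:
        return x * 2
    else:
        return -x  # never runs for the input below

print(f(21))
'''

def run_and_trace(src):
    executed, out = set(), io.StringIO()
    def tracer(frame, event, arg):
        if event == 'line' and frame.f_code.co_filename == '<emi>':
            executed.add(frame.f_lineno)
        return tracer
    code = compile(src, '<emi>', 'exec')
    sys.settrace(tracer)
    try:
        with contextlib.redirect_stdout(out):
            exec(code, {})
    finally:
        sys.settrace(None)
    return out.getvalue(), executed

class PruneDead(ast.NodeTransformer):
    # replace statements on never-executed lines with `pass` (an EMI mutation)
    def __init__(self, live):
        self.live = live
    def visit(self, node):
        node = super().visit(node)
        if isinstance(node, ast.stmt) and node.lineno not in self.live:
            return ast.copy_location(ast.Pass(), node)
        return node

orig_out, live = run_and_trace(SRC)
tree = PruneDead(live).visit(ast.parse(SRC))
variant = ast.unparse(ast.fix_missing_locations(tree))
new_out, _ = run_and_trace(variant)
assert new_out == orig_out, 'EMI mismatch: implementation disagrees'

Real EMI work mutates C programs and feeds both versions to the compiler under test, but the shape is the same.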