These are the 3 questions I ask my team on non-deterministic errors:
- Can you reproduce it? (locally)
- No? Then can they reproduce it? (remotely)
- No? Then can you follow the flow byte-by-byte by just looking at the code? You should.
If you can reproduce it, great, you can most probably brute force your way into the cause with local monkey-logging or step-by-step debugging.
If a customer can reproduce it then you may have a shot at remote debugging, injecting logging or requesting a dump of some sort. That's why it's important for an app to have good tools built-in so a customer can send back useful debug info.
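To make "good tools built-in" a bit more concrete, here's a minimal sketch of a debug report a customer could send back. The function name `build_debug_report`, the report fields and the 50-line log tail are all my own assumptions for illustration, not any particular product's format:

```python
import json
import platform
import sys
import traceback
from datetime import datetime, timezone


def build_debug_report(exc: BaseException, recent_log_lines: list[str]) -> str:
    """Return a JSON string a customer could attach to a bug report.

    Hypothetical helper: the field names below are illustrative only.
    """
    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        # Full traceback as a list of formatted lines.
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        # Only the tail of the log, to keep the report small (assumed limit).
        "recent_log": recent_log_lines[-50:],
    }
    return json.dumps(report, indent=2)
```

You'd call it from a top-level exception handler and write the result to a file the customer can email. A real app would probably also want to scrub anything sensitive first.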
If you can't reproduce it, then take a shot at following the flow byte-by-byte, either mentally, with test cases, or a combination of both. Here's a quick guide off the top of my head:
- determine whether there are black spots: places where a variable, the stack, the heap, etc. could hold unexpected data, where your assumptions could be wrong, or where your understanding of the language, library, or any other technology supporting the logic could be incomplete and calls for a reread of the manual.
- order your black spots by probability, starting with the most vulnerable code related to the bug (e.g., for that infinite-loop bug, the recursive function tops the list of weak spots)
- now compare the bug symptoms against that vulnerable code to check whether there's a 100% match. That way you make sure every symptom can be caused by the alleged culprit.
- also do a negative symptom match: think of the other symptoms that fault would cause and make sure they were actually observed (e.g., the recursive function writes zeros to a file besides looping forever; did that happen?)
- if there's more than one possible cause, apply Occam's razor: the simpler one, with the fewest assumptions, is most likely the cause, even if it looks unlikely.
- if there's still no possible explanation, start over with fewer moving parts.
- if a vulnerable fragment has been identified, but no concrete cause or solution found, rewrite the code for robustness, with plenty of assertions, complementary logging and clear error messages. This is good practice anyway: every time you revisit code, it should come out cleaner and more robust than before.
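As a rough sketch of that last point, here's what a robustness rewrite of a suspect recursive function might look like. Everything in it is hypothetical: the `count_nodes` function, the `{"children": [...]}` tree shape and the `MAX_DEPTH` limit are stand-ins I made up for "the recursive function" in the examples above.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

MAX_DEPTH = 100  # assumed limit; pick one that fits your data


def count_nodes(node: dict, depth: int = 0, _path: Optional[set] = None) -> int:
    """Count nodes in a tree of dicts shaped like {"children": [...]}."""
    if _path is None:
        _path = set()

    # Assertion: fail early and loudly if an assumption about the input is wrong.
    assert isinstance(node, dict), f"expected dict node, got {type(node).__name__}"

    # Clear error messages instead of a silent infinite loop.
    if depth > MAX_DEPTH:
        raise RecursionError(f"tree deeper than {MAX_DEPTH} levels at depth {depth}")
    if id(node) in _path:
        raise ValueError(f"cycle detected: node revisited at depth {depth}")

    _path.add(id(node))
    try:
        children = node.get("children", [])
        # Complementary logging: cheap breadcrumbs for the next occurrence.
        logger.debug("depth=%d children=%d", depth, len(children))
        return 1 + sum(count_nodes(child, depth + 1, _path) for child in children)
    finally:
        # Only nodes on the current path count as a cycle, so clean up on the way out.
        _path.discard(id(node))
```

Even if this doesn't fix the bug, the next time it fires you get an exception with a depth and a reason instead of a hung process.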
If you can reproduce the bug, 99% of the problem is solved. I doubt I have spent more than a day fixing a bug that I could reliably trigger.
It is the non-deterministic bugs that drive me crazy. I have one bug where a call to a third-party library randomly fails, but only after the program has been running for days (no, it is not a memory leak). If I make a cut-down stub then the error never occurs, even after running for a week. My best guess is I am trashing memory somewhere, but under valgrind everything is fine. Argh!