These are the 3 questions I ask my team on non-deterministic errors:

- Can you reproduce it? (locally)

- No? Then can the customer reproduce it? (remotely)

- No? Then can you follow the flow byte-by-byte by just looking at the code? You should.

If you can reproduce it, great: you can most probably brute-force your way to the cause with local monkey-logging or step-by-step debugging.
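As a rough illustration of what local monkey-logging can look like, here is a minimal Python sketch. The `traced` decorator and the `parse_quantity` function are hypothetical names made up for this example; the idea is just to wrap a suspect function so that every call, return value and exception ends up in the log.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("trace")


def traced(func):
    """Log the arguments, return value and exceptions of every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("-> %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # log.exception records the full traceback, then we re-raise
            # so the behavior of the program does not change.
            log.exception("!! %s raised", func.__name__)
            raise
        log.debug("<- %s returned %r", func.__name__, result)
        return result
    return wrapper


@traced
def parse_quantity(text):
    # Hypothetical function under suspicion.
    return int(text.strip())
```

Calling `parse_quantity(" 42 ")` then logs the raw argument (whitespace included) and the returned value, which is often enough to spot the input that triggers the bug. Remove the decorator once the cause is found.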

If a customer can reproduce it then you may have a shot at remote debugging, injecting logging or requesting a dump of some sort. That's why it's important for an app to have good tools built-in so a customer can send back useful debug info.
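To make "good tools built-in" a bit more concrete, here is a minimal Python sketch of a debug report the app could write when something fails, so the customer only has to send back one file. The function names, the report fields and the `recent_events` list are assumptions for the sake of the example, not a prescribed format.

```python
import json
import platform
import sys
import traceback
from datetime import datetime, timezone


def build_debug_report(exc, recent_events):
    """Collect environment and failure details into a JSON-serializable dict."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        # Full traceback as a list of strings, one per formatted line group.
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        # Hypothetical: the last few things the app did before failing.
        "recent_events": list(recent_events),
    }


def save_debug_report(report, path):
    """Write the report to a file the customer can attach to a ticket."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
```

You would call `build_debug_report` from a top-level exception handler and save the result somewhere the customer can find it. Be careful not to put passwords or personal data into the report.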

If nobody can reproduce it, then try following the flow byte-by-byte, either mentally, with test cases, or with a combination of both. Here's a quick guide off the top of my head:

- determine where the black spots are: places where a variable, the stack, the heap etc. could hold unexpected data, where your assumptions could be wrong, or where your understanding of the language, library or any other technology supporting the logic is incomplete and calls for a reread of the manual.

- order your black spots by probability, starting with the most vulnerable code related to the bug (e.g., for that infinite loop bug, the recursive function tops the list of weak spots)

- now compare the bug symptoms against that vulnerable code to check whether there's a 100% match. That way you make sure all symptoms can be caused by the alleged culprit.

- do a negative symptom match as well: think of the other symptoms that fault would cause and make sure they were actually observed (e.g., the recursive function writes zeros to a file besides looping forever. Did that happen?)

- if there's more than one possible cause, apply Occam's razor: the simpler one, with the fewest assumptions, is most likely the cause, even if it looks unlikely at first.

- if there's still no possible explanation, start over with fewer moving parts.

- if a vulnerable fragment has been identified, but no concrete cause or solution found, rewrite the code for robustness, with plenty of assertions, complementary logging and clear error messages. This is good practice in general: every time you revisit code, it should come out cleaner and more robust than before.
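The last point can be sketched in Python using the recursive function from the infinite loop example. This is only an illustration under assumptions: `find_root`, the `parents` dict and the `MAX_DEPTH` limit are hypothetical, and the real fix depends on your code. The point is that the rewritten version fails loudly with a clear message instead of looping forever.

```python
import logging

log = logging.getLogger("resolver")

MAX_DEPTH = 100  # assumed sanity limit; pick one that fits your data


def find_root(node_id, parents, _depth=0, _seen=None):
    """Follow parent links up to the root, failing loudly instead of looping."""
    assert isinstance(parents, dict), (
        f"parents must be a dict, got {type(parents).__name__}"
    )
    seen = _seen if _seen is not None else set()

    if node_id in seen:
        # A cycle is exactly what would cause the infinite recursion.
        raise ValueError(
            f"cycle detected at node {node_id!r} after {_depth} steps"
        )
    if _depth > MAX_DEPTH:
        raise RecursionError(
            f"gave up at node {node_id!r}: deeper than {MAX_DEPTH} levels"
        )

    seen.add(node_id)
    log.debug("visiting node=%r depth=%d", node_id, _depth)

    parent = parents.get(node_id)
    if parent is None:
        return node_id  # no parent recorded, so this is the root
    return find_root(parent, parents, _depth + 1, seen)
```

With `parents = {"c": "b", "b": "a"}`, `find_root("c", parents)` returns `"a"`. With corrupted data such as `{"a": "b", "b": "a"}` it raises a `ValueError` naming the offending node, so the next bug report tells you what happened instead of just "the app hangs".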