Slow bugs

blrgeek · on Dec 31, 2014

Heisenbugs: when you test for them, they stop appearing. (Add a printf and, voilà, the race condition is gone.) I've personally experienced this one: a race condition/deadlock between three different threads. One mutex was inside the debug function :)

http://www.catb.org/jargon/html/H/heisenbug.html

Schrödinbug: The code should never have worked in the first place, but did. Once you read the code and realize it shouldn't work, it stops working.

http://www.catb.org/jargon/html/S/schroedinbug.html

pavlov · on Dec 31, 2014

The "interstellarbug"... At the center of the entire system is an enormous black hole of collapsed mystery code. It is unimaginably dense with unknown, unmeasurable bugs and virtual machines whose purpose cannot be deduced from the outside. Its API horizon must never be crossed.

A working subsystem orbits the black hole and somehow manages to perform its operations, but it's extremely hazardous to visit even that subsystem: due to the enormous gravitation of the black hole bugs, every hour you spend working on the code makes you miss five years of your real life.

coderjames · on Dec 31, 2014

Haha, that's great! We have that situation in our main product line where I work.

At the center of the system is 30-year-old Z8000 assembly code that gets run through a translator during the build process to produce C code, which gets compiled and linked into the (PowerPC) executable.

There aren't that many folks left that know Z8K assembly, so not many people can begin to wade into that code and fix long-standing bugs. Everyone else builds wrappers so they don't have to actually get too close to the translated assembly code. But since they don't quite understand the code they're avoiding, these wrappers tend to be fragile and prone to odd corner-case bugs themselves.

davidgerard · on Jan 1, 2015

Chernobylbug: the dangerous code is encased in a concrete sarcophagus and there's a 30km exclusion zone. Anyone who goes into the code, dies.

sillysaurus3 · on Dec 31, 2014

One effective way of dealing with a slow bug (or "heisenbug") is to rewrite the system. That sounds crazy, and perhaps it is crazy in a company environment. But when I was first learning to program, I kept introducing heisenbugs once every couple weeks or so. After spending 30 minutes trying to debug it, simply deleting the entire module and rewriting it was unreasonably effective. The total time investment was about 60 minutes.

I haven't encountered a heisenbug in a long time though. Maybe years. I'd like to think heisenbugs are inversely proportional to skill, but maybe it's just luck.

fest · on Dec 31, 2014

While that could work for small software systems (or small modules which you can rewrite), it's simply not a viable solution for hardware bugs (like the one GP described) or for large software projects, which are mostly in a state of perpetual mess with a horrible dependency graph.

This actually reminds me of a type of bug every electronics novice has seen: a circuit does not work unless you touch it with your finger (or even place your finger next to the circuit). Some of these bugs you can solve by rewriting the software (enabling an internal pull-up/pull-down resistor), but others you can't solve without a soldering iron (a missing GND connection).

rectang · on Dec 31, 2014

> I'd like to think heisenbugs are inversely proportional to skill, but maybe it's just luck.

Definitely some fraction of heisenbugs can be avoided through defensive programming techniques and adherence to best practices.

For example, avoiding mutable global variables reduces the potential for unexpected coupling between modules when they are combined into a larger system. Similarly, limiting concurrency to the smallest possible surface area of the codebase helps a system withstand increasingly chaotic real-world input.

Aggressive injection of randomness into unit tests (using proper random seeding) can also help to detect potential heisenbugs early.
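A minimal sketch of the seeded-randomness idea, in Python. Everything in it is hypothetical: `merge_sorted` merely stands in for some code under test, and `RANDOM_TEST_SEED` is an assumed environment variable name. The only point is that the seed is chosen fresh on each run but always reported, so a failure that shows up once can be replayed.

```python
import os
import random


def merge_sorted(a, b):
    """Hypothetical function under test: merge two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def run_randomized_test(seed, rounds=200):
    """Run `rounds` random cases derived from one seed; return the seed on success."""
    rng = random.Random(seed)  # private generator, no shared global state
    for _ in range(rounds):
        a = sorted(rng.randint(-50, 50) for _ in range(rng.randint(0, 20)))
        b = sorted(rng.randint(-50, 50) for _ in range(rng.randint(0, 20)))
        got = merge_sorted(a, b)
        # Put the seed in the failure message so the exact run can be replayed.
        assert got == sorted(a + b), f"failed with seed={seed}, a={a}, b={b}"
    return seed


def pick_seed():
    """Use RANDOM_TEST_SEED (assumed name) if set, otherwise a fresh random seed."""
    env = os.environ.get("RANDOM_TEST_SEED")
    if env is not None:
        return int(env)
    return random.SystemRandom().randrange(2**32)


if __name__ == "__main__":
    seed = pick_seed()
    print(f"random test seed: {seed}")
    run_randomized_test(seed)
```

To replay a failure, set the environment variable to the seed printed by the failing run.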

gbrown · on Dec 31, 2014

Your probability calculations are... non-standard. Although you get the right answer, it's not correct to treat a discrete probability distribution as if it were continuous. A more standard approach would be to pick an appropriate time unit, calculate the probability of the event occurring in a single unit, and apply a geometric distribution.

Edit: you also seem to be mixing up the time to event with the number of events in an hour. Did you actually use your derivation to get to your time estimates, or did you do a simulation?
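For what it's worth, the geometric-distribution approach can be sketched in a few lines of Python. This is only an illustration under assumptions: the 1% per-run probability is made up, not taken from the article, and the function names are mine.

```python
import math


def prob_seen_within(p, n):
    """P(bug appears at least once in n independent runs), per-run probability p."""
    return 1.0 - (1.0 - p) ** n


def expected_runs(p):
    """Mean of the geometric distribution: expected runs until the first occurrence."""
    return 1.0 / p


def runs_for_confidence(p, confidence):
    """Smallest n for which prob_seen_within(p, n) reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 0.01  # assumed: the bug triggers in 1% of one-minute test runs
    print("expected runs until first occurrence:", expected_runs(p))
    for confidence in (0.5, 0.9, 0.95, 0.99):
        n = runs_for_confidence(p, confidence)
        print(f"{confidence:.0%} chance of seeing it within {n} runs")
```

The same formula read backwards gives a stopping rule: if a fix survives `runs_for_confidence(p, 0.95)` clean runs, a bug with the old per-run probability would have shown up with 95% probability.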

eterm · on Dec 31, 2014

The probability density graph annoyed me! If the bug is as likely to be found in the first minute as it is during the second, given that it wasn't found during the first, then it's actually most likely to be found early, and the density would look like the third graph. That doesn't require it to be a startup bug, just a bug driven by randomness, so that it is a Poisson process.

A flat density would suggest a limit to how long it would take, since the area cannot be infinite.
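Spelled out as a sketch (the symbols are my own: a constant discovery rate \lambda, waiting time T, and a flat density of height c on [0, T_max]), the two claims above are:

```latex
% Waiting time T of a Poisson process with assumed constant rate \lambda > 0:
% the density is highest at t = 0 and decays from there.
f(t) = \lambda e^{-\lambda t}, \qquad t \ge 0

% Memorylessness: the chance of surviving another t is the same
% no matter how long we have already waited.
P(T > s + t \mid T > s) = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T > t)

% A flat density of height c must still integrate to 1,
% which forces a finite upper limit on the waiting time.
\int_0^{T_{\max}} c \, dt = c \, T_{\max} = 1 \quad\Longrightarrow\quad T_{\max} = \frac{1}{c}
```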

kabouseng · on Dec 31, 2014

Hi, my apologies if it isn't clear.

Another way to look at it is with a dice roll analogy. Every time you roll a die, the probability of getting a certain number is 1/6. On the next roll the probability is the same, regardless of how many times you rolled the die before. That is what the first graph is supposed to illustrate.

Getting a certain number over successive throws is a different question, and that is why you calculate the cumulative probability density function, or graph. That is indeed the 4th graph.
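A small simulation, as a sketch of the dice analogy only (the trial count, the seed and the function names are arbitrary choices of mine): the per-roll chance stays near 1/6 however long you have been waiting, while the chance of having seen the number at least once keeps climbing.

```python
import random


def first_hit_rolls(trials, target=6, seed=0):
    """For each trial, roll a die until `target` appears; record the roll number."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        n = 1
        while rng.randint(1, 6) != target:
            n += 1
        results.append(n)
    return results


def conditional_hit_rate(first_hits, k):
    """Among trials with no hit before roll k, the fraction that hit on roll k."""
    still_waiting = sum(1 for n in first_hits if n >= k)
    hit_now = sum(1 for n in first_hits if n == k)
    return hit_now / still_waiting


def cumulative_hit_rate(first_hits, k):
    """Fraction of trials that hit on or before roll k."""
    return sum(1 for n in first_hits if n <= k) / len(first_hits)


if __name__ == "__main__":
    hits = first_hits_sample = first_hit_rolls(60000)
    for k in (1, 2, 5, 10):
        print(
            f"roll {k:2d}: per-roll {conditional_hit_rate(hits, k):.3f}, "
            f"cumulative {cumulative_hit_rate(hits, k):.3f}"
        )
```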

Any suggestions on how I can make it clearer or more intuitive, or indeed if I have made a mistake?

eterm · on Dec 31, 2014

Yeah I think I understood the intention. In that case it's a plot of conditional probability rather than probability density.

I wouldn't change the article, it's clear enough and it's not like they're accurately plotted graphs, just sketches to get your idea across.

Edit: By the way, there is a distribution that describes something similar to a Poisson process but with a changing rate. It is typically used for failure rate analysis (time-before-failure modelling) but could also be used here to describe time before bug discovery: http://en.wikipedia.org/wiki/Weibull_distribution
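As a rough sketch of what the changing rate means, here is the standard Weibull hazard (instantaneous failure rate) and CDF in plain Python. The shape and scale values are illustrative only, and mapping them to kinds of bugs is my own reading rather than anything from the article.

```python
import math


def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate at time t > 0: (k / s) * (t / s) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_cdf(t, shape, scale=1.0):
    """Probability that the failure (or bug discovery) has happened by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


if __name__ == "__main__":
    # shape < 1: rate falls over time (startup-style bugs)
    # shape = 1: constant rate (the memoryless, Poisson-process case)
    # shape > 1: rate rises over time (e.g. something slowly leaking)
    for shape in (0.5, 1.0, 2.0):
        rates = [round(weibull_hazard(t, shape), 3) for t in (0.5, 1.0, 2.0)]
        print(f"shape={shape}: hazard at t=0.5, 1, 2 -> {rates}")
```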

gbrown · on Dec 31, 2014

He's also mixing up discrete and continuous probability distributions. He got pretty much the right answer, but as a statistician it kind of hurts to look at.

kabouseng · on Dec 31, 2014

Thanks, I see I did call them density functions, which is indeed wrong. I'll change the wording.

ajb · on Dec 31, 2014

An interesting exercise. Why do you think an increasing probability distribution makes this equivalent to the halting problem, though?

Without loss of generality, an increasing probability is >= some P after some time T. Being conservative, we can assume the failure probability is zero before T and ==P after T. Of course, we don't actually know P or T, but they must be consistent with us having seen the bug at all. So we can use uninformative priors to define the posterior distribution of these values given the number of times we have seen the bug and the amount of time required in each case. Unless I'm missing something?

I once wrote a probabilistic version of git bisect (https://github.com/Ealdwulf/bbchop) for use with intermittent bugs, but it hasn't seen real use.

snake_plissken · on Dec 31, 2014

I'm chasing something like this right now that started very recently. We have a handful of telematics devices out in the field that will just randomly stop sending information: no power disconnect events, no lost network connectivity events, nothing to indicate there are any issues. One day a device is fine, then the next it isn't. What makes it even more aggravating is that some of them randomly come back online and send over all of the missing data from the period when they went dark, so they apparently still functioned as intended during the interim.

At the moment I have no idea how to reproduce the issue, but I have a few theories as to why it might happen: lithium-ion backup batteries finally going bad, cold weather affecting the batteries, or an incompatibility between the device configuration code and its current firmware revision.

ilitirit · on Dec 31, 2014

A few weeks ago I encountered a very annoying-to-track-down "pseudo" Heisenbug in a mobile application I'm working on.

Once you log on to the application, it downloads product data (ID, description, quantity) for sales reps for that day. To cut a long story short, I'd just finished implementing a new piece of functionality and I logged on to the app to test it. The application started behaving very unexpectedly. I double-checked my code, but couldn't see anything that would make it act the way it did. Then I tried logging on as a different user, et voilà! It worked! Tried the old user and the problem occurred again. Tried a 3rd user and I got different behaviour again, but not the same as with the original user I was testing with. I thought it had to be a data issue, so I uninstalled the application, cleared the cache and app data, and reinstalled, and the bug was gone. Whatever I tried, I could not replicate the problem. I shrugged it off as local database corruption and continued working.

A few days later, the same thing happened. Same symptoms, but this time the bug disappeared without me even having to reinstall the app. After a few hours of frustrating debugging and source code reviews, I remembered something. I ran into the problem at the same time of the morning as I did before - around 12am. It turns out that the process that populates the server with daily product information that usually ran at 11pm was rescheduled to run at 12am (we weren't informed about this change). It ran for 5 to 8 mins. At the time I logged on, the data for the first user was not available yet, but the data for the second user was. To complicate things, if the server can't find daily product information for that day, it sends data from the previous week (the products are essentially the same, but the quantities may be slightly different and the IDs are tied to older stock batches). But the user I was testing with didn't exist 7 days ago so he didn't have data to fall back on. The 3rd user's daily product data also wasn't available yet, but he did exist 7 days ago so he received data with product IDs that I did not expect.
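The fallback rule is easier to see in code. This is a hypothetical Python sketch, not the real server logic: the dict-based `store`, the user names, the product rows and the example date are all made up for illustration.

```python
from datetime import date, timedelta


def select_product_data(store, user, today):
    """Return (source, rows) following the fallback rule described above.

    `store` is a hypothetical dict mapping (user, date) -> list of product rows.
    """
    rows = store.get((user, today))
    if rows is not None:
        return "today", rows
    # No data for today yet: fall back to the same user's data from a week ago.
    rows = store.get((user, today - timedelta(days=7)))
    if rows is not None:
        return "last_week", rows  # older product IDs, slightly different quantities
    # New user with nothing to fall back on.
    return "missing", []


if __name__ == "__main__":
    today = date(2014, 12, 31)  # arbitrary example date
    week_ago = today - timedelta(days=7)
    # Snapshot while the nightly job is still running: only user2 is populated.
    store = {
        ("user2", today): [("P-200", "widget", 5)],
        ("user3", week_ago): [("P-100", "widget", 4)],
    }
    for user in ("user1", "user2", "user3"):
        source, rows = select_product_data(store, user, today)
        print(user, source, rows)
```

Three users logging on in the same few minutes end up on three different code paths, which matches the three different behaviours I saw.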

In the end I "fixed" the problem using a scheduled task that emailed tech support if the product data was not in our databases by 11:30pm.

julie1 · on Dec 31, 2014

FSMs are analogous to cellular automata, and since this is nonlinear algebra, it is impossible to predict a finite sequence. Most of the time FSMs do have a regular basin of attraction with finite sequences, but there are chaotic evolutions that may result in finite (and longer than usual) sequences and in infinite sequences without repetitions (given an infinite playground or CBP does the job).

And we put a lot of these chaotic systems in our software design.

The definition of a complex system is: a lot of simple systems interacting with one another. The physics and math of this type of problem conclude that it is pretty much a nonlinear, non-Euclidean problem.

So far, what we know about these beasts is that they are unpredictable, BUT they are robust around their equilibrium if not perturbed too much.

The beast's strength is its network of connections, which is also its weakness.

If a certain perturbation happens, it can bring down everything (the butterfly effect). It happens with a frequency and odds we can't predict. And then it behaves in a way we can't predict.

So it is basically the ground of the internet: a non deterministic system used the same way a deterministic system is. What can go wrong?

We can't predict it? Then it won't happen anyway, so let's go back to coding.

MichaelCrawford · on Jan 2, 2015

The author mentioned bugs that are more likely to occur sooner rather than later, such as startup bugs (errors in the boot code or kernel initialization).

One can also get shutdown bugs. If you like to brag about how long your server stays up, you'll never see them, but if you shut down your box at the end of your workday, you're likely to.

I myself isolated a bug in the Classic Mac OS 7.5.2 (or maybe it was 7.5.3) Open Transport Ethernet shutdown procedure. The way I found the bug was to write an AppleScript that consisted of "Tell Finder Restart", then placed that in the Startup Items folder.

I very quickly found that the bug occurred on networked computers, but not on those which were not connected.

For *NIX you could test the kernel's startup and shutdown by having an init script that did "shutdown -r now".

I don't ever hear about others testing this way.

I didn't invent this; it was used by another department in my building to test development builds of System 7. They had a couple hundred Macs, many of which would do nothing but reboot 24/7.