History's Worst Software Bugs (2005)

scott_s · on Aug 1, 2015

The Morris worm is on that list; Robert Morris is a partner at Y Combinator. That incident is the source of the pg quote that would not stop kicking around in my mind through the latter half of grad school:

"The danger with grad school is that you don't see the scary part upfront. PhD programs start out as college part 2, with several years of classes. So by the time you face the horror of writing a dissertation, you're already several years in. If you quit now, you'll be a grad-school dropout, and you probably won't like that idea. When Robert got kicked out of grad school for writing the Internet worm of 1988, I envied him enormously for finding a way out without the stigma of failure."

From http://www.paulgraham.com/college.html

Morris is now also a tenured MIT professor, so things ended up okay for him.

lordnacho · on Aug 1, 2015

Seem to me this list needs to incorporate how easily these bugs could have been avoided/detected/fixed, rather than just how dire the consequences were. It doesn't say much about what people did to test their code. For instance the first one in the list is something unit testing would have fixed. Take the trajectory function, plug numbers in, see if it's correct.

Some of these things were a lot more obvious than others.

Race conditions, for example, can be really hard to find, but as long as you know it might happen (these days it's just about every system) you can take precautions for testing. If it's important, maybe hire someone with experience.

The AT&T network crash thing looks pretty unobvious to me. A network graph can have a huge number of topologies, so you can't really test them all. Machines might also be using different versions of software that don't interact nicely. Sounds like they took sensible precautions and were thus able to roll back. That's why "rollback" is a word.

There's a whole class of bugs where things work and then need to be upgraded. You think it will work, because there aren't many changes and stuff is qualitatively the same. Like the number overflow bug in the Ariadne, or the buffer overflow in the finger daemon.

Retric · on Aug 1, 2015

Unit tests would be highly unlikely to catch most of those.

"a formula written on paper in pencil was improperly transcribed", "neglect to properly "seed" the program's random number generator, A HW bug that's not close to obvious numbers to check, intentionally inserted bugs, input outside of the intended design, etc.

zalzane · on Aug 1, 2015

>"a formula written on paper in pencil was improperly transcribed"

offtopic, but a unit type would have prevented that. i had no idea how many errors i was making in my math programs before i started using F#'s type checker to make sure all the types lined up properly.

GFK_of_xmaspast · on Aug 1, 2015

I don't know the actual transcription error, but how's it going to find a 5 being made a 6 or something?

zalzane · on Aug 1, 2015

it wont, but the vast majority of errors with math formulas are along the lines of adding velocities with positions, raising something to the wrong power, using a multiply instead of an add, putting a parenthesis in the wrong spot, performing equations in the wrong order, etc.

all of those can get caught with type checking, but it isn't perfect

HeyLaughingBoy · on Aug 2, 2015

Correct. A unit test is a defect removal mechanism. What these faults needed was a defect prevention mechanism. One of those mechanisms is Design/Code reviews.

With all the emphasis on testing and TDD, etc, I get the feeling that reviews are getting the shaft. They are both important, for different reasons.

jschwartzi · on Aug 1, 2015

The AT&T network crash bug was caused by a well-formed message coming from a crashed system, so that would have been caught by unit testing too.

trsohmers · on Aug 1, 2015

The non hardware based floating point bugs would not be an issue if using a variable precision floating point format, such as the in development Unum (previous HN discussion: https://news.ycombinator.com/item?id=9943589)

trsohmers · on Aug 1, 2015

Another thing which should be in this list (relating to floating point rounding error):

"On 25 February 1991, a loss of significance in a MIM-104 Patriot missile battery prevented it intercepting an incoming Scud missile in Dhahran, Saudi Arabia, contributing to the death of 28 soldiers from the U.S. Army's 14th Quartermaster Detachment."

https://en.wikipedia.org/wiki/Floating_point#Incidents

luso_brazilian · on Aug 1, 2015

No mention to Y2K and mankind can thank the millions of man/hours employed (and regally paid) to stamp out the majority of the occurrences of that bug.

It could really be a game changer if it didn't get fixed and I don't really know what expect in the wake of Y2K38 because it's about there, lurking in waiting.

BinaryIdiot · on Aug 1, 2015

> I don't really know what expect in the wake of Y2K38 because it's about there, lurking in waiting.

I've been wondering the same. The Y2K bug was easy for many places to fix. Granted I wasn't a profressional developer at that time but I've looked at historical fixes at the companies I have worked at and all of their solutions were pretty easy (change application code to use 4 numbers instead of 2, run SQL update script to update existing data, done). But the 2038 bug? That one isn't near as obvious to fix in my opinion.

perlgeek · on Aug 1, 2015

The obvious fix is to use a 64bit integer to hold timestamps.

luso_brazilian · on Aug 1, 2015

That's the fix, of course, but what about all the embedded software that will last enough to cross that barrier but that won't be upgraded from its 32bits timestamps?

pavel_lishin · on Aug 1, 2015

You'd have to find it, too. How many companies have manufactured devices with embedded software that have gone out of business, devices for which no manuals exist anymore, etc?

OliverJones · on Aug 2, 2015

FWIW, the year 2027 has the same weekdays as 2038. If the worst comes to the worst, setting the clocks back 11 years on unremediated systems has a chance of allowing them to keep working.

CookWithMe · on Aug 1, 2015

The Soviet Gas Pipeline explosion - if the whole CIA story is true at all - should not be labelled a bug... The code allegedly did exactly what it's creator intended ;-)

ksk · on Aug 1, 2015

Well, typically the users decide what is and isn't a bug. The developers can always say "I intended it to do this". ;)

rorykoehler · on Aug 2, 2015

Seems to be a popular tactic. Just like Stuxnet too.

rer0tsaz · on Aug 1, 2015

> Programmers respond by attempting to stamp out the gets() function in working code, but they refuse to remove it from the C programming language's standard input/output library, where it remains to this day.

gets was deprecated in C99 and removed in C11.

rgbrenner · on Aug 1, 2015

C11 is about 6 years newer than the article. So the article was accurate at the time it was written.

rlongstaff · on Aug 1, 2015

... but what percentage of current C code is C11. More importantly, how much new code is written according to C11?

Gibbon1 · on Aug 1, 2015

But gets() isn't just one unsafe function all of the classic string functions are totally unsafe, and most of their safer replacements are similarly bad. They do things like take buffer size and then will truncate strings and leave off the terminating zero. So then the next string function will blow up.

I think really when people manipulate strings in c/c++ they use the safe functions that come with frameworks.

TillE · on Aug 1, 2015

The title said "software", so I assumed they were going to exclude the infamous Pentium FPU bug. But no, there it is.

To me, the interesting thing about testing a CPU is that it's theoretically possible to comprehensively test all inputs and outputs, but the time required makes that totally impossible.

trsohmers · on Aug 1, 2015

Not so much anymore... there has been a ton of work put in by the EDA companies to get companies to do formal verification (which they obviously sell very expensive tools for) even before you get to physical design and testing.

For the chip my team is designing, we are formally verifying our ISA using a new domain specific language (http://www.cl.cam.ac.uk/~acjf3/l3/) which really helps lock down the "gold model" which all our other tests (Our cycle accurate C++ model, our RTL (verilog) model, and eventually the physical simulation) need to live up to.

As far as the tools provided by EDA companies, they have a ton of standard verification tools that have actually gotten a lot better and faster since the 90s, but best of all there are things like Cadence's Palladium (http://www.cadence.com/products/sd/palladium_series/pages/de...) which is basically a super FPGA like device which isbuilt specifically for verifying functionality of your circuits... while a FPGA is to 100 to maybe 1000x faster than simulating RTL, Cadence claims Palladium is up to 1,000,000x faster than RTL simulation.

Anyways: Most chips done today (especially due to the advanced process nodes) require EXTENSIVE verification that is just as long, if not longer, than the design and implementation (though it occurs at the same time as part of the "flow").

nickpsecurity · on Aug 1, 2015

Exactly. You might like and keep handy this illustration from IBM's efforts. I think it nicely summarizes many of the tasks and issues in HW verification at various layers. At the least, it should give the impression to readers of how overwhelming the job can be without best-in-class tools. ;)

http://www.testandverification.com/DVClub/03_Jul_2014/DVClub...

I think we can do same for software, though. Just got to keep it simple, layered, and each layer building on one before it properly. I did it informally in a style that copied Wirth's Lilith work albeit special-purpose. Verisoft did quite a bit on full-stack for imperative. SAFE (crash-safe.org) is working on it for functional. I think a shortcut is to implement VLISP Scheme in hardware using hardware verification techniques along with previously verified I/O system. I've already seen LISP processors, VLISP for rigorous implementation, Shapiro made a security kernel, and the right hardware target can be reused for ML and Haskell code potentially. To counter hardware issues, run several in synch in same way as old Tandem NonStop architecture. Result should be flexible, fast enough for some workloads, enforce POLA, and have five 9's.

What you think of combining a verified LISP with hardware implementation as a time saver on goal to verification?

Note: Remember that, once we have that, building and verifying other toolchains is so much easier because we can work at high-level. Even highly-optimized systems such as yours could benefit from rigorously-verified systems maybe running same synthesis or checks overnight as a check against faster, possibly buggy implementations you use for iterations. Although, I mainly see them as a root-of-trust for other systems in network.

nickpsecurity · on Aug 1, 2015

Btw, thanks for the L3 tool. Might come in handy.

OliverJones · on Aug 2, 2015

1993 -- Intel Pentium floating point divide error.

Here's a joke from 1993. It's been a good year for Andy Grove, CEO of Intel. They've rolled out the Pentium and it's been a big success. So he walks into a bar and asks the bartender for a shot of 22-year old Glenmorangie Scotch to celebrate. The bartender puts the glass in front of him and says, "that's $20, sir."

Andy puts a twenty dollar bill on the counter, looks at it for a moment, and says "keep the change."

spacehome · on Aug 1, 2015

Seems more like a list of the software bugs with the most severe consequences.

engi_nerd · on Aug 2, 2015

So, what other measure of "worst" would you suggest?