Safe Systems from Unreliable Parts
written by Walter Bright
October 29, 2009
A recent article in Wired [1] caught my attention because it touches on an important subject. A medical “gamma knife” used to treat cancer patients had suffered a software glitch. The emergency shutoff switch had no effect, and the staff had to run in and extract the patient before he was killed.
How does one create a safe machine? The first thought is to make the machine perfect so that it cannot fail. Evidently, the designers of that gamma knife followed this principle, hoping their software was perfect and following up on imperfections with bug fixes.
The problem with that approach, of course, is that it is impossible to create a component that cannot fail. As programmers, we all know it is impossible to create bug-free software. Even if no expense were spared, perfection can only be approached asymptotically, at exponentially increasing cost. And even if the software were perfect, a hardware failure could corrupt it.
How can a safe system be created from flawed, unreliable parts, at a practical cost? I used to work for Boeing on flight-critical systems design, and received quite an education on this. Boeing is, of course, spectacularly successful at making an inherently unsafe activity, flying, astonishingly safe. They do this by acknowledging the principle:
Any component can fail, at any time, in the worst conceivable manner.
Now, what to do about that?
Any critical component or system must have a backup.
Let’s illustrate this with a bit of math. Suppose component A has a 10% failure rate, and we need to get that down to 1%. Improving the quality of the component by a factor of 10 will get us there, but at a cost explosion of ten times the price. But suppose we instead add a backup component B that also has a 10% failure rate. The odds of A and B both failing simultaneously are 10% of 10%, or 1%. That is achieved by a mere doubling of the cost instead of an order of magnitude increase. The failure rate can be pushed down further, to 0.1%, by adding another backup component C, at three times the cost instead of a hundred times.
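To make the arithmetic concrete, here is a minimal sketch (mine, not from the article) that multiplies the failure rates of independent redundant components; the numbers are just the illustrative 10% rates from above.

```c
#include <stdio.h>

/* Combined failure rate of n redundant components: the system only fails
 * if every component fails at once, so for independent components the
 * individual rates simply multiply. Independence is the crucial assumption. */
double combined_failure_rate(const double rates[], int n)
{
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= rates[i];
    return p;
}

int main(void)
{
    double ab[]  = { 0.10, 0.10 };        /* A with backup B         */
    double abc[] = { 0.10, 0.10, 0.10 };  /* A with backups B and C  */

    printf("A+B   fails: %g\n", combined_failure_rate(ab, 2));   /* 0.01  */
    printf("A+B+C fails: %g\n", combined_failure_rate(abc, 3));  /* 0.001 */
    return 0;
}
```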
In order for this to work, though, A and B (and C) must be completely independent of each other. A failure in one must not be able to propagate to the others, and the circumstances that cause one to fail must not also cause the others to fail. This independence is where the hard work comes in.
In the gamma knife example (I am inferring from the scanty information in the article), the emergency shutoff switch relied on the same software as the rest of the system, so when the software crashed, the failure propagated to the shutoff system, which then didn’t work either.
A shutoff system that was not coupled to failures in the software could be one that simply cut all power to the machine. The gamma radiation source would then be blocked automatically: the shields are held back by electromagnets while power is on, and drop into place the moment power is lost.
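The same decoupling idea has a software-visible analogue that embedded programmers may recognize: a hardware watchdog timer that removes power unless the primary software keeps proving it is alive. The article does not describe this mechanism; the sketch below is only an illustrative simulation of the pattern, with hypothetical names throughout.

```c
#include <stdio.h>
#include <stdbool.h>

/* Illustrative simulation only: an independent watchdog removes power
 * (which, per the fail-safe design above, blocks the radiation source)
 * unless the primary software periodically "pets" it. The watchdog
 * shares no code or state with the primary system beyond that signal. */

enum { WATCHDOG_TIMEOUT_TICKS = 3 };

static int  watchdog_counter = WATCHDOG_TIMEOUT_TICKS;
static bool beam_power       = true;

static void watchdog_tick(void)
{
    if (--watchdog_counter <= 0)
        beam_power = false;     /* fail safe: drop power, shields fall */
}

static void watchdog_pet(void)
{
    watchdog_counter = WATCHDOG_TIMEOUT_TICKS;
}

int main(void)
{
    for (int tick = 0; tick < 10; tick++) {
        bool software_healthy = tick < 5;   /* pretend the software hangs at tick 5 */

        if (software_healthy)
            watchdog_pet();                 /* primary proves it is alive */

        watchdog_tick();                    /* independent timer runs regardless */

        printf("tick %d: beam power %s\n", tick, beam_power ? "ON" : "OFF");
    }
    return 0;
}
```

The point is the same as with the electromagnets: the safe state is reached by removing power, not by asking the failed software to do something.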
Unfortunately, for the makers of the gamma knife, and in the resolutions of earlier problems with similar machines [3], the focus instead was on rolling out bug fixes and other attempts to make the software perfect. The comments on the article [1] show the same focus on trying to perfect the software. While it is worthwhile to make the software better and less buggy, that is not the path to a safe system, because it is impossible to make perfect software.
Conclusion
Reliance on writing perfect software, and rolling out bug fixes to correct any imperfections, is a fundamentally unsound and unsafe approach. A safe system is one with a backup or failsafe shutdown that is independent of, and completely decoupled from, the primary system. Redundancy is not enough; decoupling is required as well, to prevent a failure in one system from propagating to its backup.
In the next installment, I’ll talk about incorporating some of these ideas into software to improve the reliability of that software.
References
1. ‘Known Software Bug’ Disrupts Brain-Tumor Zapping
2. History’s Worst Software Bugs
3. An Investigation of the Therac-25 Accidents
Acknowledgements
Thanks to Bartosz Milewski, Brad Roberts, David Held, and Jason House for reviewing this.