Another Boeing Software “Glitch”

How I hate the word “glitch,” which is commonly used to describe faulty software in press reports, blogs, and the like. In my opinion, it trivializes serious software errors.

So, when the word “glitch” showed up on the front page of the January 18-19, 2020 Wall Street Journal, as in “Boeing Finds Another Software Problem: Glitch adds to string of technical issues delaying return of 737 MAX to service,” written by Andy Pasztor, I thought, “Here we go again.”

The Y2K issue was often referred to as a “glitch,” but in reality, it was a serious, multi-hundred-billion-dollar software issue that threatened to take down companies, government agencies, critical infrastructure, nations, and whatever else ran on legacy software that did not account for the century rollover.

Well, the Boeing problem is not a trivial one either, and the WSJ article makes clear that it is a severe problem having to do with booting up the aircraft’s flight-control systems. But that isn’t the focus of this column. The focus is on the software assurance methodology described as being used by Boeing. As the article states:

“The software problem occurred as engineers were loading updated software … into the flight-control computers of a test aircraft … A software function intended to monitor the power-up process didn’t operate correctly … resulting in the entire computer system crashing. Previously, proposed software fixes had been tested primarily in ground-based simulators, where no power-up problems arose …”

It was when the software was tested in a real-world aircraft that the issue became painfully apparent.

Quite early in my career, I experienced a somewhat analogous situation (though without the human-safety risk) when installing the first digital trader telephone turret in the Eastern U.S. Traders used these turrets primarily to be able to talk to other traders instantaneously (even faster than via auto-dial), the idea being that making contact as quickly as possible favored traders with this capability.

While the turret system worked perfectly in the lab, our field installation kept crashing. The vendor could not account for this happening over the course of several weeks, much to the consternation of our traders, senior management, my staff and me. Eventually the root cause was ascertained. It turned out that in the lab the data cables were carefully installed, ran only short distances, and were noise-free. Our field installation was bigger and engineered to industry standards, which were not as precise as those achieved in the lab. When the system that we installed experienced noise on the line, the software switched to an error-handling routine that should have kept the system up and running despite intermittent noise. However, because the vendor’s engineers had never experienced noise in the lab environment, they never got to test that routine. And (wouldn’t you just know it?) there was a “bug” in that routine, which caused the system to crash repeatedly.
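The failure mode can be sketched in a few lines of code. This is a hypothetical illustration (all names and the toy checksum are invented, not the vendor’s actual software): a receive path with a separate noise-recovery routine. In a noise-free lab, only the happy path ever runs, so the recovery routine is exactly where a latent bug can hide; a test suite has to inject noise deliberately to exercise it.

```python
def parity(payload: bytes) -> int:
    """XOR parity byte over the payload (toy checksum for illustration)."""
    p = 0
    for b in payload:
        p ^= b
    return p

def decode_frame(frame: bytes) -> bytes:
    """Return the payload, falling back to noise recovery on a bad checksum."""
    payload, check = frame[:-1], frame[-1]
    if parity(payload) == check:
        return payload                  # happy path: all the lab ever saw
    return recover_from_noise(frame)    # field-only path: must be tested too

def recover_from_noise(frame: bytes) -> bytes:
    """Discard a corrupted frame and return an empty payload, keeping the system up."""
    # In the real incident, the equivalent routine contained a bug that
    # crashed the system; only deliberately feeding in noisy frames during
    # testing would have caught it before field deployment.
    return b""

clean = b"hi" + bytes([parity(b"hi")])  # well-formed frame
noisy = b"hi\xff"                       # corrupted checksum byte
assert decode_frame(clean) == b"hi"
assert decode_frame(noisy) == b""
```

The point of the sketch is that coverage of the happy path tells you nothing about the recovery path; fault injection, not just clean-environment testing, is what reaches the code that runs when things go wrong.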

The lesson here is that you can do all the testing that you want in the lab, but the ultimate tests are those that take place in the field. In various contexts, systems behave differently from when they are in the lab or other well-controlled environments. For security-critical and safety-critical systems, you have to test under all known potential conditions. I address many of these issues in greater detail in my book “Engineering Safe and Secure Software Systems” (Artech House).

As we increase the number and power of cyber-physical systems, especially in such areas as autonomous road vehicles, it is ever more important to test extensively, not only in the lab or in simulated environments, but also under real-world conditions. This is because you can never be absolutely sure that test environments truly duplicate the environments in which the systems will actually operate.

By the way, my March 2010 RSA Conference presentation, “Data Collection and Analysis Issues with Application Security Metrics,” specifically emphasized the importance of context when it comes to application security, as does my BlogInfoSec column “Putting Application Security into Context,” dated January 12, 2015 and available on the BlogInfoSec site. Context is everything when it comes to the safe and secure operation of applications software, and the sooner that reality is understood by software designers, developers and testers, the better off we all will be.

Coincidentally, MIT professor Nancy Leveson just published a must-read article: “Inside Risks: Are You Sure Your Software Will Not Kill Anyone? Using software to control potentially unsafe systems requires the use of new software and system engineering approaches.” The article, which appears in the February 2020 issue of Communications of the ACM, addresses a number of the situations mentioned above plus others, as well as giving several real-world examples of where lab or simulation testing did not properly account for certain environments encountered in the field.
