Computer software, we’re told, now owns the world. Virtually everything contains it, from cars to toasters. And, as we all know only too well, sometimes it goes very, very wrong, as it did with the Boeing 737 MAX 8 and its Maneuvering Characteristics Augmentation System (MCAS) software in the Ethiopian Airlines crash in March that killed 157 people.
Sometimes the issue is a bug in the program, sometimes it’s a design flaw, and sometimes it’s precipitated by a seemingly unrelated decision that causes software performing exactly as designed to do terrible things.
Bob Tapscott, chief designer and architect of the Jeppesen Aviation Database (JAD) that supplies flight directions to flight management systems, believes that a series of cascading failures caused the 737 MAX crashes. The problem started with the MCAS system and the sensors that provided its input, but, “In a nutshell, the man-machine interface design was a killer,” he said.
Added software engineer Trevor Sumner, CEO of Perch Interactive, in a Twitter thread, “Nowhere in here is there a software problem. The computers & software performed their jobs according to spec without error. The specification was just shitty. Now the quickest way for Boeing to solve this mess is to call up the software guys to come up with another band-aid.”
Boeing plans software update to address issue
Boeing’s statement noted that, while an official report was forthcoming, the company was proceeding with a software update “to provide additional layers of protection if the angle of attack (AOA) sensors provide erroneous data,” as well as with training materials requested by pilots. The software update would include:
- The flight control system will now compare inputs from both AOA sensors. If the sensors disagree by 5.5 degrees or more with the flaps retracted, MCAS will not activate. An indicator on the flight deck display will alert the pilots (a rough sketch of this check appears after the list).
- If MCAS is activated in non-normal conditions, it will only provide one input for each elevated AOA event. There are no known or envisioned failure conditions where MCAS will provide multiple inputs.
- MCAS can never command more stabilizer input than can be counteracted by the flight crew pulling back on the column. The pilots will continue to always have the ability to override MCAS and manually control the airplane.
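That first change is essentially a cross-check between two redundant sensors. The sketch below is only an illustration of the logic Boeing describes – the function names, structure, and alerting stand-in are assumptions for the sake of the example, not Boeing’s actual flight-control code:

```python
# Illustrative sketch of the described AOA cross-check -- NOT Boeing's code.
# Assumes two AOA readings (in degrees) and a flap position are available.

AOA_DISAGREE_THRESHOLD_DEG = 5.5  # disagreement threshold Boeing describes


def alert_crew_aoa_disagree() -> None:
    # Stand-in for the flight-deck "AOA DISAGREE" indication.
    print("AOA DISAGREE")


def mcas_may_activate(aoa_left_deg: float, aoa_right_deg: float,
                      flaps_retracted: bool) -> bool:
    """Return True only if MCAS is permitted to activate under the new logic."""
    disagreement = abs(aoa_left_deg - aoa_right_deg)
    if flaps_retracted and disagreement >= AOA_DISAGREE_THRESHOLD_DEG:
        # Sensors disagree too much: inhibit MCAS and flag the crew.
        alert_crew_aoa_disagree()
        return False
    return True


# Example: a 6-degree split with flaps retracted inhibits activation.
print(mcas_may_activate(10.0, 16.0, flaps_retracted=True))  # False
```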
Regardless of the reasons for these failures, software quality assurance (QA) – the practice of testing programs to ensure they function as designed and don’t contain critical bugs – is coming under scrutiny. What is the right standard by which companies should test their software?
According to the Test Institute, which certifies software testers, there are two approaches. The defect management approach looks not only at the quality of the code but also at its adherence to the design specification. The quality attributes approach, on the other hand, focuses on six broad areas: functionality, reliability, usability, efficiency, maintainability, and portability. The Test Institute notes on its site: “The entire focus of Quality Assurance is on implementation of processes and procedures that are required for the verification of the software under development and the requirements of the client.”
However, in the case of the 737 MAX, there are other issues. “Broadly speaking, there is a fundamental issue I have seen from aviation to Wall Street. It is far more interesting, and typically financially more rewarding, to create software, be it for risk-management and financial derivatives products or for risk-mitigation and flight management systems, than it is to independently audit what others have created,” Tapscott noted. “As long as this remains the case, expect more problems. Even if the compensation issues are addressed, it takes a certain mindset to enjoy finding other people’s mistakes more than creating something new. So, be it as a result of a lack of funding or the right people, or cultural issues, much of these new 737 MAX jets was ‘self-certified’ by Boeing. I am sure that is an issue that will be seriously re-thought.”
Making software QA effective
Aja Hammerly, developer relations advocate at Google Cloud, worked in software QA for seven years before moving into other roles, and describes the processes required for effective software QA.
“The first thing you want to do is figure out what operational looks like, and figure out what the risks of things going wrong are, so that you can prioritize your testing based on high-risk areas first, usually, or at least ensure that your high-risk areas are sufficiently covered,” she said. “And those risks can depend on the system, which also means that fundamental QA is looking deeply at the system, not just the spec but the actual implementation.”
In her experience, most of the “interesting and potentially troublesome” issues are found in integration testing, where systems come together. Those systems may work perfectly on their own, but when they’re put together, bad things happen. Consider, for example, the $125 million Mars Climate Orbiter that was lost in 1999 because Lockheed Martin’s ground software reported thruster impulse data in pound-force seconds (English units) while NASA’s navigation software expected newton-seconds (metric), and nobody noticed. The JPL review panel that examined the loss concluded that it was an integration QA failure: the process that should have caught the mismatch went wrong.
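That failure mode is easy to reproduce in miniature. In the hypothetical sketch below – every name and number is invented for illustration and has nothing to do with the actual spacecraft software – each component is internally consistent and would pass its own unit tests; the bug only appears when their outputs are combined without an agreed-upon unit:

```python
# Hypothetical illustration of a unit mismatch at a system boundary.
# Each side is self-consistent; the error only shows up in integration.

LBF_S_PER_N_S = 0.224809  # 1 newton-second is about 0.224809 pound-force seconds


def thruster_impulse_lbf_s(burn_seconds: float) -> float:
    """Vendor component: reports impulse in pound-force seconds (English units)."""
    THRUST_LBF = 0.2  # hypothetical small thruster
    return THRUST_LBF * burn_seconds


def update_trajectory(impulse_n_s: float) -> float:
    """Navigation component: expects impulse in newton-seconds (metric)."""
    SPACECRAFT_MASS_KG = 338.0  # illustrative mass
    return impulse_n_s / SPACECRAFT_MASS_KG  # delta-v in metres per second


# Integration bug: a pound-force-second value is passed where newton-seconds
# are expected, silently understating the manoeuvre by a factor of about 4.45.
wrong_dv = update_trajectory(thruster_impulse_lbf_s(10.0))
right_dv = update_trajectory(thruster_impulse_lbf_s(10.0) / LBF_S_PER_N_S)
print(wrong_dv, right_dv)
```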
In the case of the 737 MAX, Tapscott said, “One decision Boeing made that seems very difficult to understand is why some safety features were an option. In my experience I never met anyone in the industry who felt that life-saving features might be optional. That violates an unwritten code, and to me is still shocking. No doubt it was bundled in with other, less flight-critical features, but it still seems difficult to understand.”
Most QA time is not spent on what Hammerly calls the “happy path” – the cases in which everything works properly. Instead, she said, 80 to 90 per cent of her time was spent on edge cases.
“All the other stuff is: ‘okay, well, what happens if this goes wrong? What happens if this goes wrong? What happens if these 19 things go wrong simultaneously?’ Because most folks at this point are actually really good at catching the one or two cases where something goes wrong, or two things going wrong simultaneously. Where we see errors is where we have three or four things,” she said. “Today, processes need to figure out what all those are and then prioritize them; I prefer prioritizing them based on risk.”
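Walking those combinations exhaustively is something a test harness can help with. As a rough sketch only – the fault names and the trivial system under test are invented for illustration, not drawn from any real project – a test can enumerate every combination of up to four simulated faults and assert that the system still degrades safely:

```python
# Illustrative sketch: enumerate combinations of simulated faults and check
# that the system under test still fails safely. Fault names are invented.
from itertools import combinations

FAULTS = ["sensor_dropout", "stale_data", "power_blip",
          "network_partition", "clock_skew", "disk_full"]


def system_under_test(active_faults: set) -> str:
    # Stand-in for the real system: here it degrades to "safe_mode"
    # whenever any fault is present.
    return "safe_mode" if active_faults else "normal"


def test_fault_combinations(max_simultaneous: int = 4) -> None:
    for k in range(1, max_simultaneous + 1):
        for combo in combinations(FAULTS, k):
            result = system_under_test(set(combo))
            assert result in ("normal", "safe_mode"), (
                f"unsafe behaviour with faults: {combo}")


if __name__ == "__main__":
    test_fault_combinations()
    print("all fault combinations handled")
```

In practice the combinations would be prioritized by risk rather than run exhaustively, which is the trade-off Hammerly describes.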
But, she noted, risk means different things to different industries. It could be financial risk, or it could be physical risk, or it could be something else. The Boeing issues clearly fall into the category of physical risk.
“What is also disconcerting about these events is the degree to which the FAA and Boeing seemed interdependent. Historically the FAA was by far the most cautious and prudent regulator and would have been the first to ground the airplane when unexplained fatalities in a new airliner occurred. In this instance, the FAA was the very last to do so. Clearly, the roles of the manufacturer and the regulator lacked the independence we all expect,” Tapscott said. “Where the man and the machine so strongly disagreed on the likelihood of a stall, the software could have asked for a second opinion, say from the GPS, which would have said that at 450 knots at that altitude a stall is most unlikely. Whatever the fix is, I expect it to be slow, given that this time the FAA will have concluded that self-certification is no longer an option.”
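Tapscott’s “second opinion” is, in effect, a plausibility check against an independent data source. The sketch below is hypothetical – the thresholds and names are invented, and it illustrates his suggestion rather than any actual or planned MCAS behaviour – but it shows the shape of such a check: an AOA-driven stall indication is only trusted if GPS-derived speed makes a stall physically plausible.

```python
# Hypothetical illustration of cross-checking an AOA-based stall indication
# against independent GPS data, as Tapscott suggests. Not actual avionics code.

STALL_AOA_DEG = 14.0          # illustrative AOA beyond which a stall is indicated
IMPLAUSIBLE_SPEED_KT = 350.0  # illustrative: a stall at this speed is unlikely


def stall_indication_plausible(aoa_deg: float,
                               gps_groundspeed_kt: float) -> bool:
    """Trust the AOA stall indication only if an independent source agrees."""
    aoa_says_stall = aoa_deg >= STALL_AOA_DEG
    if not aoa_says_stall:
        return False
    if gps_groundspeed_kt >= IMPLAUSIBLE_SPEED_KT:
        # Man and machine disagree, and the second opinion says the sensor
        # is probably wrong: flag it for the crew instead of acting on it.
        return False
    return True


# Example from the quote: roughly 450 knots, with an AOA sensor claiming a stall.
print(stall_indication_plausible(aoa_deg=20.0, gps_groundspeed_kt=450.0))  # False
```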
Boeing says the AOA Disagree alert wasn’t necessary for flight safety.