Not So Fastly

The system failure at Fastly on June 8, 2021 portends what may well be the greatest threat to the Internet and all that it supports. An Associated Press article by Marcio Jose Sanchez, published June 9, 2021 under the title “Tuesday’s Internet Outage Was Caused By One Customer Changing A Setting, Fastly Says,” disclosed that the outage, which barred access to websites such as The New York Times and The Guardian, was caused by a software bug that was triggered when a customer made a legitimate change to their settings. The article is available at NPR as “Fastly Says Internet Outage Was Caused By One Customer Changing A Setting.”

Having been there myself, I can sympathize with the Fastly folks who were responsible for testing the new release of the software guilty of taking down parts of Fastly’s infrastructure. Many new systems, and changes to existing production systems, are so complex and can be used in so many ways across such a multitude of environments that exhaustive testing, even when fully automated, will likely not cover all possible cases. The combinations and permutations yield potentially many millions of use cases, any one of which could create a problem condition; if you wish to test secondary and tertiary scenarios, there could literally be billions of possibilities. What I did in such circumstances was to suggest a sampling approach: test a random subset of cases and continue sampling as long as no errors are detected. One such system ran for months without a hitch; then one day an error showed up. To reach that part of the code, a user had to follow several specific (and unusual) steps, and somebody did!
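The combinatorial explosion described above is easy to quantify. A minimal Python sketch (all setting names and values are hypothetical, chosen only for illustration) shows how a handful of configuration options multiply into a large test space, and how a random-sampling strategy of the kind suggested might select a tractable subset:

```python
import itertools
import random

# Hypothetical configuration options for a CDN-like service.
# Each setting has only a few values, yet the combinations multiply fast.
settings = {
    "cache_ttl": [0, 60, 3600, 86400],
    "compression": ["none", "gzip", "brotli"],
    "tls_version": ["1.0", "1.2", "1.3"],
    "origin_shield": [True, False],
    "purge_mode": ["soft", "hard"],
    "region": ["us", "eu", "apac", "global"],
}

# Exhaustive testing means every combination of every setting.
total = 1
for values in settings.values():
    total *= len(values)
print(total)  # 4 * 3 * 3 * 2 * 2 * 4 = 576 primary cases

# Secondary scenarios (an ordered pair of configuration changes)
# square that figure; tertiary scenarios cube it.
print(total ** 2)  # 331,776 two-step scenarios

# Sampling approach: draw a random subset of configurations and keep
# testing until the budget is exhausted or an error is detected.
random.seed(42)  # fixed seed for a reproducible sample
keys = list(settings)
sample = [
    dict(zip(keys, combo))
    for combo in random.sample(list(itertools.product(*settings.values())), k=50)
]
print(len(sample))  # 50 sampled configurations to run
```

With only six settings the primary space is small, but each additional setting or sequencing step multiplies it; a real system with dozens of interacting options quickly reaches the millions and billions cited above, which is why sampling rather than exhaustion becomes the practical choice.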

It is actually quite surprising that errors of this type don’t happen more frequently, given the size and scope of modern software systems. But the sinister side of this is expressed in Sanchez’s article in the closing sentence: “… the [Fastly] incident highlighted how … much of the global internet is dependent on a handful of behind-the-scenes companies … that provide vital infrastructure, and it amplified concerns about how vulnerable they are to more serious disruption.”

My takeaway from this incident is validation of my ongoing, and growing, concern that a cyberpandemic might originate from an error in software or from nefarious acts intended not to damage the Internet but only to exploit it. After all, bad actors have little to no incentive to take down the Internet, since doing so would close the very conduit through which they are compensated.

As organizations become increasingly intertwined through dependence on third parties (complex supply chains, transfer of systems to the Cloud), the chances that an error or omission might run rampant across the Internet in a matter of seconds increase dramatically. There are certainly justifiable business reasons to outsource to third parties, but it is questionable whether decision-makers are taking into account the enhanced risk of such arrangements. It is only when something really bad happens that attention is paid to the problem, as illustrated by the current pandemonium over ransomware.

The more we depend on third parties, the less resilient our systems become. This is demonstrated time and again by the expanding scope of cyber exploits, malfunctions and failures. It is next to impossible to force entities to disconnect from the Internet, or to slow the release of new software versions, with their great new features, in order to accommodate more testing, but there may not be a choice at some point. It is far better to anticipate and prepare than to react and respond.
