As the recent reporting demonstrates, an outage in cloud computing does in fact let the sun shine through … through to the realization that the cloud is not totally reliable and resilient. The recent experiences with Amazon and Google serve to illustrate this, although there have been quite a number of other significant prior incidents.
A good overview of the cause and impact of the Amazon outage is given on page 31 of the May 16, 2011 issue of InformationWeek in the form of a sidebar by Charles Babcock with the title “When Amazon’s Cloud Turned On Itself.” The writer was particularly surprised that Amazon’s cloud services were in fact susceptible to human error, since he presumed that “clearly obvious errors had been anticipated, with defenses in place, automated checks….” Babcock suggests what should be done to avoid this particular problem in the future and so maintain faith in cloud computing. But, even if those changes were implemented, would such faith be justified?
When it comes to complex systems and networks, it is apparent that not only are some errors and bugs not readily anticipated, but their resolution will not have been fully incorporated into automated systems. This is because human beings, who design, implement and operate those automated controls, will always be fallible to the extent that they cannot predict every possible negative outcome. The question at hand is whether they should have expanded the testing of the controls to account for additional scenarios. This is the usual trade-off of the costs and time delays of such testing versus the reduction in the risk of failure, which I have discussed many times previously.
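The testing-versus-risk trade-off mentioned above can be framed as minimizing expected total cost: money spent on testing plus the residual probability of failure times the cost of an outage. The sketch below illustrates that framing; the function, the diminishing-returns model, and every dollar figure and probability are hypothetical assumptions for illustration, not numbers from the incidents discussed here.

```python
# Illustrative sketch of the testing-cost vs. failure-risk trade-off.
# All parameters below are hypothetical assumptions, not real outage data.

def expected_cost(test_hours, hourly_test_cost, base_failure_prob,
                  risk_reduction_per_hour, outage_cost):
    """Expected total cost = testing spend + residual failure risk * outage impact.

    Assumes each hour of testing removes a fixed fraction of the remaining
    failure probability (diminishing returns), which is a modeling choice,
    not an empirical law.
    """
    residual_prob = base_failure_prob * (1 - risk_reduction_per_hour) ** test_hours
    return test_hours * hourly_test_cost + residual_prob * outage_cost

# Compare a few testing budgets for a hypothetical control.
for hours in (0, 40, 80, 160):
    cost = expected_cost(hours,
                         hourly_test_cost=200,           # assumed $/hour of testing
                         base_failure_prob=0.05,         # assumed chance of a major outage
                         risk_reduction_per_hour=0.02,   # assumed testing effectiveness
                         outage_cost=2_000_000)          # assumed impact of an outage, $
    print(f"{hours:3d} test hours -> expected cost ${cost:,.0f}")
```

Under these assumed numbers the expected cost falls steeply at first and then flattens, which is the point of the trade-off: beyond some level, additional testing hours cost more than the risk they remove, and no finite budget drives the residual risk to zero.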