Amazon’s Outage: Architecting for Failure in the Cloud


The uproar surrounding the partial outage of Amazon’s EC2 cloud services platform got some new life last Friday, when the company released a detailed post-mortem of the incident. The summary includes a surprising level of detail about the root cause of the outage, information on a service credit for impacted customers, and, finally, an apology from Amazon.

To recap: on April 21st, Amazon made a configuration change during a network upgrade that caused a cascading series of events that resulted in what it calls a “re-mirroring storm.” As a consequence, Amazon’s storage service was essentially “stuck” and unable to locate new storage space for either new or existing customers. This led to a significant period of degraded functionality and downtime for major Web 2.0 sites such as Reddit, Foursquare and HootSuite, as well as a plethora of bad PR for Amazon, a cloud computing pioneer.

Interestingly, though, some other sites were only moderately affected, or not affected at all, most notably Netflix. Why did some sites sink, while others sailed through the storm? As with any complex system, there is no one answer. However, organizations that “architected for failure” tended to fare better than those that did not. Netflix, which recently released its lessons learned from the outage, built its platform around the assumption that services and/or zones within EC2 could be unavailable for extended periods of time.

Clearly, not every EC2 customer has the technical chops to build Netflix-like applications, and Amazon needs to make it easier to increase redundancy by taking advantage of multiple availability zones. However, this is a public cloud platform, and customers that did not take full advantage of Amazon’s redundant architecture, or that did not create their own replicated solutions, ended up paying the price.
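At its simplest, “architecting for failure” means assuming any single availability zone can disappear and falling back to another one rather than failing outright. The sketch below illustrates that pattern; the zone names and fetch functions are hypothetical stand-ins, not Amazon’s API.

```python
class ZoneUnavailableError(Exception):
    """Raised when a request to a given availability zone fails."""

def fetch_with_failover(fetchers):
    """Try each zone's fetcher in turn; return (zone, result) from the
    first zone that responds, instead of depending on any single zone."""
    errors = {}
    for zone, fetch in fetchers.items():
        try:
            return zone, fetch()
        except ZoneUnavailableError as exc:
            errors[zone] = exc  # record the failure and try the next zone
    raise RuntimeError(f"all zones failed: {list(errors)}")

# Simulated zones: us-east-1a is "stuck", us-east-1b still serves requests.
def down():
    raise ZoneUnavailableError("zone unavailable")

def up():
    return {"user": "demo", "plan": "standard"}

zone, data = fetch_with_failover({"us-east-1a": down, "us-east-1b": up})
print(zone)  # us-east-1b
```

The point is not the few lines of retry logic but the design stance behind them: the application treats zone failure as a normal, expected event, which is essentially the assumption Netflix built into its platform.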

While architecting for failure in the cloud may appear to be a purely IT responsibility, it’s not. ISG views cloud as one component of your service delivery strategy, and regardless of whether those components are delivered in-house or outsourced, effective planning across all of them is vital.

Business continuity planning (BCP) and disaster recovery (DR) are two critical parts of this planning process. Unlike a traditional outsourcing agreement, in which the supplier takes on a significant level of responsibility for delivering the BCP and DR plans, the public cloud requires that customers retain a significant amount, if not all, of this responsibility, as well as the associated risk. In return, customers get a highly scalable, cost-effective and fast-to-provision computing platform.

Bottom line: Some organizations are finding out the hard way what happens when they don’t integrate cloud with their overall service delivery strategy. By including business continuity planning and disaster recovery in the cloud architecture design process, enterprises can significantly reduce the risk of business disruption.

About the author

Stanton helps enterprise IT and sourcing leaders rationalize and capitalize on emerging technology opportunities in the context of the global sourcing industry. He brings extensive knowledge of today’s cloud and automation ecosystems, as well as other disruptive trends that are helping to shape the business computing landscape. Stanton has been with ISG for more than a decade. During his tenure he has helped clients develop, negotiate and implement cloud infrastructure sourcing strategies, evaluate and select software-as-a-service platforms, identify and implement best-in-class service brokerage models, and assess how the emerging cloud master architecture can be leveraged for competitive advantage. Stanton has also guided a number of leading service providers in the development of next-generation cloud strategies. Stanton is a recognized industry expert, and has been quoted in CIO, Forbes and The Times of London. You can follow Stanton on Twitter: @stantonmjones.