S3 outages and AWS availability vs. durability

The latest Amazon Web Services outage put a spotlight on the difference between availability and durability on AWS, and it showed just how pervasive the platform has become. AWS now runs such a large share of the world’s publicly accessible websites that any downtime is noticeable. Some estimates suggest that as much as 20% of the internet was affected by the multi-hour outage.

One of the affected services, Amazon Simple Storage Service (S3), has become such a steadfast, reliable and heavily used offering that many of those who build and deploy AWS solutions may have taken it for granted and, as a result, forgotten some of the foundations of good infrastructure design, like planning for S3 redundancy across regions or availability zones.

The S3 outage revealed their mistake.

The Value of Amazon S3 Data Centers

S3 is an extremely robust and affordable AWS storage service that is designed to provide exceptional levels of durability and availability. The service was launched by AWS in 2006, very early in its campaign to dominate the underlying infrastructure running much of the internet. Since then, it has fundamentally changed the way many developers contemplate storing data to make it accessible via the web or via enterprise applications running on the platform.

Even for organizations still managing a good portion of their web hosting infrastructure in traditional data centers, Amazon S3 provides a welcome new way of making files available across the internet. Purchasing hard drives to store application data that needs to be publicly accessible is often simply no longer a consideration.

Taking S3 Reliability for Granted

Many web developers and development firms rely on S3. The service has an impeccable track record of availability, eliminates much of the administrative overhead associated with storing files on disks in a local or co-located datacenter, and offers new and innovative ways to access data. Coupled with the ability to store virtually unlimited data on a platform that grows and shrinks without needing to provision disks that may or may not be well-utilized, S3 is a compelling choice for storing Internet-accessible data.

While yesterday’s outage may have thrust AWS into a spotlight it would rather have avoided, it also highlighted that solutions deployed on top of AWS are only as strong as the design principles they are rooted in. The risk of S3 offering such a compelling list of features, and having a reputation as a rock-solid platform for data storage, is complacency: over time, many come to simply rely on the fact that Amazon has built such a solid offering. Throughout the day yesterday, we heard grumblings from all corners of the Internet that this outage was so shocking and disappointing because AWS has often touted the durability of this platform.

Balancing Availability vs. Durability on AWS S3

Amazon S3 is designed for 99.999999999% durability. That is a bunch of nines. But the difference between availability and durability is important: durability describes how unlikely it is that stored data is ever lost, while availability describes whether that data can be reached at a given moment. For sites that simply cannot tolerate downtime, or having certain components of their site unavailable for even a moment, it is critical and very feasible to design solutions that can deal with outages like the one we saw yesterday.
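To make the difference concrete, the arithmetic behind the nines can be sketched in a few lines of Python. This is a rough illustration rather than an AWS formula: the function names are our own, the 99.99% availability figure is the one quoted later in this post, and the ten million object count is an arbitrary example.

```python
from decimal import Decimal

MINUTES_PER_YEAR = Decimal(365 * 24 * 60)  # 525,600, ignoring leap years


def annual_downtime_minutes(availability_pct: str) -> Decimal:
    """Minutes per year a service may be unreachable at a given availability."""
    return (1 - Decimal(availability_pct) / 100) * MINUTES_PER_YEAR


def expected_objects_lost_per_year(num_objects: int, durability_pct: str) -> Decimal:
    """Average number of stored objects lost per year at a given durability."""
    return num_objects * (1 - Decimal(durability_pct) / 100)


if __name__ == "__main__":
    # Percentages are passed as strings so Decimal keeps every nine exactly.
    print(annual_downtime_minutes("99.99"))
    print(expected_objects_lost_per_year(10_000_000, "99.999999999"))
```

Under these assumptions, 99.99% availability still allows a little under an hour of downtime per year, while eleven nines of durability on ten million objects works out to roughly one lost object every 10,000 years. The two numbers answer very different questions.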

AWS S3 durability is built in. Availability is not: it requires designing a fault-tolerant infrastructure, and it takes a certain level of expertise and experience to do the latter. Having the right partner and/or employees in place can put firms on a path of building a resilient platform for hosting their most valuable web properties and enterprise applications.
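One common fault-tolerance pattern is to read from a primary location and fall back to a replica when the primary is unreachable. The sketch below shows only that control flow. Every name in it is hypothetical, and the two region functions are stand-ins: in practice each source would wrap a storage client pointed at a bucket in a different region, with the data kept in sync by replication, which is not shown here.

```python
from typing import Callable, Sequence


class AllSourcesFailed(Exception):
    """Raised when no configured source could return the object."""


def read_with_fallback(key: str, sources: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each source in order and return the first successful result."""
    errors = []
    for fetch in sources:
        try:
            return fetch(key)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    raise AllSourcesFailed(f"could not read {key!r}: {errors}")


# Hypothetical stand-ins for per-region storage clients.
def primary_region(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


def replica_region(key: str) -> bytes:
    return b"contents of " + key.encode()


if __name__ == "__main__":
    # The primary fails, so the replica serves the read.
    print(read_with_fallback("logo.png", [primary_region, replica_region]))
```

The ordering of the sources is the design choice here: the cheapest or closest copy goes first, and each additional source buys availability at the price of extra storage and replication traffic.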

The Cost of Full Amazon S3 Reliability

In addition, firms need to consider the financial ramifications of building such an infrastructure. Striking a balance between ongoing infrastructure spend and the cost of downtime requires a deep dive into the real costs of availability vs. durability. This is not always simple, but it should absolutely take place before jumping headfirst into any solution, AWS or otherwise. The cost of refactoring an application after an outage like the one we experienced yesterday can far exceed the cost of making these decisions upfront and in an informed manner.
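As a starting point for that analysis, the trade-off can be reduced to a back-of-the-envelope comparison. The sketch below is deliberately simplistic, and every figure passed to it is a made-up placeholder: real numbers for expected downtime, hourly cost of an outage, and redundancy spend have to come from each firm's own data.

```python
def expected_annual_downtime_cost(hours_down_per_year: float, cost_per_hour: float) -> float:
    """Expected yearly loss from downtime, given an estimated hourly cost."""
    return hours_down_per_year * cost_per_hour


def redundancy_pays_off(
    hours_down_per_year: float,
    cost_per_hour: float,
    annual_redundancy_cost: float,
    hours_down_with_redundancy: float = 0.0,
) -> bool:
    """True when the downtime cost avoided exceeds the extra yearly spend."""
    hours_avoided = hours_down_per_year - hours_down_with_redundancy
    avoided_cost = expected_annual_downtime_cost(hours_avoided, cost_per_hour)
    return avoided_cost > annual_redundancy_cost


if __name__ == "__main__":
    # Placeholder inputs: 4 hours of downtime a year and a 25,000 yearly
    # redundancy bill, compared at two different hourly outage costs.
    print(redundancy_pays_off(4, 10_000, 25_000))
    print(redundancy_pays_off(4, 1_000, 25_000))
```

With the same four hours of expected downtime, redundancy is worth it for the site losing 10,000 an hour and not for the one losing 1,000, which is why the calculation has to be done per application rather than once for the whole business.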

Measuring Amazon S3 Redundancy

Deft is not impervious to these outages and we are regrouping this morning to take our own advice. We are assessing our own data durability after yesterday’s outage and doing everything we can to make sure our infrastructure design practices match the needs of all our clients.

We plan to take a look at the data surrounding this outage in order to truly understand its impact and ramifications. Though yesterday was a rough day, the long-term AWS track record is still one of reliability. Amazon S3 is designed for 99.99% availability, and AWS hasn’t experienced meaningful downtime of this kind in recent memory. Some sporadic hiccups are inevitable, but yesterday made news because this was an unexpected and rare outage for AWS.

The Value of AWS S3 Durability Is Still Strong

AWS has provided a platform for technical acceleration not seen since the days when Bill Gates transitioned Microsoft from Windows 3.1 to Windows 95. They are leading the charge in ushering in a new renaissance in technology and computing, and providing a platform that short-circuits the time to market for some of the most compelling and innovative new products and services seen in quite some time, possibly ever.

Did yesterday suck? Yep. Will there be other outages? Inevitably. Is it a small price to pay for what AWS has provided in such a short amount of time? We think so.