The AWS S3 outage in 2017 was a wake-up call for many businesses, highlighting the importance of having a robust cloud strategy in place. The incident took S3 in the US-East-1 region offline and caused widespread disruptions for the countless websites and services that depended on it.
The outage was caused by a typo: while debugging the billing system, an S3 team member mistyped a command and removed far more servers than intended. This small mistake set off a chain reaction that ultimately took the service down.
One of the key takeaways from this incident is the importance of having a clear understanding of your cloud architecture and configuration. If the team had taken the time to review and test their changes, they might have caught the error before it caused a problem.
In the aftermath of the outage, AWS added safeguards to its operational tools and made changes so its subsystems could recover more quickly. The goal of these changes is to catch and contain issues before they become major problems.
The 2017 Outage
The 2017 Outage was a significant event that highlighted the importance of redundancy and fail-safes in cloud storage systems. It occurred on February 28, 2017, and lasted for approximately 5 hours.
During the incident, Amazon S3's US East region lost the server capacity behind two of its core subsystems, causing widespread outages and errors for anything that depended on the region. The failure was attributed to a combination of human error and complex system design.
The outage had a significant impact on businesses that relied on Amazon S3 for data storage, with some reporting losses of up to $150,000 per hour.
Lessons from the 2017 Outage
The 2017 outage was a wake-up call for many companies, not least Amazon itself, whose S3 failure showed that even the most mature systems can break.
Failure happens in complex systems, and it's not a matter of if, but when.
Detecting problems and paging engineers to fix them in the moment is not a strategy for consistent reliability.
We should plan for failure and prevent it from affecting customers, which is why testing and experimenting on systems is crucial.
Creating small failures in a controlled way allows us to learn from them and design automated work-arounds, redundancy, and fail-overs.
This approach helps eliminate bottlenecks and decreases the need for disaster recovery.
Gremlin's automated reliability platform empowers companies to find and fix availability risks before they impact users.
By using a platform like Gremlin, we can start finding hidden risks in our systems with a free 30-day trial.
A Typo Took Down the Internet
A simple typo caused a massive outage on the morning of February 28, 2017, taking down S3, a storage service so widely used that it amounts to part of the backbone of the internet.
The S3 team was debugging the billing system and accidentally removed a larger set of servers than intended.
The servers that were taken offline supported two other S3 subsystems, which are crucial for data retrieval and storage tasks.
One of the subsystems manages metadata and location information of all S3 objects in the region.
Without it, services that depend on it couldn't perform basic data retrieval and storage tasks.
The resulting full restart took longer than expected because of S3's enormous growth over the preceding several years.
Amazon is making changes to S3 to enable its systems to recover more quickly in the future.
Engineers will no longer be able to remove capacity from S3 if it would take subsystems below a certain threshold of server capacity.
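Amazon hasn't published the internals of that tooling, but the guard rail it describes is easy to sketch. Below is a minimal, hypothetical Python example of a capacity-removal check; the subsystem names, fleet sizes, and minimums are illustrative and are not AWS's actual values.

```python
# Hypothetical sketch of a capacity-removal guard rail.
# Subsystem names, fleet sizes, and minimums are illustrative, not AWS internals.

class CapacityRemovalError(Exception):
    pass

# Minimum number of servers each subsystem needs to keep serving requests.
MIN_CAPACITY = {
    "index": 500,      # manages metadata and location of objects (illustrative)
    "placement": 300,  # allocates storage for new objects (illustrative)
}

def remove_capacity(fleet: dict, subsystem: str, count: int) -> dict:
    """Remove `count` servers from `subsystem`, refusing to go below its minimum."""
    current = fleet[subsystem]
    if current - count < MIN_CAPACITY[subsystem]:
        raise CapacityRemovalError(
            f"Refusing to remove {count} servers from {subsystem}: "
            f"{current - count} would fall below the minimum of {MIN_CAPACITY[subsystem]}"
        )
    fleet[subsystem] = current - count
    return fleet

if __name__ == "__main__":
    fleet = {"index": 520, "placement": 400}
    try:
        remove_capacity(fleet, "index", 100)  # a typo-sized removal gets blocked
    except CapacityRemovalError as err:
        print(err)
```

The point isn't the specific numbers; it's that the tool itself refuses an unsafe operation instead of relying on the operator to notice the mistake.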
The AWS Service Health Dashboard was also affected, showing all services running green during the outage.
This was because the dashboard itself was dependent on S3, which was down at the time.
Amazon is making a change to the dashboard so it will function properly next time S3 goes down.
The company apologized for the impact the outage caused for its customers and vowed to learn from the event to improve availability.
Response and Prevention
Amazon's public document details their response to the S3 outage, including updating their tool to prevent capacity from being removed when it would take any subsystem below its minimum required capacity level. They also audited other operational tools to ensure they had similar checks in place.
Amazon made changes to make subsystems recover quicker, and even reprioritized planned work to fix the issue sooner. They also changed the AWS Service Health Dashboard to run across multiple regions.
To prevent similar outages, it's essential to build redundancy across regions and automate failover, so that the moment a failure is detected in one region, traffic is rerouted to another.
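As a rough illustration of what application-level failover can look like, here is a minimal Python sketch using boto3. The bucket names and regions are placeholders, and it assumes the replica bucket is kept in sync (for example, via S3 Cross-Region Replication); in production you would more likely pair replication with health checks and DNS- or load-balancer-level failover.

```python
# Minimal sketch of application-level S3 failover across regions.
# Bucket names and regions are placeholders; assumes the replica bucket is kept
# in sync (e.g., via S3 Cross-Region Replication).
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

PRIMARY = {"region": "us-east-1", "bucket": "example-data-primary"}
REPLICA = {"region": "us-west-2", "bucket": "example-data-replica"}

def _client(region: str):
    # Short timeouts and limited retries so a regional failure is detected quickly.
    return boto3.client(
        "s3",
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=5, retries={"max_attempts": 1}),
    )

def get_object_with_failover(key: str) -> bytes:
    """Read an object from the primary region, falling back to the replica on failure."""
    for target in (PRIMARY, REPLICA):
        try:
            resp = _client(target["region"]).get_object(Bucket=target["bucket"], Key=key)
            return resp["Body"].read()
        except (BotoCoreError, ClientError) as err:
            print(f"Read from {target['region']} failed: {err}")
    raise RuntimeError(f"Object {key!r} unavailable in all configured regions")
```

The short timeouts and single retry matter: the faster a regional failure is detected, the sooner traffic moves to the healthy region.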
How They Responded: Details from Their Retrospective
After a major outage, it's essential to take action and learn from the experience. Amazon took a proactive approach to communicating with affected businesses and customers, using alternative methods like Twitter until their dashboard was back up and running.
They also made significant changes to prevent similar issues in the future. Specifically, they updated their tool to remove capacity more slowly and added safeguards to prevent capacity removal when it would take a subsystem below its minimum required capacity level.
Amazon audited other operational tools to ensure they had similar checks in place. This is a crucial step in preventing similar outages, as it helps identify and address potential issues before they become major problems.
In addition to these changes, Amazon made other improvements to their systems. They made changes to make subsystems recover quicker, and they changed the AWS Service Health Dashboard to run across multiple regions.
A key takeaway from this outage is the importance of creating redundancy across regions for reliability. Most non-Amazon companies impacted would have significantly reduced their downtime if they had multi-region failover in place.
Here are some key changes Amazon made in response to the outage:
- Updated tool to remove capacity more slowly
- Added safeguards to prevent capacity removal
- Audited other operational tools for similar checks
- Made changes to make subsystems recover quicker
- Changed AWS Service Health Dashboard to run across multiple regions
Reproducing and Experimenting with Gremlin for Prevention
To reproduce this failure mode and experiment with Gremlin for prevention, start by testing how your system behaves under similar failure conditions. You can do this by running a scenario built around a Blackhole attack, which simulates the loss of a system or external dependency.
The Blackhole attack can be configured to start with a small impact and a small blast radius, then grow until you find the tipping point. A pre-made scenario called Unavailable Dependency in the Gremlin app does exactly this.
Your monitoring system should already be gathering key metrics such as error rate and latency; this data shows you where your system holds up, where it fails, and where to focus your improvement efforts.
Then identify the specific AWS service or endpoint your system depends on and run a Blackhole attack against it.
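The Blackhole attack itself is configured in the Gremlin app, but it helps to have an independent probe recording how S3 behaves while the experiment runs. The sketch below is a hypothetical companion script, not part of Gremlin; the bucket and key names are placeholders.

```python
# Hypothetical probe to run alongside a Blackhole experiment: repeatedly reads a
# small object from S3 and records latency and error rate. Bucket/key are placeholders.
import time
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

BUCKET, KEY = "example-probe-bucket", "probe/healthcheck.txt"
INTERVAL_SECONDS, DURATION_SECONDS = 5, 300

s3 = boto3.client(
    "s3",
    config=Config(connect_timeout=2, read_timeout=5, retries={"max_attempts": 1}),
)

def probe() -> None:
    attempts = errors = 0
    deadline = time.monotonic() + DURATION_SECONDS
    while time.monotonic() < deadline:
        attempts += 1
        start = time.monotonic()
        try:
            s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
            status = "ok"
        except (BotoCoreError, ClientError) as err:
            errors += 1
            status = f"error ({err.__class__.__name__})"
        latency_ms = (time.monotonic() - start) * 1000
        print(f"{status}: {latency_ms:.0f} ms, error rate {errors / attempts:.1%}")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    probe()
```

Run it before, during, and after the attack so you can compare baseline latency and error rate against behavior under failure.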
Recent Issues
Amazon S3 has experienced several smaller outages and issues more recently. The most notable incident began on October 8, 2024, when the service was impacted by an operational issue that lasted about 24 hours. The maximum reported impact level for Amazon Simple Storage Service was an informational message, and the incident affected multiple services; for more information, see the AWS Health Dashboard.
On October 7, 2024, Amazon Simple Storage Service in us-east-2 was impacted by an event that affected multiple services. The impact lasted for 15 minutes.
There were also two separate incidents on July 31, 2024. In the first, Amazon Simple Storage Service in us-east-1 was impacted by an event that affected multiple services, with the impact lasting 6 minutes. In the second, the service in us-east-1 was impacted by an operational issue that lasted for 1 day, again with a maximum impact level of an informational message.
Here are the details of the recent incidents:
- October 8, 2024: Service impacted for about 24 hours
- October 7, 2024: Service impacted for 15 minutes
- July 31, 2024: Service impacted for 6 minutes
- July 31, 2024: Service impacted for 1 day
Frequently Asked Questions
What caused the AWS outage?
In the 2017 S3 outage, the cause was human error: while debugging the billing system, an engineer mistyped a command and removed far more server capacity than intended, taking down the subsystems S3 needs to serve requests.
Sources
- https://www.gremlin.com/blog/the-2017-amazon-s-3-outage
- https://www.theverge.com/2017/3/2/14792442/amazon-s3-outage-cause-typo-internet-server
- https://statusticker.com/s/Amazon-S3
- https://www.eginnovations.com/documentation/Monitoring-AWS-EC2-Cloud/AWS-S3-Service-Status-Test.htm
- https://www.catchpoint.com/blog/aws-s3-outage-2017