The AWS S3 outage in 2017 was a wake-up call for many businesses, highlighting the importance of having a robust cloud strategy in place. The incident took S3 in the US-East-1 region offline and caused widespread disruptions for the countless websites and services that depended on it.
The outage was caused by a typo: while debugging the billing system, an S3 team member mistyped a command and removed far more servers than intended. This small mistake set off a chain reaction that ultimately took the service down.
One of the key takeaways from this incident is the importance of having a clear understanding of your cloud architecture and configuration. If the team had taken the time to review and test their changes, they might have caught the error before it caused a problem.
In the aftermath of the outage, AWS added safeguards to its operational tools and made changes so its subsystems could recover more quickly. The goal of these changes is to catch and contain issues before they become major problems.
The 2017 Outage
The 2017 Outage was a significant event that highlighted the importance of redundancy and fail-safes in cloud storage systems. It occurred on February 28, 2017, and lasted for approximately 5 hours.
During the incident, Amazon S3's US East region lost the server capacity behind two of its core subsystems, causing widespread outages and errors for anything that depended on the region. The failure was attributed to a combination of human error and complex system design.
The outage had a significant impact on businesses that relied on Amazon S3 for data storage, with some reporting losses of up to $150,000 per hour.
Lessons from the 2017 Outage
The 2017 outage was a wake-up call for many companies, not least Amazon itself, whose S3 failure showed that even the most mature systems can break.
Failure happens in complex systems, and it's not a matter of if, but when.
Detecting problems and paging engineers to fix them in the moment is not a strategy for consistent reliability.
We should plan for failure and prevent it from affecting customers, which is why testing and experimenting on systems is crucial.
Creating small failures in a controlled way allows us to learn from them and design automated work-arounds, redundancy, and fail-overs.
This approach helps eliminate bottlenecks and decreases the need for disaster recovery.
Gremlin's automated reliability platform empowers companies to find and fix availability risks before they impact users.
By using a platform like Gremlin, we can start finding hidden risks in our systems with a free 30-day trial.
A Typo Took Down the Internet
A simple typo caused a massive outage on the morning of February 28, 2017, taking down S3, a storage service so widely used that it amounts to part of the backbone of the internet.
The S3 team was debugging the billing system and accidentally removed a larger set of servers than intended.
The servers that were taken offline supported two other S3 subsystems, which are crucial for data retrieval and storage tasks.
One of the subsystems manages metadata and location information of all S3 objects in the region.
Without it, services that depend on it couldn't perform basic data retrieval and storage tasks.
The resulting full restart took longer than expected because of S3's enormous growth over the preceding several years.
Amazon is making changes to S3 to enable its systems to recover more quickly in the future.
Engineers will no longer be able to remove capacity from S3 if it would take subsystems below a certain threshold of server capacity.
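Amazon hasn't published the internals of that tooling, but the guard rail it describes is easy to sketch. Below is a minimal, hypothetical Python example of a capacity-removal check; the subsystem names, fleet sizes, and minimums are illustrative and are not AWS's actual values.

```python
# Hypothetical sketch of a capacity-removal guard rail.
# Subsystem names, fleet sizes, and minimums are illustrative, not AWS internals.

class CapacityRemovalError(Exception):
    pass

# Minimum number of servers each subsystem needs to keep serving requests.
MIN_CAPACITY = {
    "index": 500,      # manages metadata and location of objects (illustrative)
    "placement": 300,  # allocates storage for new objects (illustrative)
}

def remove_capacity(fleet: dict, subsystem: str, count: int) -> dict:
    """Remove `count` servers from `subsystem`, refusing to go below its minimum."""
    current = fleet[subsystem]
    if current - count < MIN_CAPACITY[subsystem]:
        raise CapacityRemovalError(
            f"Refusing to remove {count} servers from {subsystem}: "
            f"{current - count} would fall below the minimum of {MIN_CAPACITY[subsystem]}"
        )
    fleet[subsystem] = current - count
    return fleet

if __name__ == "__main__":
    fleet = {"index": 520, "placement": 400}
    try:
        remove_capacity(fleet, "index", 100)  # a typo-sized removal gets blocked
    except CapacityRemovalError as err:
        print(err)
```

The point isn't the specific numbers; it's that the tool itself refuses an unsafe operation instead of relying on the operator to notice the mistake.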
The AWS Service Health Dashboard was also affected, showing all services running green during the outage.
This was because the dashboard itself was dependent on S3, which was down at the time.
Amazon is making a change to the dashboard so it will function properly next time S3 goes down.
The company apologized for the impact the outage caused for its customers and vowed to learn from the event to improve availability.
Response and Prevention
Amazon's public document details their response to the S3 outage, including updating their tool to prevent capacity from being removed when it would take any subsystem below its minimum required capacity level. They also audited other operational tools to ensure they had similar checks in place.
Amazon made changes to make subsystems recover quicker, and even reprioritized planned work to fix the issue sooner. They also changed the AWS Service Health Dashboard to run across multiple regions.
To prevent similar outages, it's essential to build redundancy across regions and automate failover, so that the moment a failure is detected in one region, traffic is rerouted to another.
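As a rough illustration of what application-level failover can look like, here is a minimal Python sketch using boto3. The bucket names and regions are placeholders, and it assumes the replica bucket is kept in sync (for example, via S3 Cross-Region Replication); in production you would more likely pair replication with health checks and DNS- or load-balancer-level failover.

```python
# Minimal sketch of application-level S3 failover across regions.
# Bucket names and regions are placeholders; assumes the replica bucket is kept
# in sync (e.g., via S3 Cross-Region Replication).
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

PRIMARY = {"region": "us-east-1", "bucket": "example-data-primary"}
REPLICA = {"region": "us-west-2", "bucket": "example-data-replica"}

def _client(region: str):
    # Short timeouts and limited retries so a regional failure is detected quickly.
    return boto3.client(
        "s3",
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=5, retries={"max_attempts": 1}),
    )

def get_object_with_failover(key: str) -> bytes:
    """Read an object from the primary region, falling back to the replica on failure."""
    for target in (PRIMARY, REPLICA):
        try:
            resp = _client(target["region"]).get_object(Bucket=target["bucket"], Key=key)
            return resp["Body"].read()
        except (BotoCoreError, ClientError) as err:
            print(f"Read from {target['region']} failed: {err}")
    raise RuntimeError(f"Object {key!r} unavailable in all configured regions")
```

The short timeouts and single retry matter: the faster a regional failure is detected, the sooner traffic moves to the healthy region.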
How They Responded: Details from Their Retrospective
After a major outage, it's essential to take action and learn from the experience. Amazon took a proactive approach to communicating with affected businesses and customers, using alternative methods like Twitter until their dashboard was back up and running.
They also made significant changes to prevent similar issues in the future. Specifically, they updated their tool to remove capacity more slowly and added safeguards to prevent capacity removal when it would take a subsystem below its minimum required capacity level.
Amazon audited other operational tools to ensure they had similar checks in place. This is a crucial step in preventing similar outages, as it helps identify and address potential issues before they become major problems.
In addition to these changes, Amazon made other improvements to their systems. They made changes to make subsystems recover quicker, and they changed the AWS Service Health Dashboard to run across multiple regions.
A key takeaway from this outage is the importance of creating redundancy across regions for reliability. Most non-Amazon companies impacted would have significantly reduced their downtime if they had multi-region failover in place.
Here are some key changes Amazon made in response to the outage:
- Updated tool to remove capacity more slowly
- Added safeguards to prevent capacity removal
- Audited other operational tools for similar checks
- Made changes to make subsystems recover quicker
- Changed AWS Service Health Dashboard to run across multiple regions
Reproducing and Experimenting with Gremlin for Prevention
To reproduce this failure mode and experiment with Gremlin for prevention, start by testing how your system behaves under similar failure conditions. You can do this by running a scenario built around a Blackhole attack, which simulates the loss of a system or external dependency.
The Blackhole attack can be configured to start with a small impact and a small blast radius, then grow until you find the tipping point. A pre-made scenario called Unavailable Dependency in the Gremlin app does exactly this.
Your monitoring system should already be gathering key metrics such as error rate and latency; this data shows you where your system holds up, where it fails, and where to focus your improvement efforts.
Then identify the specific AWS service or endpoint your system depends on and run a Blackhole attack against it.
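The Blackhole attack itself is configured in the Gremlin app, but it helps to have an independent probe recording how S3 behaves while the experiment runs. The sketch below is a hypothetical companion script, not part of Gremlin; the bucket and key names are placeholders.

```python
# Hypothetical probe to run alongside a Blackhole experiment: repeatedly reads a
# small object from S3 and records latency and error rate. Bucket/key are placeholders.
import time
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

BUCKET, KEY = "example-probe-bucket", "probe/healthcheck.txt"
INTERVAL_SECONDS, DURATION_SECONDS = 5, 300

s3 = boto3.client(
    "s3",
    config=Config(connect_timeout=2, read_timeout=5, retries={"max_attempts": 1}),
)

def probe() -> None:
    attempts = errors = 0
    deadline = time.monotonic() + DURATION_SECONDS
    while time.monotonic() < deadline:
        attempts += 1
        start = time.monotonic()
        try:
            s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
            status = "ok"
        except (BotoCoreError, ClientError) as err:
            errors += 1
            status = f"error ({err.__class__.__name__})"
        latency_ms = (time.monotonic() - start) * 1000
        print(f"{status}: {latency_ms:.0f} ms, error rate {errors / attempts:.1%}")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    probe()
```

Run it before, during, and after the attack so you can compare baseline latency and error rate against behavior under failure.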
Recent Issues
Amazon S3 has experienced several smaller outages and issues more recently. The most notable incident began on October 8, 2024, when the service was impacted by an operational issue that lasted about 24 hours. The maximum reported impact level for Amazon Simple Storage Service was an informational message, and the incident affected multiple services; for more information, see the AWS Health Dashboard.
On October 7, 2024, Amazon Simple Storage Service in us-east-2 was impacted by an event that affected multiple services. The impact lasted for 15 minutes.
There were also two separate incidents on July 31, 2024. In the first, Amazon Simple Storage Service in us-east-1 was impacted by an event that affected multiple services, with the impact lasting 6 minutes. In the second, the service in us-east-1 was impacted by an operational issue that lasted for 1 day, again with a maximum impact level of an informational message.
Here are the details of the recent incidents:
- October 8, 2024: Service impacted for about 24 hours
- October 7, 2024: Service impacted for 15 minutes
- July 31, 2024: Service impacted for 6 minutes
- July 31, 2024: Service impacted for 1 day
Frequently Asked Questions
What caused the AWS outage?
In the 2017 S3 outage, the cause was human error: while debugging the billing system, an engineer mistyped a command and removed far more server capacity than intended, taking down the subsystems S3 needs to serve requests.
Sources
- https://www.gremlin.com/blog/the-2017-amazon-s-3-outage
- https://www.theverge.com/2017/3/2/14792442/amazon-s3-outage-cause-typo-internet-server
- https://statusticker.com/s/Amazon-S3
- https://www.eginnovations.com/documentation/Monitoring-AWS-EC2-Cloud/AWS-S3-Service-Status-Test.htm
- https://www.catchpoint.com/blog/aws-s3-outage-2017