A single point of failure triggered the Amazon outage affecting millions
Briefly

"In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. AWS network functions affected included the creating and modifying Redshift clusters, Lambda invocations, and Fargate task launches such as Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center."
"The affected US‑EAST‑1 is AWS's oldest and most heavily used hub. Regional concentration means even global apps often anchor identity, state or metadata flows there. When a regional dependency fails as was the case in this event, impacts propagate worldwide because many "global" stacks route through Virginia at some point. Modern apps chain together managed services like storage, queues, and serverless functions."
DNS state propagation delays destabilized an AWS network load balancer and produced connection errors originating from the US‑East‑1 region. Amazon disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide while engineers fix a race condition and add protections against applying incorrect DNS plans, and it is also making changes to EC2 and its network load balancer. The heavy concentration of customers routing through US‑East‑1 amplified the impact, because many global stacks anchor identity, state, or metadata there and cannot easily route around the region. The cascading failures affected Redshift, Lambda, Fargate, Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center, underscoring the need to eliminate single points of failure and design for contained failure.
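The race condition in the DNS automation lends itself to a brief illustration. The sketch below is a minimal, hypothetical Python example of a check-then-act race in which two workers apply DNS "plans" to shared record state; an unguarded worker can let a stale plan overwrite a newer one, while a version check prevents it. All names (DnsPlan, apply_plan_unsafe, apply_plan_safe) are illustrative assumptions and do not reflect AWS's actual DNS Planner or DNS Enactor implementation.

```python
# Hypothetical sketch of a stale-plan race; not AWS's actual implementation.
import threading
import time
from dataclasses import dataclass


@dataclass
class DnsPlan:
    generation: int   # monotonically increasing plan version
    records: dict     # hostname -> IP address


RECORDS = {}          # shared "live" DNS state
LOCK = threading.Lock()


def apply_plan_unsafe(plan: DnsPlan, delay: float) -> None:
    """Buggy applier: reads the plan, pauses (simulating propagation delay),
    then writes without checking whether a newer plan was already applied."""
    snapshot = dict(plan.records)   # read
    time.sleep(delay)               # delayed propagation window
    RECORDS.clear()                 # write, possibly clobbering newer state
    RECORDS.update(snapshot)


def apply_plan_safe(plan: DnsPlan, delay: float, applied_gen: list) -> None:
    """Guarded applier: drops any plan older than the newest one seen."""
    time.sleep(delay)
    with LOCK:
        if plan.generation <= applied_gen[0]:
            return                  # stale plan, ignore it
        applied_gen[0] = plan.generation
        RECORDS.clear()
        RECORDS.update(plan.records)


if __name__ == "__main__":
    old = DnsPlan(1, {"db.example.internal": "10.0.0.1"})
    new = DnsPlan(2, {"db.example.internal": "10.0.0.2"})

    # Unsafe: the old plan finishes last and silently wins.
    t1 = threading.Thread(target=apply_plan_unsafe, args=(old, 0.2))
    t2 = threading.Thread(target=apply_plan_unsafe, args=(new, 0.0))
    t1.start(); t2.start(); t1.join(); t2.join()
    print("unsafe result:", RECORDS)   # stale 10.0.0.1

    # Safe: the generation check rejects the stale overwrite.
    RECORDS.clear()
    gen = [0]
    t3 = threading.Thread(target=apply_plan_safe, args=(old, 0.2, gen))
    t4 = threading.Thread(target=apply_plan_safe, args=(new, 0.0, gen))
    t3.start(); t4.start(); t3.join(); t4.join()
    print("safe result:", RECORDS)     # current 10.0.0.2
```

The guarded variant is one common mitigation for this class of bug: reject any plan whose version is not strictly newer than the last one applied, rather than trusting arrival order.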
Read at Ars Technica