
"The race condition occurred when one DNS Enactor experienced "unusually high delays" while the DNS Planner continued generating new plans. A second DNS Enactor began applying the newer plans and executed a clean-up process just as the first Enactor completed its delayed run. This clean-up deleted the older plan as stale, immediately removing all IP addresses for the regional endpoint and leaving the system in an inconsistent state that prevented further automated updates applied by any DNS Enactors."
"Before manual intervention, systems connecting to DynamoDB experienced DNS failures, including customer traffic and internal AWS services. This impacted EC2 instance launches and network configuration, the postmortem says. The DropletWorkflow Manager (DWFM), which maintains leases for physical servers hosting EC2 instances, depends on DynamoDB. When DNS failures caused DWFM state checks to fail, droplets - the EC2 servers - couldn't establish new leases for instance state changes."
A race condition in DynamoDB's automated DNS management left an empty DNS record for the US-EAST-1 regional endpoint, triggering elevated API error rates starting at 11:48 PM PDT on October 19. The DNS system uses a DNS Planner that generates plans and DNS Enactors that apply the changes via Route 53. One Enactor experienced unusually high delays while the Planner continued producing new plans; a second Enactor applied the newer plans and ran a clean-up that deleted the older plan as stale, removing all IP addresses for the regional endpoint and leaving the system in an inconsistent state that blocked further automated DNS updates. The resulting DNS failures broke resolution for customer traffic and internal AWS services, cascading into EC2 instance launches, network configuration, and DWFM lease checks, so droplets could not establish new leases, and the incident became a day-long outage across dependent services.
Read at The Register