
AWS tells us what happened with that outage

And while we’re on internet-stopping issues, Amazon Web Services has explained what happened with its recent outage. An automated process in the Northern Virginia (US-EAST-1) region caused the problem.

“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” Amazon’s report says. “This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.”

According to the report, this issue even impaired Amazon’s ability to see what exactly was going wrong. It prevented the company’s operations team from using the real-time monitoring systems and internal controls they typically rely on, which explains why the outage took so long to fix.

Why do we care?

I’m pondering a question about monitoring technologies related to Log4Shell, so this report from AWS was very timely.

Who monitors the monitor? This is precisely what happened to Amazon: their monitoring systems relied on the very systems being monitored, so when those systems failed, the team was effectively flying blind.

The takeaway is a design principle: monitoring should be an independent system, and perhaps a read-only one, so that a failure in the monitored service cannot take the monitoring down with it.
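To make that principle concrete, here is a minimal sketch of an out-of-band health probe. All names and targets are hypothetical, not from Amazon’s report; the point is only that the probe runs on separate infrastructure, performs read-only checks, and logs results locally, so it keeps working even when the monitored system’s own telemetry pipeline is down.

```python
# Hypothetical out-of-band health probe: depends on nothing inside the
# monitored system except its externally reachable endpoints.

import socket
import time


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Read-only reachability check: open a TCP connection, then close it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(targets, log=print):
    """Poll each (host, port) pair once and record the result locally."""
    results = {}
    for host, port in targets:
        ok = check_tcp(host, port)
        results[(host, port)] = ok
        # Local logging: the probe does not report through the monitored
        # system, so an outage there cannot blind it.
        log(f"{time.strftime('%H:%M:%S')} {host}:{port} {'UP' if ok else 'DOWN'}")
    return results


if __name__ == "__main__":
    # Placeholder target; in practice, list the monitored endpoints here.
    probe([("example.com", 443)])
```

A real deployment would run such a probe from a separate network and account, on a schedule, with its own alerting path; the sketch just shows the shape of the idea.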