I want to revisit the outage from earlier this week. I don’t cover outages… because by the time you hear about them here, they’re over. But let’s discuss this one, and what happened.
Tuesday morning, Fastly, a content delivery network, had an outage that lasted 49 minutes. Sites that rely on it failed in different ways during the outage: some returned a 503 error, while others loaded without their images or formatting.
What happened? It turns out it was an undiscovered software bug, triggered when one customer made a change to their configuration. The company said it was aware of the problem “within one minute” and restored service to 95 percent of the network within 49 minutes.
Why do we care?
There are two reasons I want to discuss. First, most of those involved in the technical operation of the internet praised the company for both its fast response and its effective communication, during the incident and afterward with an incident report. Lesson One: Communications matter. A lot.
The second is broader, and I’m referencing a Protocol article here. Other portions of the news coverage were about the perceived concentration of power in a small number of companies. And as the article points out, that’s something of a red herring. The internet DOES have chokepoints, and software has bugs. Those aren’t the issues to worry about. This service worked as designed on both counts: the company executed well in handling its issue, and the service is both something providers choose to use AND something that has competitors. It’s up to us, as technologists, to talk about it correctly.