Two major IT providers suffered service problems this morning, causing CIOs and CISOs hours of grief.
A huge outage hit more than a dozen data centers belonging to content delivery provider Cloudflare, affecting a large number of major websites. It began around 2:34 a.m. Eastern time and was reported by the company to be resolved about an hour and a half later.
Ironically, the problem was caused by Cloudflare making a change to increase its resiliency.
Meanwhile, the cloud-based Microsoft 365 service also reported outages. Around 6 a.m. Eastern the company tweeted that it was investigating complaints that some users were experiencing delays or connection issues when accessing the Exchange Online service. The company later determined that multiple Microsoft 365 services were experiencing delays, connection and search issues. The fault was in the traffic management infrastructure “not working as expected,” the company said around 8 a.m. Eastern. “We’ve successfully rerouted traffic, and we’re seeing an improvement in service availability.”
In a blog post this morning, Cloudflare officials said traffic in 19 of its data centers was affected. Unfortunately, those 19 locations handle a significant proportion of the company’s global traffic.
“This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations,” officials said. “A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.”
“We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.”
Over the last 18 months Cloudflare has been trying to convert all of its busiest locations to a more flexible and resilient architecture, the company said. A critical part of this new architecture, which is designed as a Clos network, is an added layer of routing that creates a mesh of connections. This mesh allows Cloudflare to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem.
Like other IT networks, Cloudflare uses the Border Gateway Protocol (BGP). As part of this protocol, operators define policies that decide which prefixes (a collection of adjacent IP addresses) are advertised to peers (the other networks they connect to), or accepted from peers.
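To make the idea of a prefix concrete, here is a small Python sketch using the standard `ipaddress` module. The address ranges are documentation-only example ranges chosen for illustration; they are assumptions and not prefixes that Cloudflare announces.

```python
import ipaddress

# 192.0.2.0/24 is a range reserved for documentation, used here as a stand-in.
prefix = ipaddress.ip_network("192.0.2.0/24")

# A /24 prefix is one contiguous block of addresses.
print(prefix.num_addresses)       # how many addresses the prefix covers
print(prefix[0], prefix[-1])      # first and last address in the block

# Any single address either falls inside the prefix or it does not.
inside = ipaddress.ip_address("192.0.2.77") in prefix
outside = ipaddress.ip_address("198.51.100.1") in prefix
print(inside, outside)
```

Advertising a prefix to a peer amounts to telling that peer it can send traffic for every address in the block.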
These policies have individual components, which are evaluated sequentially. The end result is that any given prefix will either be advertised or not advertised. A change in policy can mean a previously advertised prefix is no longer advertised, known as being “withdrawn”, and those IP addresses will no longer be reachable on the Internet.
While Cloudflare was deploying a change to its prefix advertisement policies, a re-ordering of terms caused a critical subset of prefixes to be withdrawn, leaving the IP addresses in those prefixes unreachable.
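The effect of term order in a sequentially evaluated policy can be shown with a simplified Python sketch. It is not Cloudflare's configuration: the term names, the prefixes and the "first matching term decides" rule are assumptions made for illustration, and real router policy languages are considerably richer.

```python
import ipaddress

def is_advertised(policy, prefix):
    """Walk the terms in order; the first term that matches decides the outcome."""
    for _name, matches, action in policy:
        if matches(prefix):
            return action == "accept"
    return False  # assumption: a prefix that matches no term is not advertised

def advertised(policy, prefixes):
    return [str(p) for p in prefixes if is_advertised(policy, p)]

# Hypothetical prefixes taken from documentation-only ranges.
critical = ipaddress.ip_network("192.0.2.0/24")
unrelated = ipaddress.ip_network("198.51.100.0/24")
prefixes = [critical, unrelated]

# Hypothetical terms: (name, match function, action).
accept_critical = ("accept-critical", lambda p: p == critical, "accept")
reject_rest = ("reject-rest", lambda p: True, "reject")

intended = [accept_critical, reject_rest]   # accept term is evaluated first
reordered = [reject_rest, accept_critical]  # same terms, different order

before = advertised(intended, prefixes)
after = advertised(reordered, prefixes)
withdrawn = [p for p in before if p not in after]

print("advertised before:", before)
print("advertised after: ", after)
print("withdrawn:        ", withdrawn)
```

Both policies contain exactly the same terms. In the re-ordered version the catch-all reject term is reached first, so the accept term never gets a chance to match and the previously advertised prefix is withdrawn.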