Microsoft has identified a faulty Enterprise Configuration Service (ECS) deployment as the root cause of the outage, which affected several of the company’s services.
ECS is an internal central configuration repository that enables Microsoft services to make far-reaching dynamic changes across multiple services and functions. ECS also makes changes to specific configurations per tenant or user.
“A deployment in the ECS service contained a code defect that affected backward compatibility with services that leverage ECS. The net result was that for services that utilize ECS it would return incorrect configurations to all its partners,” the company said.
Microsoft explained that the extent of the impact depended on how “individual Microsoft services utilize the malformed configuration provided by ECS. Impact ranged from services crashing such as Teams while other services experienced limited to no impact.”
Microsoft said it is working to improve the resilience of the Microsoft Teams service by having it fall back to a cached version of the ECS configuration if ECS fails again.
The company is also investing in additional fault isolation to limit the impact of an ECS failure, and it plans to update its monitoring thresholds so that errors of this kind are detected sooner.
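Microsoft has not published how the cached fallback will work. As a rough illustration of the general idea only, the Python sketch below keeps the last configuration that passed a basic validity check and serves it whenever the configuration service fails or returns a malformed response. The names `CachedConfigClient`, `is_valid` and `ConfigUnavailable`, as well as the `version` and `settings` fields, are hypothetical and do not come from Microsoft.

```python
from typing import Callable, Optional


class ConfigUnavailable(Exception):
    """Raised when there is neither a fresh nor a cached configuration."""


def is_valid(config: object) -> bool:
    # Hypothetical schema check: a dict with an integer "version"
    # and a dict of "settings". A real service would validate far more.
    return (
        isinstance(config, dict)
        and isinstance(config.get("version"), int)
        and isinstance(config.get("settings"), dict)
    )


class CachedConfigClient:
    """Wraps a fetch function and remembers the last valid configuration."""

    def __init__(self, fetch: Callable[[], object]) -> None:
        self._fetch = fetch  # stand-in for a call to the config service
        self._last_good: Optional[dict] = None

    def get(self) -> dict:
        try:
            config = self._fetch()
        except Exception:
            config = None  # treat a failed call like a malformed response

        if is_valid(config):
            self._last_good = config  # refresh the cache on every good fetch
            return config

        if self._last_good is not None:
            return self._last_good  # fall back to the cached version

        raise ConfigUnavailable("no valid configuration available")
```

In this sketch a client that has fetched at least one valid configuration keeps running on it while the service misbehaves. A client that starts during the failure has nothing cached and still fails, which is one reason a cache alone does not replace fault isolation.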
The sources for this piece include an article in BleepingComputer.