On March 15, 2021, Microsoft Teams, along with many other services, experienced a global outage.
Microsoft has released a root cause analysis of the incident. In this video, we will summarize what caused the outage and what Microsoft did to resolve it.
If you like my content, like and subscribe to get notified when I post new videos. I specialize in backend engineering discussions. Let's get into it.
Microsoft services rely on Azure Active Directory for authentication and authorization.
Each service receives a token and verifies it against a signing key 🔑 to make sure the token is valid. As part of automated security hygiene, Microsoft rotates keys and invalidates keys that are no longer used.
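To make the verification step concrete, here is a minimal sketch of how a service might check a token against a set of trusted signing keys. It is not how Azure Active Directory is implemented: real identity providers sign with asymmetric keys, while this sketch uses an HMAC shared secret so it runs on the standard library alone. The names `TRUSTED_KEYS`, `sign`, and `verify` and the token layout are all made up for illustration.

```python
import hashlib
import hmac

# Hypothetical key metadata: key id -> secret.
# Assumption: a shared secret stands in for a real public/private key pair.
TRUSTED_KEYS = {"key-1": b"secret-one", "key-2": b"secret-two"}


def sign(payload: str, key_id: str, keys: dict) -> str:
    """Build a token of the form '<key id>.<payload>.<signature>'."""
    sig = hmac.new(keys[key_id], payload.encode(), hashlib.sha256).hexdigest()
    return f"{key_id}.{payload}.{sig}"


def verify(token: str, keys: dict) -> bool:
    """Accept the token only if its signing key is still trusted and the signature matches."""
    key_id, payload, sig = token.split(".")
    secret = keys.get(key_id)
    if secret is None:
        # The key is no longer in the trusted metadata, so the token is rejected.
        return False
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)


token = sign("user=alice", "key-1", TRUSTED_KEYS)
print(verify(token, TRUSTED_KEYS))  # True while key-1 is trusted

# Simulate a rotation that drops key-1 from the trusted metadata.
after_rotation = {k: v for k, v in TRUSTED_KEYS.items() if k != "key-1"}
print(verify(token, after_rotation))  # False: the token itself did not change
```

The point of the sketch is that a token's validity depends on the key metadata the service holds, not only on the token.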
There was a bug in the automated key rotation that removed a signing key that was not supposed to be removed. Unfortunately, this key had signed a large number of tokens that were in use by many services.
As a result of that removal, the updated key metadata was downloaded by all services, and all of those tokens were marked as invalid (the key was no longer trusted).
Users connecting to these services started to get errors because of this.
Microsoft engineers quickly realized this and reverted the metadata so that the key would be trusted again.
However, because each service had already cached the knowledge that the key was untrusted, it would not pick up the reverted metadata (cache invalidation is one of the hardest problems).
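The stale-cache behavior can be shown with a small sketch. This is an assumption-heavy toy, not Microsoft's code: `KeyMetadataCache`, the `fetch` callable, and the key ids are hypothetical, and the cache here never refreshes at all, which is a simplification of whatever refresh policy the real services used.

```python
class KeyMetadataCache:
    """Caches the trusted key ids fetched from a hypothetical metadata source."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable returning the currently trusted key ids
        self._cached = None   # filled on first use and never refreshed afterwards

    def is_trusted(self, key_id: str) -> bool:
        if self._cached is None:
            # Copy the metadata so later changes at the source are not seen.
            self._cached = set(self._fetch())
        return key_id in self._cached


published = {"key-2"}                  # key-1 was removed by mistake
cache = KeyMetadataCache(lambda: published)
print(cache.is_trusted("key-1"))       # False: the service caches the bad metadata

published.add("key-1")                 # the provider reverts the metadata
print(cache.is_trusted("key-1"))       # False: the cached copy is stale
```

Reverting the source of truth is not enough on its own. Every consumer holding a cached copy also has to drop it.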
That exacerbated the problem: some services went down, while others continued to distrust those tokens.
Engineers finally pushed a fix that forced services to refresh their key metadata and trust the key again.
This is when the services started coming back to normal.
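A forced refresh could look something like the sketch below. The root cause analysis summarized here does not describe the actual mechanism, so treat `RefreshableKeyCache` and `force_refresh` as hypothetical names for the general idea of dropping the cached copy and pulling the metadata again.

```python
class RefreshableKeyCache:
    """Key metadata cache with an explicit refresh hook (hypothetical design)."""

    def __init__(self, fetch):
        self._fetch = fetch              # callable returning the currently trusted key ids
        self._cached = set(fetch())      # snapshot taken at startup

    def is_trusted(self, key_id: str) -> bool:
        return key_id in self._cached

    def force_refresh(self) -> None:
        # Discard the snapshot and pull the current metadata from the source.
        self._cached = set(self._fetch())


published = {"key-2"}                    # snapshot taken while key-1 was missing
cache = RefreshableKeyCache(lambda: published)
published.add("key-1")                   # metadata reverted at the source

before = cache.is_trusted("key-1")       # False: still using the stale snapshot
cache.force_refresh()
after = cache.is_trusted("key-1")        # True: the reverted metadata is now loaded
print(before, after)
```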