
Microsoft Outage March 2021 - Slides







On March 15, 2021, Microsoft Teams, alongside many other services, experienced a global outage.




Microsoft has released a root cause analysis of the incident. In this video we will summarize what caused the outage and what Microsoft did to resolve it.


If you like my content, like and subscribe to get notified when I post new videos. I specialize in backend engineering discussions. Let's get into it.







Microsoft services rely on Azure Active Directory for authentication and authorization.




Each service receives a token and verifies it against a signing key 🔑 to make sure the token is still valid. As part of automated security hygiene, Microsoft rotates keys and invalidates keys that are no longer used.
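
To make the verification step concrete, here is a minimal sketch of checking a token against a set of trusted signing keys. It is not Azure AD's implementation: the token layout, the key ids, and the function names are all made up for illustration, and HMAC stands in for the asymmetric signatures a real identity provider would use, purely to keep the example dependency-free.

```python
import hashlib
import hmac

# Hypothetical key set (key id -> secret). HMAC is a stand-in for real
# asymmetric signing keys so the sketch needs only the standard library.
TRUSTED_KEYS = {"key-1": b"secret-one", "key-2": b"secret-two"}


def sign_token(payload: str, key_id: str, keys: dict) -> str:
    """Issue a token of the form '<key_id>.<payload>.<signature>'."""
    sig = hmac.new(keys[key_id], payload.encode(), hashlib.sha256).hexdigest()
    return f"{key_id}.{payload}.{sig}"


def verify_token(token: str, keys: dict) -> bool:
    """Accept a token only if its key is still trusted and the signature matches."""
    key_id, rest = token.split(".", 1)
    payload, sig = rest.rsplit(".", 1)
    key = keys.get(key_id)
    if key is None:
        # The signing key is no longer in the trusted set: reject the token.
        return False
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)


token = sign_token("user=alice", "key-1", TRUSTED_KEYS)
valid_with_key = verify_token(token, TRUSTED_KEYS)

# Simulate a rotation that drops key-1 from the trusted set.
after_rotation = {k: v for k, v in TRUSTED_KEYS.items() if k != "key-1"}
valid_after_removal = verify_token(token, after_rotation)
```

The point of the sketch is the `key is None` branch: a perfectly well-formed token stops verifying the moment its signing key disappears from the trusted set.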




There was a bug in the automated key rotation that removed a signing key that was not supposed to be removed. Unfortunately, this key had signed a large number of tokens that were still in use by many services.
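
One way to picture the kind of guard a rotation job needs is the sketch below. It is an assumption-laden illustration, not a description of Microsoft's tooling: the `SigningKey` fields and the `keys_safe_to_remove` helper are hypothetical, and the rule it applies (only remove a key that is retired and whose newest token has already expired) is just one plausible safety check.

```python
from dataclasses import dataclass


@dataclass
class SigningKey:
    key_id: str
    retired: bool                # no longer used to sign new tokens
    latest_token_expiry: float   # expiry time of the newest token it signed


def keys_safe_to_remove(keys, now):
    """Return ids of keys that are retired and have no unexpired tokens left."""
    return [k.key_id for k in keys if k.retired and k.latest_token_expiry <= now]


keys = [
    SigningKey("key-old", retired=True, latest_token_expiry=100.0),
    SigningKey("key-busy", retired=True, latest_token_expiry=900.0),
    SigningKey("key-active", retired=False, latest_token_expiry=900.0),
]

# At time 500, key-busy still backs live tokens, so only key-old qualifies.
removable = keys_safe_to_remove(keys, now=500.0)
```

A rotation job that skipped a check like this and removed `key-busy` would reproduce the failure described above: valid tokens with no trusted key left to verify them.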




As a result of that removal, services downloaded the updated key metadata, and every token signed with that key was marked as invalid (the key was no longer trusted).




Users connecting to these services started to get errors because of this.




Microsoft engineering quickly realized this and reverted the metadata so the key would be trusted again.




However, because each service had already cached the knowledge that the key was untrusted, it did not pick up the reverted metadata (cache invalidation is the most difficult problem).
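
The stale-cache effect can be sketched with a simple time-based cache. Everything here is hypothetical (the `KeyMetadataCache` class, the one-hour TTL, the `provider` dictionary standing in for the published metadata); the actual caching behavior of the affected services is not described in these slides.

```python
class KeyMetadataCache:
    """Caches the set of trusted key ids for ttl_seconds (hypothetical design)."""

    def __init__(self, fetch, ttl_seconds, clock):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys = None
        self._fetched_at = None

    def trusted_keys(self):
        now = self._clock()
        # Refetch only on first use or once the cached copy has expired.
        if self._keys is None or now - self._fetched_at >= self._ttl:
            self._keys = set(self._fetch())
            self._fetched_at = now
        return self._keys


# Simulated published metadata and a controllable clock.
provider = {"keys": {"key-2"}}   # the bad rotation already removed key-1
current_time = [0.0]

cache = KeyMetadataCache(
    fetch=lambda: provider["keys"],
    ttl_seconds=3600,
    clock=lambda: current_time[0],
)

trusted_before_revert = "key-1" in cache.trusted_keys()  # bad metadata cached

provider["keys"] = {"key-1", "key-2"}   # the provider reverts the metadata

current_time[0] = 60.0
trusted_after_revert = "key-1" in cache.trusted_keys()   # still the stale copy

current_time[0] = 3600.0
trusted_after_ttl = "key-1" in cache.trusted_keys()      # cache finally refetches
```

In this sketch the revert has no effect for a full hour, because the service never asks the provider again until its cached copy expires.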




That exacerbated the problem: some services went down while others continued to distrust those tokens.




Engineers finally pushed a fix that forced services to refresh the key metadata, pull the reverted version, and trust the key again.
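
A forced refresh can be pictured as an explicit way to bypass whatever copy a service is holding. The sketch below assumes a hypothetical `RefreshableKeyCache` with a `force_refresh` method; how Microsoft actually triggered the refresh across services is not covered in these slides.

```python
class RefreshableKeyCache:
    """Holds a copy of the trusted key ids until told to refresh (hypothetical)."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._keys = set(fetch())

    def is_trusted(self, key_id):
        return key_id in self._keys

    def force_refresh(self):
        # Discard the cached copy and pull the current metadata immediately.
        self._keys = set(self._fetch())


# Simulated published metadata, starting in the broken state.
published = {"keys": {"key-2"}}
service_cache = RefreshableKeyCache(lambda: published["keys"])

published["keys"] = {"key-1", "key-2"}   # metadata reverted upstream

trusted_before_refresh = service_cache.is_trusted("key-1")  # stale copy says no
service_cache.force_refresh()
trusted_after_refresh = service_cache.is_trusted("key-1")   # fresh copy says yes
```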




This is when the services started coming back to normal.

You will get a PDF (3MB) file
