Massive Microsoft 365 outage due to WAN router IP change

Massive Microsoft 365 outage due to WAN router IP change – BleepingComputer


According to Microsoft, this week’s five-hour global Microsoft 365 outage was caused by a router IP address change that led to packet forwarding problems between all other routers on its wide area network (WAN).

Redmond said at the time that the outage resulted from DNS and WAN configuration issues caused by a WAN update, and that users in all regions served by the impacted infrastructure were having problems accessing the affected Microsoft 365 services.

The issue caused service impact in waves, peaking approximately every 30 minutes, as reported on the Microsoft Azure service status page (the status page itself was also affected, intermittently displaying “504 Gateway Timeout” errors).

The list of services affected by the outage included Microsoft Teams, Exchange Online, Outlook, SharePoint Online, OneDrive for Business, Power BI, the Microsoft 365 admin center, Microsoft Graph, Microsoft Intune, Microsoft Defender for Cloud Apps, and Microsoft Defender for Identity.

Overall, it took Redmond over five hours to fix the issue, from 7:05 a.m. UTC when it began investigating to 12:43 p.m. UTC when service was restored.

“Between 07:05 UTC and 12:43 UTC on January 25, 2023, customers experienced network connectivity issues, manifested as long network latencies and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services, including Microsoft 365 and Power Platform,” Microsoft said in a preliminary post-incident report published today.

“While most regions and services had recovered by 09:00 UTC, intermittent packet loss issues were fully resolved by 12:43 UTC. This incident also impacted Azure Government Cloud services that depended on the Azure public cloud.”

Microsoft has now also revealed that the problem was triggered when a WAN router’s IP address was changed using a command that had not been thoroughly vetted and that behaves differently on different network devices.

“As part of a planned change to update the IP address on a WAN router, a command given to the router caused messages to be sent to all other routers on the WAN, causing all of them to recalculate their adjacency and forwarding tables,” Microsoft said.

“During this recalculation, the routers could not correctly forward packets passing through them.”

Although the network began recovering on its own at 08:10 UTC, the automated systems responsible for maintaining the health of the WAN were paused because of the impact on the network.

These included systems for identifying and removing unhealthy devices, and traffic engineering systems for optimizing the flow of traffic across the network.

As a result of the pause, some network paths continued to experience increased packet loss from 09:35 UTC until the systems were manually restarted, returning the WAN to optimal operating conditions and completing the recovery process at 12:43 UTC.
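To put the reported timestamps side by side, the short Python sketch below computes the elapsed time between them. The timestamps come from Microsoft's statements quoted above; the event labels and helper names (`TIMELINE`, `elapsed`, `format_duration`) are illustrative only and are not part of Microsoft's report.

```python
from datetime import datetime, timedelta

# All times are UTC on January 25, 2023, as quoted in the article.
DAY = "2023-01-25"

# Event labels are paraphrases added for illustration.
TIMELINE = [
    ("07:05", "connectivity issues begin, investigation starts"),
    ("08:10", "network recovering on its own"),
    ("09:00", "most regions and services recovered"),
    ("09:35", "increased packet loss continues on some paths"),
    ("12:43", "recovery complete"),
]


def parse(hhmm: str) -> datetime:
    """Turn an HH:MM string into a datetime on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two HH:MM stamps on the incident day."""
    return parse(end) - parse(start)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as hours and zero-padded minutes."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    first = TIMELINE[0][0]
    for stamp, label in TIMELINE:
        offset = format_duration(elapsed(first, stamp))
        print(f"{stamp} UTC  +{offset}  {label}")
```

By this arithmetic the full incident window ran 5 hours 38 minutes, of which the final 3 hours 8 minutes (09:35 to 12:43 UTC) were the lingering packet loss on some paths.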

Following this incident, Microsoft says it will now block highly impactful commands from being executed and will also require that all command executions follow guidelines for safe configuration changes.
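Microsoft has not described how such a block would work. As a rough illustration of the general idea only, the Python sketch below refuses a high-impact command unless it has been vetted on the target device type and approved. Everything in it is hypothetical: the command prefixes are made up rather than real router syntax, and `ChangeRequest`, `is_high_impact`, and `may_execute` are names invented for this example.

```python
from dataclasses import dataclass

# Hypothetical prefixes for commands treated as high impact.
# These are placeholders, not real vendor CLI syntax.
HIGH_IMPACT_PREFIXES = ("set router-id", "change wan-ip")


@dataclass(frozen=True)
class ChangeRequest:
    device: str
    command: str
    vetted_on_device_model: bool  # command was tested on this device type
    approved: bool                # change passed a review step


def is_high_impact(command: str) -> bool:
    """Match a command against the hypothetical high-impact prefixes."""
    normalized = " ".join(command.lower().split())
    return normalized.startswith(HIGH_IMPACT_PREFIXES)


def may_execute(request: ChangeRequest) -> tuple[bool, str]:
    """Return (allowed, reason) for a requested command."""
    if not is_high_impact(request.command):
        return True, "allowed: not classified as high impact"
    if not request.vetted_on_device_model:
        return False, "blocked: command not vetted on this device model"
    if not request.approved:
        return False, "blocked: high-impact change lacks approval"
    return True, "allowed: vetted and approved"


if __name__ == "__main__":
    request = ChangeRequest(
        device="wan-router-1",  # hypothetical device name
        command="change wan-ip 192.0.2.10",
        vetted_on_device_model=False,
        approved=True,
    )
    print(may_execute(request))
```

The vetting check comes first because, per Microsoft, the command behind the outage had not been thoroughly vetted and behaves differently on different network devices, so an approval alone would not have caught it.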