This post shares the RCA (Root Cause Analysis) report for the Azure Storage outage that occurred on February 26, 2021.
RCA - Azure Storage and dependent services - Japan East (Tracking ID PLWV-BT0)
Summary of Impact: Between 03:26 UTC and 10:02 UTC on 26 Feb 2021, a subset of customers in Japan East may have experienced service degradation and increased latency for resources utilizing Azure Storage, including failure of virtual machine disks. Some Azure services utilizing Storage may have also experienced downstream impact.
Summary Root Cause: During this incident, the impacted storage scale unit was under heavier than normal utilization. This was due to:
- Incorrect limits set on the scale unit, which allowed more load than desirable to be placed on it. This reduced the headroom that is usually available for unexpected events such as sudden spikes in growth. That headroom is what normally allows time to take load-balancing actions.
- Additionally, the load balancing automation was not sufficiently spreading the load to other scale units within the region.
This high utilization triggered heavy throttling of storage operations to protect the scale unit from catastrophic failures. This throttling resulted in failures or increased latencies for storage operations on the scale unit.
Note: The original RCA mistakenly identified a deployment as a triggering event for the increased load. This is because during an upgrade, the nodes to be upgraded are removed from rotation, temporarily increasing load on remaining nodes. An upgrade was in queue on the scale unit but had not yet started. Our apologies for the initial mistake.
Background: An internal automated load balancing system actively monitors resource utilization of storage scale units to optimize load across scale units within an Azure region. For example, resources such as disk space, CPU, memory and network bandwidth are targeted for balancing. During this load balancing, storage data is migrated to a new scale unit, validated for data integrity at the destination and finally the data is cleaned up on the source to return free resources. This automated load-balancing happens continuously and in real-time to ensure workloads are properly optimized across available resources.
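The migrate, validate and clean-up sequence described above can be sketched as follows. This is only an illustration, not Azure's implementation: the scale units are modeled as plain Python dictionaries, and the names `checksum` and `migrate_item` are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest used to compare source and destination copies."""
    return hashlib.sha256(data).hexdigest()


def migrate_item(source: dict, destination: dict, key: str) -> bool:
    """Move one item between two hypothetical scale units (modeled as dicts).

    Returns True if the item was migrated and the source space was freed,
    False if validation failed and the source copy was left untouched.
    """
    expected = checksum(source[key])

    # Step 1: migrate the data to the new scale unit.
    destination[key] = source[key]

    # Step 2: validate data integrity at the destination.
    if checksum(destination[key]) != expected:
        del destination[key]  # discard the bad copy, keep the source
        return False

    # Step 3: clean up the source to return free resources.
    del source[key]
    return True
```

The source copy is deleted only after the destination copy has been validated, so a migration that stalls before the last step frees no resources on the source.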
Detailed Root Cause: Prior to the start of impact, our automated load-balancing system had detected high utilization on the scale unit and was performing data migrations to balance the load. Some of these load-balancing migrations did not make sufficient progress, creating a situation where the resource utilization on the scale unit reached levels that were above the safe thresholds that we try to maintain for sustained production operation. This triggered automated throttling on incoming storage write requests to protect the scale unit from catastrophic failures. When our engineers were engaged, they also detected that the utilization limits that were set on the scale unit to control how much data and traffic should be directed to it were higher than expected. This did not give us sufficient headroom to complete load-balancing actions to prevent customer-facing impact.
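The relationship between the placement limit, the throttling threshold and headroom can be illustrated with a small sketch. The class, its field names and all percentages are assumptions made for illustration; the report does not state the actual values or how the thresholds are implemented.

```python
from dataclasses import dataclass


@dataclass
class ScaleUnit:
    """Hypothetical model of a storage scale unit; all values are percentages."""

    utilization_pct: int        # current resource utilization
    placement_limit_pct: int    # new load is directed here only below this limit
    throttle_threshold_pct: int # writes are throttled at or above this level

    def headroom_pct(self) -> int:
        """Room between the placement limit and the throttling threshold."""
        return self.throttle_threshold_pct - self.placement_limit_pct

    def accepts_new_load(self) -> bool:
        return self.utilization_pct < self.placement_limit_pct

    def should_throttle_writes(self) -> bool:
        return self.utilization_pct >= self.throttle_threshold_pct


# Illustrative numbers only: a limit set too high leaves little headroom.
intended = ScaleUnit(utilization_pct=72, placement_limit_pct=70, throttle_threshold_pct=90)
too_high = ScaleUnit(utilization_pct=87, placement_limit_pct=88, throttle_threshold_pct=90)
```

In this sketch the `intended` unit has already stopped accepting new load with 20 points of headroom left, while the `too_high` unit is still accepting load with only 2 points left before throttling starts.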
Mitigation: To mitigate customer impact as fast as possible, we took the following actions:
- Engineers took steps to aggressively balance resource load out of the storage scale unit. The load-balancing migrations that were previously unable to finish were manually unblocked and completed, allowing a sizeable quantity of resources to be freed up for use. Additionally, load-balancing operations were tuned to improve their throughput and distribute load more effectively.
- We prioritized recovery of nodes with hardware failures that had been taken out of rotation to bring additional resources online.
These actions brought the resource utilization on the scale unit to a safe level which was well below throttling thresholds. Once Storage services were recovered around 06:56 UTC, dependent services started recovering. We declared full mitigation at 10:02 UTC.
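The durations implied by the timestamps in this report can be worked out as below. The times come from the report itself; note that the Storage recovery time is given as "around 06:56 UTC", so the first duration is approximate.

```python
from datetime import datetime, timedelta, timezone


def utc(hhmm: str) -> datetime:
    """Build a UTC timestamp on 26 Feb 2021 from an HH:MM string."""
    parsed = datetime.strptime(f"2021-02-26 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


impact_start = utc("03:26")
storage_recovered = utc("06:56")   # approximate, per the report
full_mitigation = utc("10:02")

time_to_storage_recovery = storage_recovered - impact_start
total_impact_window = full_mitigation - impact_start

print(time_to_storage_recovery)  # 3:30:00
print(total_impact_window)       # 6:36:00
```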
Next steps: We sincerely apologize for the impact this event had on our customers. Next steps include but are not limited to:
- Optimize the maximum allowed resource utilization levels on this scale unit to provide increased headroom in the face of multiple unexpected events.
- Improve existing detection and alerting for cases when load-balancing is not keeping up, so corrective action can be triggered early to help avoid customer impact.
- Improve load-balancing automation to handle certain edge cases under resource pressure where manual intervention is currently required, to help prevent impactful events.
- Improve emergency levers to allow for faster mitigation of impactful events related to resource utilization.
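One way to detect that load-balancing is not keeping up is to compare the rate at which utilization grows with the rate at which migrations drain it, and to alert when the throttling threshold would be reached within a chosen lead time. The sketch below is a hypothetical illustration of that idea, not the detection logic Azure uses; the function names, parameters and rates are all assumptions.

```python
from typing import Optional


def hours_until_threshold(
    utilization_pct: float,
    threshold_pct: float,
    growth_pct_per_hour: float,
    drain_pct_per_hour: float,
) -> Optional[float]:
    """Estimate hours until utilization reaches the throttling threshold.

    Returns 0.0 if the threshold is already reached, and None if migrations
    drain load at least as fast as it grows (the threshold is never reached).
    """
    if utilization_pct >= threshold_pct:
        return 0.0
    net_growth = growth_pct_per_hour - drain_pct_per_hour
    if net_growth <= 0:
        return None
    return (threshold_pct - utilization_pct) / net_growth


def should_alert(
    utilization_pct: float,
    threshold_pct: float,
    growth_pct_per_hour: float,
    drain_pct_per_hour: float,
    lead_time_hours: float,
) -> bool:
    """Alert when the threshold would be hit within the required lead time."""
    eta = hours_until_threshold(
        utilization_pct, threshold_pct, growth_pct_per_hour, drain_pct_per_hour
    )
    return eta is not None and eta <= lead_time_hours
```

For example, `should_alert(80, 90, 3, 1, lead_time_hours=6)` returns `True`, because a net growth of 2 points per hour would close a 10-point gap in 5 hours, which is inside the 6-hour lead time.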
Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: https://aka.ms/AzurePIRSurvey