Summary of the Microsoft Azure outage that blocked starting and creating VMs

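Between 14:12 and 20:45 JST on Wednesday, October 13, 2021, an outage in Microsoft Azure prevented management operations such as start and create on Windows virtual machines.

The outage affected many regions worldwide, including Japan East and Japan West. According to Microsoft, it was fully resolved as of 21:00 JST the same day.

This article summarizes the incident based on the information Microsoft has officially published.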

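[Update: Fri, 2021/10/15 00:00] Transcribed the original text published on Azure status history.
[Update: Sat, 2021/10/16 14:45] The RCA has been published on Azure status history, so the original text has been transcribed into this article.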

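What happened

Between 05:12 and 11:45 UTC on Wednesday, October 13, 2021, management operations (start, create, update, delete, and so on) on Windows virtual machines failed. Services that depend on Windows virtual machines may also have seen similar failures when creating resources.

Non-Windows virtual machines and Windows virtual machines that were already running were not affected by this outage.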
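This article quotes the impact window in both JST and UTC, so a quick conversion can help when cross-checking the figures. The following is a minimal sketch using only the Python standard library; the helper name `utc_to_jst` is introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

# JST is a fixed UTC+9 offset with no daylight saving time.
JST = timezone(timedelta(hours=9), name="JST")


def utc_to_jst(year, month, day, hour, minute):
    """Return the JST wall-clock time for a given UTC timestamp."""
    utc_time = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(JST)


# Impact window from the preliminary report (UTC).
start_jst = utc_to_jst(2021, 10, 13, 5, 12)
end_jst = utc_to_jst(2021, 10, 13, 11, 45)

print(start_jst.strftime("%Y-%m-%d %H:%M %Z"))  # 2021-10-13 14:12 JST
print(end_jst.strftime("%Y-%m-%d %H:%M %Z"))    # 2021-10-13 20:45 JST
```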

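Root cause

The Microsoft Azure engineering team focused its investigation on the backend compute resource provider (CRP). It found that the required VM Guest Agent could not be queried from the repository, which caused the calls made during service management operations to fail.

At the time, the VM Guest Agent Extension publishing architecture was being migrated, as part of a migration of legacy service management backend systems, to a new platform that leverages the latest Azure Resource Manager (ARM) capabilities.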

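Actions taken

Microsoft mitigated the impact by marking the affected extensions to the correct expected level (public, according to the official statement). After the updates were completed, engineers verified that operations were working correctly again.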
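To make the failure mode and the fix more concrete, here is a toy model of it. The RCA quoted later in this article says the extension ended up visible only to the publishing subscription, so lookups returned zero results. Everything in this sketch (the `ExtensionEntry` class, the `"public"` and `"publisher_only"` labels, the subscription names, the version string) is a hypothetical stand-in and not Azure's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class ExtensionEntry:
    name: str
    version: str
    visibility: str              # "public" or "publisher_only" (hypothetical labels)
    publisher_subscription: str


def query_extension(repository, name, caller_subscription):
    """Return the entries named `name` that the caller is allowed to see."""
    return [
        entry for entry in repository
        if entry.name == name
        and (entry.visibility == "public"
             or entry.publisher_subscription == caller_subscription)
    ]


def start_vm(repository, caller_subscription):
    """Simulate a management operation that needs to locate the VM agent."""
    matches = query_extension(repository, "VMGuestAgent", caller_subscription)
    if not matches:
        raise LookupError("VM agent not found in repository")
    return f"started with agent {matches[0].version}"


# State after the faulty migration: only the publisher can see the entry.
repo = [ExtensionEntry("VMGuestAgent", "1.0", "publisher_only", "publisher-sub")]

try:
    start_vm(repo, "customer-sub")
except LookupError as err:
    print("failed:", err)

# Mitigation: mark the extension as public, after which the lookup succeeds.
repo[0].visibility = "public"
print(start_vm(repo, "customer-sub"))
```

The point of the sketch is that the entry never disappeared from the repository. Only its visibility changed, which is enough to make every customer-side lookup come back empty.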

Official statements from Microsoft

Information about this outage is published on Azure status history.

Azure status history | Microsoft Azure
You can check the status history of Microsoft Azure services here.

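Thursday, October 14, 2021, 01:30 (JST)

At 01:30 on Thursday, October 14, 2021, the following information was published on Azure status history.
At this stage the investigation to establish the full root cause is still ongoing, and a root cause analysis (RCA) is expected to be published within 72 hours.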

Summary of impact

Between 05:12 UTC and 11:45 UTC on 13 Oct 2021, a subset of customers using Windows Virtual Machines may have received failure notifications when performing service management operations – such as start, create, update, delete. Deployments of new VMs and any updates to extensions may have failed. Non-Windows Virtual Machines, and existing running Windows Virtual Machines should not have been impacted by this issue. Additionally, services with dependencies on Windows VMs may have also experienced similar failures when creating resources.

Preliminary root cause

We identified that calls made during service management operations were failing as a required artifact version data could not be queried. Our investigation focused on the backend compute resource provider (CRP) to determine why the calls were failing, and identified that a required VMGuestAgent could not be queried from the repository. The VM Guest Agent Extension publishing architecture was being migrated (as part of a migration of legacy service management backend systems) to a new platform which leverages the latest Azure Resource Manager (ARM) capabilities.

Mitigation

We mitigated impact by marking the appropriate extensions to the correct expected level (in this case, public). Engineers proactively verified the return to full success rate for operations after the updates were completed.

Next Steps

We will continue to investigate to establish the full root cause and prevent future occurrences. A full Root Cause Analysis (RCA) will be published within 72 hours.

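Saturday, October 16, 2021, 14:30 (JST)

At 14:30 on Saturday, October 16, 2021, the root cause analysis (RCA) was published on Azure status history. Note that the RCA gives the impact window as 06:27 to 12:42 UTC, which differs from the 05:12 to 11:45 UTC stated in the preliminary report.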

Summary of Impact

Between 06:27 UTC and 12:42 UTC on 13 Oct 2021, a subset of customers using Windows-based Virtual Machines (Windows VM) may have received failure notifications when performing service management operations – such as start, create, update, delete. Deployments of new VMs and any updates to extensions may have failed. Management operations on Availability Set, Virtual Machine Scale Set were also impacted.

Non-Windows Virtual Machines were unaffected, however services with dependencies on Windows VMs may have also experienced similar failures when creating resources.

Root Cause

Windows-based Virtual Machines utilize the Windows Virtual Machine Agent (VM Agent) extension, which is used to manage interactions between the Virtual Machine and the Azure Fabric.

When creating and updating Windows VMs, the Compute Resource Provider (CRP) has a dependency upon the Platform Image Repository to retrieve download locations for the latest version of the VM Agent package. Using this information, the VM Agent will update itself to the latest version in the VM.

As part of the journey to move all classic resources to Azure Resource Manager (ARM), we are migrating the image and extension publishers to the regional ARM publishing pipeline. Approximately 20% of all extensions have been successfully migrated.

At approximately 06:27 UTC, tooling provided an ARM template for use in performing these migrations. This tooling did not consider an edge case and as an unintended consequence marked the Windows VM Agent extension as visible to the publishing subscription only in the ARM regional service after migration. As the result, VM management operations started to fail after receiving zero results from the regional Platform Image Repositories.

The outcome of this was that service management operations (start, stop, create, delete, etc.) on customers Windows VM were unable to locate the Windows VMAgent extension, and thus unable to complete successfully.

Part of our change management process is to leverage the Safe Deployment Practice (SDP) framework. (https://azure.microsoft.com/en-us/blog/advancing-safe-deployment-practices/). In this case, some of the functionality of our classic infrastructure is incompatible with the SDP framework. This incompatibility underscores the importance in which we are treating the complete migration to ARM. Once the migration is complete, it will allow us to make all changes using the SDP framework without using bespoke tools that support classic resources only.

Mitigation

Determining the root cause took an extended period due to multiple releases for Azure components being in flight simultaneously on the platform, each of which had to be investigated. Additionally, involving subject matter experts (SMEs) for each of the involved components added to this time as we needed to eliminate multiple possible scenarios to ensure we could triage the underlying cause.

Once we determined the issue, and reviewed multiple mitigation options, we mitigated impact by making the extension public in one region at first and validating the results, ensuring no further impact would be caused by a surge in requests for Virtual Machines. Once validated, we started rolling out the change to the new pipeline region-by-region, mitigating the issue. Engineers monitored the platform success rate for operations after the changes were completed.
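The region-by-region approach described above can be sketched as a simple loop that validates each region before moving on, with the first region acting as the canary. This is only an illustration under stated assumptions: the region names, `make_public`, and `operations_succeed` are hypothetical stand-ins, not Microsoft's tooling.

```python
def staged_rollout(regions, apply_change, validate):
    """Apply a change one region at a time, validating before moving on.

    Stops at the first region whose validation fails and returns the
    regions that were completed successfully.
    """
    completed = []
    for region in regions:
        apply_change(region)
        if not validate(region):
            print(f"validation failed in {region}; halting rollout")
            break
        completed.append(region)
    return completed


# Hypothetical per-region visibility state before the mitigation.
visibility = {
    "region-a": "publisher_only",
    "region-b": "publisher_only",
    "region-c": "publisher_only",
}


def make_public(region):
    visibility[region] = "public"


def operations_succeed(region):
    # Stand-in for checking the operation success rate in that region.
    return visibility[region] == "public"


done = staged_rollout(list(visibility), make_public, operations_succeed)
print(done)
```

Halting on the first failed validation keeps a bad change confined to a single region, which matches the intent of validating one region first.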

Next Steps

We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • The migration of remaining packages in this category (including the Linux version of the VM Agent) is on hold until all repairs are in place
  • Additional pre-check and post-checks are being developed and implemented
  • VM operation resilience to failures when VM agent cannot be found
  • Engineering is also evaluating other safeguards to flight each extension type and prevent any potential negative impact with the remainder of migration.
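As one possible reading of the "post-checks" item above, a migration step could be followed by a check that each migrated extension is still discoverable from a non-publisher point of view. The sketch below is speculative: the catalog contents, `OtherExtension`, and both function names are hypothetical, and nothing here reflects the checks Microsoft actually implemented.

```python
def post_migration_check(query_as_customer, extension_names):
    """Return the extensions a non-publisher caller can no longer find."""
    return [name for name in extension_names if not query_as_customer(name)]


# Hypothetical catalog state after a migration: name -> visibility.
catalog = {"VMGuestAgent": "publisher_only", "OtherExtension": "public"}


def query_as_customer(name):
    # In this toy model a customer subscription only sees public entries.
    return [name] if catalog.get(name) == "public" else []


missing = post_migration_check(query_as_customer, ["VMGuestAgent", "OtherExtension"])
if missing:
    print("post-check failed; fix visibility or roll back:", missing)
```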
