Microsoft 'fesses to code blunder in Azure Container Apps

Misconfiguration led to hours of pain for engineers as bootstrap service caught in a loop

A code deployment for Azure Container Apps that contained a misconfiguration triggered prolonged log data access issues, according to a technical incident report from Microsoft.

The incident, which began at 23:15 UTC on July 6 and ran until 09:00 the following day, meant a subset of data for Azure Monitor Log Analytics and Microsoft Sentinel "failed to ingest."

On top of that, platform logs collected via Diagnostic Settings didn't route some data to "customer destinations" including Log Analytics, Storage, Event Hub, and Marketplace.

And the Security Operations Center functionality in Sentinel, "including hunting queries, workbooks with custom queries, and notebooks that queried impacted tables with date range inclusive of the logs data that we failed to ingest, might have returned partial or empty results."

The report adds: "In cases where Event or Security Event tables were impacted, incident investigations of a correlated incident may have showed partial or empty results. Unfortunately, this issue impacted one or more of your Azure resources."

So what went wrong and why?

"A code deployment for the Azure Container Apps service was started on 3 July 2023 via the normal Safe Deployment Practices (SDP), first rolling out to Azure canary and staging regions. This version contained a misconfiguration that blocked the service from starting normally. Due to the misconfiguration, the service bootstrap code threw an exception, and was automatically restarted."

This resulted in the bootstrap service being "stuck in a loop," restarting every five to ten seconds. Each time the service restarted, it provided config information to the telemetry agents also installed on the service hosts, which interpreted this as a configuration change, so they automatically exited their existing processes and restarted as well.

"Three separate instances of the agent telemetry host, per applications host, were now also restarting every five to ten seconds," adds Microsoft in the report.

On each startup, the telemetry agent then communicated with the control plane to download the latest telemetry configuration version – something that would typically happen only once over several days. Yet as the deployment of the Container Apps service ran on, several hundred hosts had their telemetry agents nagging the control plane repeatedly.
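To put rough numbers on that, here's a back-of-the-envelope sketch in Python. The per-host figures (three agent instances, a restart every five to ten seconds) come from Microsoft's account; the fleet size of 300 hosts and a "normal" fetch interval of once every three days are assumptions standing in for "several hundred hosts" and "once over several days."

```python
# Rough estimate of the extra load the crash loop put on the telemetry
# control plane. Figures marked "assumed" are illustrative stand-ins for
# the vaguer numbers in Microsoft's report.

hosts = 300                # assumed: "several hundred hosts"
agents_per_host = 3        # from the report: three telemetry agent instances per host
restart_interval_s = 7.5   # midpoint of the reported five-to-ten-second restart loop

# Each agent restart triggered one config download from the control plane.
crash_loop_rps = hosts * agents_per_host / restart_interval_s

# What the same fleet would normally generate: one fetch per agent every few days.
normal_interval_s = 3 * 24 * 3600   # assumed: once every ~3 days
normal_rps = hosts * agents_per_host / normal_interval_s

print(f"crash-loop load: ~{crash_loop_rps:.0f} config fetches per second")
print(f"normal load:     ~{normal_rps:.4f} config fetches per second")
print(f"amplification:   ~{crash_loop_rps / normal_rps:,.0f}x")
```

On those assumptions, the fleet goes from one config fetch every few minutes to more than a hundred per second, an amplification in the tens of thousands, which goes some way to explaining how a canary-and-staging-only rollout could saturate a global control plane.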

The buggy deployment was detected and stopped before it reached production regions, and the Container Apps team began deploying a fix to the canary and staging regions to deal with the misconfiguration.

"However, the aggregate rate of requests from the services that received the build with the misconfiguration exhausted capacity on the telemetry control plane. The telemetry control plane is a global service, used by services running in all public regions of Azure.

"As capacity on the control plane was saturated, other services involved in ingestion of telemetry, such as the ingestion front doors and the pipeline services that route data between services internally, began to fail as their operations against the telemetry control plane were either rejected or timed out. The design of the telemetry control plane as a single point of failure is a known risk, and investment to eliminate this risk has been underway in Azure Monitor to design this risk out of the system."

So the misconfigured code wasn't deployed to production, but it impacted production anyway. Microsoft says that at 23:15 UTC on July 6, "external customer impact started, as cached data started to expire".

Closing off the mea culpa, Microsoft said it is aware that "trust is earned and must be maintained" and that "data retention is a fundamental responsibility of the Microsoft cloud, including every engineer working on every cloud service."

To make this type of incident less likely, the vendor says it has worked to ensure telemetry control plane services are running with extra capacity, and has committed to creating more alerts on metrics that point to unusual failure patterns in API calls.

Microsoft also said it is adding new positive and negative caching to the control plane; will apply more throttling and circuit breaker patterns to core telemetry control plane APIs; and will create "isolation" between internal and external-facing services that use the telemetry control plane.
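Microsoft hasn't published how it will implement those controls, but the circuit breaker pattern it name-checks is straightforward: after enough consecutive failures, callers stop hammering the struggling dependency and fail fast, retrying only after a cooldown. A minimal sketch in Python, with hypothetical names and thresholds, might look like this:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many consecutive failures the breaker
    'opens' and callers fail fast instead of piling more load onto a struggling
    dependency. After a cooldown, one trial call is allowed through to probe
    whether the dependency has recovered."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None   # None means closed (calls flow normally)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # A success closes the breaker and resets the failure count.
            self.failures = 0
            self.opened_at = None
            return result


# Hypothetical usage: wrapping calls to a telemetry config endpoint that is
# timing out, as the control plane did during the incident.
def fetch_telemetry_config():
    raise TimeoutError("control plane timed out")   # simulated outage

breaker = CircuitBreaker()
for _ in range(8):
    try:
        breaker.call(fetch_telemetry_config)
    except Exception as exc:
        print(type(exc).__name__, ":", exc)
```

The caching Microsoft mentions attacks the same problem from the other side: by caching both successful responses and "not found" results, the control plane can answer repeated requests from memory rather than doing the work every time.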

It's good to know what is going on inside the Azure machine, and Microsoft is highly transparent in these situations. It would be better if reports like this were not required in the first place – we can but live in hope. ®
