Microsoft's Azure Kubernetes Service mucked my cluster!

Redmond blames user error, invites further feedback to improve its service


Microsoft's Azure Kubernetes Service (AKS) was launched to world+dog in June, but a few disgruntled customers say the managed container confection isn't fully baked yet.

In a blog post published on Monday, Prashant Deva, creator of an app and infrastructure monitoring service called DripStat, savaged AKS, calling it "an alpha service marked as GA [generally available] by Microsoft."

Deva said he moved his company's production workload to AKS last month, and has been plagued by random DNS failures for domains outside of Azure and hostnames inside the Azure Virtual Network.

He characterized the response from Microsoft support – advice not to use excessive memory and CPU resources – as ridiculous, and said Microsoft failed to respond when told the DNS issues occurred mainly during application startup when memory and CPU usage is minimal.

Then there was the AKS Kubernetes Dashboard, which crashed after a few days and required a reboot of the Kubernetes API Server to fix. And this happened, Deva said, on a daily basis, which meant the constant filing of support tickets.

Have you tried turning your infrastructure off and on again?

When Docker containers crashed, the underlying virtual machine would fail too, according to Deva. Recovery required manually rebooting the VM from the Azure portal. He described the response he got from Azure Support thus: "Yeah this is your problem. Just make sure your containers never crash."

He recounts an unrecoverable cluster crash, and claims the service-level agreement (SLA), covering the VMs underlying AKS but not AKS itself, was violated.

"Azure Support has been the worst support experience of my life," he said, noting that he's moved to Google Cloud Platform for its Kubernetes service. "...Ignoring the SLA violation is downright fraudulent behavior."

Reached via Twitter's private message system, Deva said his experience was limited to AKS and didn't reflect other Azure services.

"This has been very poorly handled by Microsoft," he told The Register. "The worst part is them trying to blame the user for issues on their end."

In an email to El Reg, a Microsoft spokesperson attributed the problem to Deva running workloads without a memory limit:

In the course of an in-depth engagement by our engineering team, we determined that the customer’s workloads had been overscheduled on the nodes in his cluster, crowding out system services and causing undesirable behavior.

We provided recommendations for how the customer could prevent this from reoccurring and have made corresponding improvements in AKS to ensure that customers cannot inadvertently get into this situation again. We are also continuing to invest in providing better diagnostic and monitoring tools so that customers and our own support engineers can more quickly determine what might be causing problems in a customer’s environment. We are always concerned if a customer has an issue with AKS and we will use this feedback to continue to improve the service and our support process.

An individual posting under the name QiKe, claiming to be an engineering lead on AKS, offered a similar explanation in a post to Hacker News.

Deva is not the only AKS customer to report misadventures. Colin Jemmott, senior data scientist at Seismic Software, observed via Twitter, "This matches my experience with @Azure managed Kubernetes (AKS)."

In late June, Wojciech Barczyński, a senior software engineer at SMACC, a deep learning and finance biz, described a number of issues that arose using AKS. He hasn't jumped ship, but he advises people to skip the "first bumpy GA months" and wait until the service becomes more stable.

"The AKS team gets more and more experience with time and the growing number of clients," he observed. "So, the service improves fast."

At the same time, AKS has fans. One person chiming in on the Hacker News thread remarked, "I've had wildly different results. My shop wasn't large by any means but Azure worked pretty much perfectly for us."

We should all be so fortunate. ®



Biting the hand that feeds IT © 1998–2022