AWS reveals it broke itself by exceeding OS thread limits, sysadmins weren’t familiar with some workarounds

First solution: run on bigger servers to reduce chatter in the Kinesis fleet


Amazon Web Services has revealed that adding capacity to an already complex system was the reason its US-EAST-1 region took an unplanned and rather inconvenient break last week.

The short version of the story is that the company’s Kinesis service, which is used directly by customers and underpins other parts of AWS’ own operations, added more capacity. Servers in the Kinesis front-end fleet need to communicate with each other, and to do so each server creates a thread for every other server in that fleet. AWS says there are “many thousands of servers” involved and that when new servers are added it can take up to an hour for news of the additions to reach the entire fleet.

Adding capacity therefore “caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.”
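
To make that arithmetic concrete, here is a minimal sketch of the thread-per-peer model described above. The fleet sizes, the thread limit and the function names are all invented for illustration; AWS has not published the real figures.

```python
def peer_threads(fleet_size: int) -> int:
    """Threads one front-end server needs if it keeps one thread per peer."""
    return max(fleet_size - 1, 0)


def exceeds_limit(fleet_size: int, base_threads: int, thread_limit: int) -> bool:
    """True if peer threads plus the server's other threads pass the limit."""
    return base_threads + peer_threads(fleet_size) > thread_limit


# Hypothetical numbers, not AWS's: a 10,000-thread cap per server
# and 500 threads already used for other work.
THREAD_LIMIT = 10_000
BASE_THREADS = 500

before = exceeds_limit(9_000, BASE_THREADS, THREAD_LIMIT)  # 500 + 8,999 threads
after = exceeds_limit(9_600, BASE_THREADS, THREAD_LIMIT)   # 500 + 9,599 threads
print(before, after)
```

Because every server in this model carries the same peer count, the whole fleet crosses the limit together once the new servers become known, which is consistent with AWS's statement that all servers were affected.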


AWS figured that out, but also learned that fixing the problem meant rebooting all of Kinesis.

But it was only possible to bring “a few hundred” servers back at a time and, as we’ve seen above, Kinesis uses “many thousands of servers”, which explains why recovery from the outage was slow.
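
A rough sketch of why that mismatch hurts, using invented numbers: AWS gave neither an exact fleet size nor an exact batch size, so the 8,000 and 300 below are placeholders and the function name is hypothetical.

```python
import math


def restart_waves(fleet_size: int, batch_size: int) -> int:
    """Number of restart batches needed if only batch_size servers return at once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(fleet_size / batch_size)


# Placeholder figures standing in for "many thousands" and "a few hundred".
waves = restart_waves(fleet_size=8_000, batch_size=300)
print(waves)
```

Whatever the real figures were, the number of batches grows in step with fleet size, so a fleet of many thousands restarted a few hundred at a time needs dozens of rounds.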

The whole sad story is explained in much greater detail in this AWS post, which also explains how it plans to avoid such incidents in future.

Plan one: use bigger servers.

“In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet,” the post says, explaining that doing so “will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet.”

The company also plans new “fine-grained alarming for thread consumption in the service” and plans “an increase in thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well.”
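
AWS has not said how that alarming will work. As a hedged illustration only, a threshold-based check might look like the sketch below; the class name, the 70 and 90 per cent ratios and the 10,000-thread limit are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ThreadAlarm:
    """Hypothetical alarm that grades thread usage against a configured limit."""
    limit: int
    warn_ratio: float = 0.7      # assumed threshold, not from AWS
    critical_ratio: float = 0.9  # assumed threshold, not from AWS

    def level(self, threads_in_use: int) -> str:
        usage = threads_in_use / self.limit
        if usage >= self.critical_ratio:
            return "critical"
        if usage >= self.warn_ratio:
            return "warn"
        return "ok"


alarm = ThreadAlarm(limit=10_000)
print(alarm.level(6_000), alarm.level(7_500), alarm.level(9_500))
```

The point of grading usage rather than testing only for failure is that operators would hear about shrinking headroom before a capacity addition pushes servers over the limit.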

Also on the agenda: isolating in-demand services like CloudWatch to separate dedicated Kinesis servers.

Dashboard dashed by dependencies

The TIFU-like post also outlines why Amazon's dashboards offered only scanty info about the incident – because they, too, depend on a service that depends on Kinesis.

AWS has built a dependency-lite way to get info to the Service Health Dashboard it uses as a public status page. The post says it worked as expected, but “we encountered several delays during the earlier part of the event … as it is a more manual and less familiar tool for our support operators.”

The cloud giant therefore leaned on the Personal Health Dashboard, which is visible only to impacted customers.

The post ends with an apology: “While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end users, and their businesses.”

“We will do everything we can to learn from this event and use it to improve our availability even further.” ®

Biting the hand that feeds IT © 1998–2022