Worst-case Scenarios? You've got it: Gremlin makes totally trashing your apps even easier

Chaos merchant's failure-as-a-service tests system resilience

Thu 26 Sep 2019 // 17:00 UTC

Chaos-engineering company Gremlin has launched Scenarios – "templates of real-world outages" that make it easier to wreck your applications.

Gremlin announced the product at Chaos Conf 2019 in San Francisco. Scenarios include traffic spikes, for testing what happens under severe load; unreliable networks, for when your microservice API calls start taking ages to respond; and region evacuation, for when a cloud region becomes unavailable.

The idea of chaos engineering is to cause failure deliberately in order to find out whether your application or system is resilient. Chaos-engineering tools can consume 100 per cent of CPU, shut down a percentage of your hosts, make DNS calls unresponsive, or introduce severe latency into networks, so you can discover whether planned resiliency measures, such as failover systems, actually work as designed – in the same way as you validate a backup by doing a test restore.
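To make the failover point concrete, here is a minimal Python sketch of an injected-latency test. It is not Gremlin's implementation or API: the helper names `with_latency` and `call_with_failover`, and the timeout values, are assumptions made up for illustration. The delay is added in-process around a function call, whereas real tools inject it at the network level.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_s):
    """Wrap func so every call is delayed by delay_s seconds (the injected fault)."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return wrapper


def call_with_failover(primary, fallback, timeout_s):
    """Call primary; if it has not answered within timeout_s, use fallback instead."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback()
    finally:
        # Do not block on the slow call; its thread finishes in the background.
        pool.shutdown(wait=False)


def primary_service():
    return "primary"


def backup_service():
    return "fallback"


if __name__ == "__main__":
    slow_primary = with_latency(primary_service, 0.2)
    print(call_with_failover(slow_primary, backup_service, timeout_s=0.05))
```

If the failover path is wired up correctly, the slow primary is abandoned and the backup answers. If it is not, the test shows that before a real outage does.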

Failure types in the Gremlin user interface

We spoke to Gremlin's Senior Site Reliability Engineer (SRE), Tammy Butow, at the QCon conference in London. "The history starts with Netflix when they were moving to AWS," she told us. "They thought, how do we make sure that this does work? They started by creating Chaos Monkey, which they later open-sourced. That was about, if we shut down a server, is everything OK? That helped them provide feedback to AWS."

Chaos Monkey is free but can be complex to deploy.

"We're trying to prevent downtime and we're trying to prevent data loss," Butow added. "Back when I worked at National Australia Bank we did disaster recovery tests. You have to do those to get your banking licence. But if you're in a tech startup, there's nobody that holds you accountable, to prove that your system is resilient and that you're looking after your customer's data."

The failures injected by Gremlin are not simulated, except in the sense that they can be paused or removed. "If you do it the wrong way it can be dangerous," said Butow.

The key is to start small. The "blast radius" of a test determines how wide its impact is. "I like to do a CPU attack first. It's the Hello World of chaos engineering," Butow said.
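For a sense of what a CPU attack amounts to, a rough single-core version fits in a few lines of Python. This is a sketch only, not how Gremlin does it: the function name `burn_cpu` is invented here, and a real tool would target chosen hosts, load every core, and let you halt the attack from a control plane.

```python
import time


def burn_cpu(duration_s: float) -> int:
    """Busy-loop on one core for roughly duration_s seconds.

    Returns the number of loop iterations completed. The time limit keeps
    the experiment small: the attack always ends by itself.
    """
    deadline = time.monotonic() + duration_s
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work, just to keep the core busy
    return iterations


if __name__ == "__main__":
    done = burn_cpu(2.0)
    print(f"spun for about 2 seconds, {done} iterations")
```

Run it on a test box while watching your monitoring: if no dashboard or alert notices the spike, that is the first finding.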

You can begin by taking down just one or two servers, then expand to taking down whole services or an entire region. A service like Gremlin provides an API and a control plane, so you can automate and schedule tests.
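The "start small, then widen" approach can be sketched as a function that picks a percentage of hosts to target. Everything here is hypothetical: the name `pick_targets`, the host names and the staging percentages are illustrative assumptions, not part of Gremlin's API.

```python
import math
import random


def pick_targets(hosts, percent, seed=None):
    """Pick roughly `percent` per cent of hosts at random, and at least one.

    `seed` makes the choice repeatable so that an experiment can be re-run
    against the same machines.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    if not hosts:
        return []
    count = min(len(hosts), max(1, math.ceil(len(hosts) * percent / 100)))
    rng = random.Random(seed)
    return sorted(rng.sample(list(hosts), count))


if __name__ == "__main__":
    fleet = [f"web-{i:02d}" for i in range(20)]  # made-up host names
    for percent in (5, 25, 100):  # widen the blast radius in stages
        targets = pick_targets(fleet, percent, seed=42)
        print(f"{percent:>3}% -> {len(targets)} hosts")
```

Each stage should only run once the previous, smaller one has passed without surprises.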

As in the security world, many failures arise when people use services in unexpected ways. APIs are a common example. "When people build APIs they don't think anyone's going to abuse the API," said Butow. "As an SRE I'm always looking for how can things break."

That you cannot call a system resilient until you have seen it survive massive failures is common sense, but as with backups, many organisations still end up learning the hard way. ®
