In the wake of last month's distributed denial of service (DDoS) attack against Dyn, a DNS management service, Google engineers want to remind application developers that self-harm represents a more realistic risk.
Just as US citizens have a greater chance of being crushed by falling furniture than of dying at the hands of terrorists, developers are more likely to DDoS themselves than suffer downtime from an external attack, or so claim Google site reliability engineers Dave Rensin and Adrian Hilton in a blog post.
The Register asked Google if it could quantify that claim with specific figures, but we haven't yet heard back.
Rensin and Hilton observe that software engineers commonly assume that system load will be evenly distributed and fail to plan for the alternative.
As an example, the pair describe a mobile app that periodically fetches data from a backend server. "Because the information isn’t super time sensitive, you write the client to sync every 15 minutes," they suggest. "Of course, you don’t want a momentary hiccup in network coverage to force you to wait an extra 15 minutes for the information, so you also write your app to retry every 60 seconds in the event of an error."
The problem with this approach may become apparent if and when there's a service disruption, which happens on occasion. When your backend comes back online, it gets hit with the expected periodic requests for data and with any delayed requests for data.
The result is double the expected traffic. Even with only one minute of downtime, the traffic load becomes unbalanced: two-fifteenths of the app's users on a 15-minute sync schedule get locked into the same sync timing.
"Thus, in this state, for any given 15-minute period you'll experience normal load for 13 minutes, no load for one minute and 2x load for one minute," said Rensin and Hilton.
And, as the pair observe, most service disruptions last longer than a minute. With 15 minutes of downtime, all your users will get pushed into fetching data at the moment service is restored, meaning you'd need to provision for 15x normal capacity. Add the possibility that repeated attempts to establish a connection can stack up on load balancers and there's even more pressure on the backend. So, preparing for 20x traffic or more may be necessary.
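The arithmetic behind those figures can be checked with a small simulation. What follows is a rough sketch rather than anything taken from the Google post: it assumes users are spread evenly into 15 one-minute cohorts, that each cohort counts as one unit of load, and that a failed sync is retried every 60 seconds until it succeeds. The function name `load_after_outage` and its parameters are made up for illustration.

```python
from collections import Counter

PERIOD = 15  # minutes between scheduled syncs, as in the article's example


def load_after_outage(outage_start, outage_minutes, period=PERIOD):
    """Return the per-minute load for the first full sync period after recovery.

    Assumption: one cohort of users per minute of the period, each cohort
    contributing one unit of load when it syncs successfully.
    """
    outage_end = outage_start + outage_minutes
    next_sync = list(range(period))  # cohort k first syncs at minute k
    load = Counter()

    for minute in range(outage_end + period):
        for cohort in range(period):
            if next_sync[cohort] != minute:
                continue
            if outage_start <= minute < outage_end:
                # Backend is down: retry in 60 seconds.
                next_sync[cohort] = minute + 1
            else:
                # Sync succeeds and the 15-minute timer restarts from now.
                load[minute] += 1
                next_sync[cohort] = minute + period

    return [load[m] for m in range(outage_end, outage_end + period)]


one_minute = load_after_outage(outage_start=5, outage_minutes=1)
full_period = load_after_outage(outage_start=5, outage_minutes=15)
print(Counter(one_minute))  # how many minutes see 0x, 1x and 2x load
print(max(full_period))     # peak load after a 15-minute outage
```

Under these assumptions the one-minute outage reproduces the 13/1/1 split Rensin and Hilton describe, and the 15-minute outage concentrates every cohort into the first minute after recovery.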
"In the worst case, the increased load might cause your servers to run out of memory or other resources and crash again," said Rensin and Hilton. "Congratulations, you’ve been DDoS’d by your own app!"
To avoid shooting yourself in the foot, the pair advise exponential backoff, adding jitter, and marking retry requests.
Exponential backoff involves doubling the delay after every failed retry, so that reconnection attempts are spaced further and further apart.
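As a minimal sketch of the idea, assuming the article's 60-second first retry and, as our own assumption, a cap at the 15-minute sync interval so the delay never exceeds a normal sync cycle, the retry schedule could be computed like this. The function name `backoff_delays` and its parameters are hypothetical.

```python
def backoff_delays(base_seconds=60.0, max_retries=5, cap_seconds=900.0):
    """Return the wait before each retry, doubling every time up to a cap.

    base_seconds matches the 60-second retry in the article's example;
    cap_seconds (15 minutes) is an assumption, not something the post specifies.
    """
    return [
        min(cap_seconds, base_seconds * 2 ** attempt)
        for attempt in range(max_retries)
    ]


print(backoff_delays())  # [60.0, 120.0, 240.0, 480.0, 900.0]
```

The fifth delay would be 960 seconds uncapped, so the cap trims it to 900.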
Jitter provides another way to vary the timing of retry attempts, by adding or subtracting a random amount of time to an app's connection schedule, so that clients knocked into lockstep by an outage drift apart again.
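A simple way to picture jitter is as a random offset applied to whatever delay the client was going to use anyway. The sketch below is illustrative only: the helper name `jittered_delay` and the 25 per cent spread are assumptions, and a real client would tune the range to its own traffic.

```python
import random


def jittered_delay(delay_seconds, jitter_fraction=0.25, rng=random):
    """Return delay_seconds shifted by a random offset of up to +/- jitter_fraction.

    jitter_fraction=0.25 is an arbitrary illustrative choice.
    rng can be a seeded random.Random instance for repeatable tests.
    """
    spread = delay_seconds * jitter_fraction
    return delay_seconds + rng.uniform(-spread, spread)


# Ten clients that would all have retried after exactly 60 seconds
# now retry at scattered times between 45 and 75 seconds.
rng = random.Random(42)
print([round(jittered_delay(60.0, rng=rng), 1) for _ in range(10)])
```

Because each client draws its own offset, requests that would have landed in the same instant are spread across a window instead.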
Finally, by tracking retry attempts with a counter, application logic can prioritize which clients connect when there's a queue of unfulfilled connection attempts as a result of downtime. ®
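To illustrate retry marking, here is a hedged sketch of a server-side queue that orders pending requests by the retry counter each client sends. The article does not say which requests should go first, so the policy here (fewest retries first, ties broken by arrival order) is purely an assumption, as are the names `QueuedRequest` and `RetryAwareQueue`.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class QueuedRequest:
    retry_count: int                      # sent by the client with each request
    arrival: int                          # tie-breaker: earlier arrivals first
    client_id: str = field(compare=False)


class RetryAwareQueue:
    """Priority queue keyed on the client's retry counter (assumed policy)."""

    def __init__(self):
        self._heap = []
        self._arrivals = count()

    def push(self, client_id, retry_count):
        item = QueuedRequest(retry_count, next(self._arrivals), client_id)
        heapq.heappush(self._heap, item)

    def pop(self):
        return heapq.heappop(self._heap).client_id

    def __len__(self):
        return len(self._heap)


queue = RetryAwareQueue()
queue.push("a", retry_count=3)
queue.push("b", retry_count=0)
queue.push("c", retry_count=1)
queue.push("d", retry_count=0)
print([queue.pop() for _ in range(len(queue))])  # ['b', 'd', 'c', 'a']
```

Reversing the policy, so that clients that have waited longest are served first, only requires negating `retry_count` when pushing.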