SaaS

Google BigQuery TITSUP caused by failure to scale-yer workloads

Engineers went head down, bum up but zipped lips gave users the … heebie jeebies

Mon 14 Nov 2016 // 01:57 UTC

A four-hour outage of Google's BigQuery streaming service has taught the cloud aspirant two harsh lessons: its cloud doesn't always scale as well as it would like, and it needs to explain itself better during outages.

The Alphabet subsidiary's trouble started last Tuesday when rising demand for the BigQuery authorization service “caused a surge in requests … exceeding their current capacity.” Or as we like to say here at El Reg, it experienced a Total Inability To Support Usual Performance and went TITSUP.

As Google explains, “The BigQuery streaming service requires authorization checks to verify that it is streaming data from an authorized entity to a table that entity has permissions to access.” There's a cache between the authorization service and its backend, but “because BigQuery does not cache failed authorization attempts, this overload meant that new streaming requests would require re-authorization, thereby further increasing load on the authorization backend.”

As authorization requests piled up, the strain on the already-stressed authorization backend meant “continued and sustained authorization failures which propagated into streaming request and query failures.”
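The feedback loop Google describes can be illustrated with a toy model. What follows is a minimal sketch under our own assumptions, not Google's code: `AuthCache`, `negative_ttl` and `overloaded_backend` are hypothetical names, and we treat a backend error as a "failed" check. When failures are never cached, every retry lands on the struggling backend; when failures are remembered briefly, most retries are absorbed by the cache.

```python
import time


class AuthCache:
    """Toy cache in front of a hypothetical authorization backend."""

    def __init__(self, backend, ttl=60.0, negative_ttl=0.0, clock=time.monotonic):
        self._backend = backend            # callable: key -> bool, may raise
        self._ttl = ttl                    # lifetime of successful checks
        self._negative_ttl = negative_ttl  # lifetime of failed checks (0 = never cache)
        self._clock = clock
        self._entries = {}                 # key -> (allowed, expires_at)
        self.backend_calls = 0             # how often the backend was actually hit

    def is_authorized(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]                  # served from cache, backend untouched

        self.backend_calls += 1
        try:
            allowed = bool(self._backend(key))
        except Exception:
            # Assumption: an overloaded or erroring backend counts as a failure.
            allowed = False

        ttl = self._ttl if allowed else self._negative_ttl
        if ttl > 0:
            self._entries[key] = (allowed, now + ttl)
        return allowed


def overloaded_backend(key):
    """Stand-in for a backend that is over capacity."""
    raise TimeoutError("authorization backend over capacity")


# Same 100 retries, with and without short-lived caching of failures.
no_negative = AuthCache(overloaded_backend, negative_ttl=0.0)
with_negative = AuthCache(overloaded_backend, negative_ttl=5.0)
for _ in range(100):
    no_negative.is_authorized("writer@table")
    with_negative.is_authorized("writer@table")

print(no_negative.backend_calls, with_negative.backend_calls)  # prints: 100 1
```

Caching failures is a trade-off rather than a cure: a denial that is remembered too long keeps legitimate callers locked out after the backend recovers, which is why the sketch uses a far shorter lifetime for failures than for successes.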

All of which meant that for four hours last Tuesday afternoon (US Pacific time) 73 per cent of BigQuery streaming inserts failed with a 503 error code. At the peak of the problem, the failure rate was 93 per cent.

Google's now figured out that its cache wasn't big enough and that the authorization backend lacked capacity, fascinating admissions given that the premise of cloud is that it will just scale as demand increases. Google seems to have missed some tricks here, as among the tactics in its remediation plan is “improving the monitoring of available capacity on the authorization backend and will add additional alerting so capacity issues can be mitigated before they become cascading failures.”

The second thing Google's promised to do better in future is explain itself.

As the incident report says, “... we have received feedback that our communications during the outage left a lot to be desired. We agree with this feedback.”

“While our engineering teams launched an all-hands-on-deck to resolve this issue within minutes of its detection, we did not adequately communicate both the level-of-effort and the steady progress of diagnosis, triage and restoration happening during the incident. We clearly erred in not communicating promptly, crisply and transparently to affected customers during this incident.”

“We will be addressing our communications — for all Google Cloud systems, not just BigQuery — as part of a separate effort, which has already been launched.”

Google's being a little hard on itself here, as its incident reports are comfortably more detailed than those of its cloudy rivals. But on this occasion it appears users wanted even more.

The ad giant's decision to improve its crash-time comms surely bespeaks its cloudy aspirations: Amazon Web Services and Microsoft's Azure are considered the leaders of the cloud market, with Google and IBM scrapping for third place ahead of fast-growing contenders like OVH and DigitalOcean. And those rivals can all point to Google breaking its own cloud, sometimes with careless errors. ®
