Google Cloud (over)Run: How a free trial experiment ended with a $72,000 bill overnight
Billing budget? Free plan? All useless when buggy code went into overdrive
Sudeep Chauhan, founder of startup Milkie Way, suffered a bad case of bill shock when a test with a $7.00 billing budget and a free database plan on Google Cloud Platform (GCP) generated a $72,000 invoice overnight.
"I jumped out of the bed, logged into Google Cloud Billing, and saw a bill for ~$5,000," Chauhan wrote on his company's blog. "Super stressed, and not sure what happened, I clicked around, trying to figure out what was happening. I also started thinking of what may have happened, and how we could possibly pay the $5K bill. The problem was, every minute the bill kept going up. After two hours, it settled at a little short of $72,000."
It was especially surprising that it happened to Chauhan, who is ex-Google and even spent two years as a payments technical program manager. What happened?
The idea was to build a system that scraped web pages and stored the results in a database. His team picked Google Cloud Run, a GCP service that runs containers, for the job. They then found that the code in each instance would time out and stop as it scraped one page after another. So they set up a many-instance system that processed pages in parallel, so that each page was fetched and stored within the run-time limit.
Chauhan wrote: "To overcome the timeout limitation, I suggested using POST requests (with URL as data) to send jobs to an instance, and [to] use multiple instances in parallel instead of using one instance serially. Because each instance in Cloud Run would only be scraping one page, it would never time out, process all pages in parallel (scale), and also be highly optimized because Cloud Run usage is accurate to milliseconds."
The ex-Googler reflected that he missed the possibility of pages that link back to each other, causing "infinite recursion." It should not have mattered too much, though: he set a billing budget of $7.00 and had a Firebase database on a free plan. "The worst case we imagined was exceeding the daily free Firestore limits," he said. Further, the credit card for the account had a spending limit of $100.
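The link-loop problem described above is usually avoided by remembering which URLs have already been queued and by capping the total amount of work. The following is a minimal single-process sketch of that guard, not Milkie Way's code: the `crawl` function, the `get_links` callback, the `max_pages` cap and the `LINKS` graph are all hypothetical names introduced for illustration. In a fan-out design like the one Chauhan describes, the "seen" set would have to live in shared storage rather than in one process's memory.

```python
from collections import deque
from urllib.parse import urldefrag


def crawl(start_url, get_links, max_pages=100):
    """Visit each URL at most once and stop after max_pages pages."""
    start = urldefrag(start_url).url  # drop any #fragment so duplicates match
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:  # hard cap on total work
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            link = urldefrag(link).url
            if link not in seen:  # the guard that breaks link cycles
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph in which pages a and b link back to each other.
LINKS = {
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/a", "https://example.com/c"],
    "https://example.com/c": [],
}

order = crawl("https://example.com/a", lambda url: LINKS.get(url, []))
```

Even if the deduplication were to miss something, for example a site that generates endless distinct URLs, the `max_pages` cap still bounds the run.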
Unfortunately, a billing budget "does not automatically cap Google Cloud or Google Maps Platform usage/spending," according to the docs.
While Chauhan was asleep after a day of testing, Google sent an automated email informing him that his free Firebase plan had been "upgraded due to activity in Google Cloud," and that this "initiated billing" for the project.
He discovered multiple issues with the GCP cost controls. "Billing takes about a day to be synced, and that's why we noticed the charges the next day," Chauhan said. Next, the "Firebase Dashboard took more than 24 hours to update," he said. This meant that the dashboard showed usage within the daily limit, when it was, he said, "86 million percentage points" more than what was shown.
The GCP Cloud Run defaults also played their part. "The max-instances is preset to 1,000, and concurrency set to 80," he said. If he had corrected this to small values like 2 and 1, the bill shock would not have occurred.
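To see why those two settings matter, note that the ceiling on requests being handled at the same moment is roughly the instance limit multiplied by the per-instance concurrency. A small sketch of that arithmetic, using the values quoted above (the function name is hypothetical):

```python
def max_in_flight_requests(max_instances, concurrency):
    """Upper bound on requests a service can be processing at once."""
    return max_instances * concurrency


# Values Chauhan quotes as the presets, versus the small values suggested above.
with_defaults = max_in_flight_requests(max_instances=1000, concurrency=80)
with_small_limits = max_in_flight_requests(max_instances=2, concurrency=1)
```

With the quoted presets, the ceiling is 80,000 simultaneous requests; with limits of 2 and 1 it is two.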
Thanks to these settings, "running [out] this version of Hello World deployment on Cloud Run made 116 billion reads and 33 million writes to Firestore," said Chauhan.
Most of the cost was down to Firebase read operations, even at just $0.06 per 100,000. At that rate, 116 billion reads comes to $69,600. There was also the small matter of 16,000 hours of Cloud Run compute time, partly because the application did not delete the services but left them "in background process".
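The read charge can be checked with a few lines of arithmetic. This sketch takes the article's figures at face value ($0.06 per 100,000 reads, 116 billion reads) and, as a simplifying assumption, ignores any free daily quota; the constant and function names are illustrative only.

```python
from decimal import Decimal

PRICE_PER_BLOCK_USD = Decimal("0.06")  # price quoted in the article
READS_PER_BLOCK = 100_000              # reads covered by that price


def read_cost_usd(reads):
    """Cost of a number of reads at the quoted rate, free quota ignored."""
    return PRICE_PER_BLOCK_USD * reads / READS_PER_BLOCK


total = read_cost_usd(116_000_000_000)  # 116 billion reads
```

`Decimal` is used rather than floating point so the result is exactly 69,600, matching the figure in the text.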
The performance of the buggy code was impressive in its way. "At the peak, Firebase was able to handle about one billion reads per minute," he said, while Cloud Run with concurrency "can handle 9 million requests per minute".
"Fail fast, learn fast with cloud is a bad idea," Chauhan concluded. "If you count the number of pages in GCP documentation, it's probably more than pages in [a] few novels. Understanding pricing, usage, is not only time consuming, but requires a deep understanding of how cloud services work."
There is a happy ending. "After going through our lengthy doc on this incident sharing our side of the story, various consults, talks, and internal discussions, Google let go of our bill as a one-time gesture," said Chauhan.
Such leniency cannot be relied upon. Auto-scaling and on-demand computing have downsides, and working out what something will cost is challenging. Caution is advised. ®