Fail and You Last week, users of Google App Engine - Google's application hosting platform - discovered a new feature in the product: downtime. App Engine was offline for roughly six hours, and for much of that time, even the status page which tells users about downtime was unavailable. Now that's a strong way to send a message.
As a reminder: Google App Engine is Google's response to Amazon Web Services. Amazon has set up a scheme where customers can have full access to virtual computers and can also pay for scalability-included services like S3 for storage and a messaging system. It's a fair balance between automatic scaling and control. Google, on the other hand, offers developers a Python and Java API to its database back-end and absolutely zero control over the machines on which your application is running.
App Engine developers must go through the effort to contort their program to Google's data storage mechanism, which in some cases can be a far cry from SQL. The benefit to this is that you don't have to worry about scalability, ever. Allegedly. It's sort of like how a heroin addiction means that you don't have to worry about reality, ever.
As with anything that flies through a cloud, Google App Engine can suffer a double flame-out and crash to the ground, killing hundreds and swearing a large subset of the population off of air travel for quite some time. Google has paying customers for App Engine, and maybe Wonka doesn't quite understand this, but when people pay you for a service, they expect a certain amount of transparency and honesty.
Watching Google's response to the App Engine downtime reminded me of the cruel 2008 US Vice Presidential debates, where everyone watching just wanted to pull Sarah Palin aside and say "Sweetie, this is a grown-up event. You need to use your big-girl words now." Google's explanation for six hours of downtime was basically, "Shit got ill."
The meat of Google's postmortem on the failure was this, a message posted to the App Engine e-mail group: "There was a serious issue in one of App Engine's datacenters with GFS, Google's low level storage system. GFS underlies Bigtable, which in turn underlies App Engine's Datastore. GFS also provides storage for our application serving infrastructure, so GFS unavailability caused problems for Datastore reads and writes, as well as application serving."
Let's say that you were tasked with maintaining the computing platform for your company's web services. After six hours of service outage, your supervisor asked you for an explanation of what happened, and you follow Google's lead. You say, "There was a serious issue with one or more of our computers." Ass, meet curb.
Almost a year ago, Amazon's S3 storage service suffered roughly eight hours of downtime. Amazon's postmortem on the failure included details about specific bugs in their message passing system, and how their wonderfully scalable system could also scale errors quite wonderfully. Amazon identified an oversight in their own code with respect to error checking, so as a customer, you could be sure that somebody is on that shit.
Google, on the other hand, doesn't feel like telling anyone exactly why GFS failed. Was it a bug in the code? Was it a traffic issue? Did Augustus Gloop fall into the chocolate river?
As far as preventing future such failures, Google is equally as tight-lipped. Their postmortem only says this: "The team has been actively working on a solution in the medium-term that would allow us to switchover data centers immediately without consistency problems."
Um, fantastic. When will that be deployed? Why specifically could you not fail over in this case? Do you realize that you're being paid for this? By comparison, Amazon outlined four changes they made to their system, in both code and monitoring, to prevent that type of failure from happening again.
One could argue over the necessity of up-time for the types of apps that appear on App Engine. After all, the world only needs so many RSS readers and Twitter clones. But this highlights the greater risk of hosting your applications in the - and oh it pains me to say this, but for the sake of brevity - cloud. Every time there is downtime like this, be it Google App Engine or Amazon Web Services or Microsoft Azure, tech pundits all tell us that it's not ready for prime time. What the fuck does that even mean? My guess: "My editor wanted me to cover this story and I lack the originality to make any meaningful contribution."
I'll go out on a limb here. Hosting production services on platforms like App Engine is never a good idea. It may be fine for some toy application or a web service that you never plan to make any money from, but when your livelihood depends on it, will you really trust the business to a company whose failure response is the technical version of "whoooa, sorry bro, my bad?" Clearly, you can draw a line when it comes to outsourcing. But for serious business, if you can't put your hands on the metal - or order someone else to put their hands on the metal - then you're due for an embarrassment. And it's nobody's fault but your own.
Google's sell really appeals to the engineers, but I hope that the decision makers can see through the bullshit. Automatic scalability? Really? Or did you guys drop too much capital expenditure on machines and have to come up with a way to make a return on that investment? Maybe it's less of a tell than I think it is, but the App Engine main product page has a prominent link to the terms of service at the top, and no link or contact information for support. Google's introverted population certainly knows that it's easier and cheaper to legalese your way out of a customer's problem than it is to hire a person to pick up the phone.
Man, supporting real people is way harder than selling ads. ®
Ted Dziuba is a co-founder at Milo.com You can read his regular Reg column, Fail and You, every other Monday.