Comment Platform-as-a-service Heroku spent two years misleading customers because it was so focused on building a new product that it didn't bother to update old documentation – this is unacceptable.
The Salesforce-owned company admitted on Wednesday that customers using its 'Bamboo" application automation technology could have spent two years paying for a service that used different tech to that found in Heroku's docs, and that this tech entailed poor app performance that was nearly impossible for both Heroku and Heroku analytics partner New Relic to detect.
An equivalent situation would be a logistics company announcing it had not informed customers that it had shifted from using a central dispatcher to direct its trucks, to a system where each driver had a list of all packages and had to work out the delivery order with other drivers with no central command. Moreover, that this change meant packages were not getting delivered on time and that it had no way of informing customers of the tardiness of their packages.
Here's what Heroku said on Wednesday in an FAQ document responding to the customer concerns* provoked by its undocumented shift from intelligent routing to random routing:
Since early 2011, high-volume Rails apps that run on Heroku and use single-threaded web servers sometimes experienced severe tail latencies and poor utilization of web backends (dynos). Lack of visibility into app performance, including incorrect queue time reporting prior to the New Relic update in February 2013, made diagnosing these latencies (by customers, and even by Heroku’s own support team) very difficult.
Q. Were the docs wrong?
A. Yes, for Bamboo. They were correct when written, but fell out of date starting in early 2011. Until February 2013, the documentation described the Bamboo router only sending one connection at a time to any given web dyno.
Q. Why didn’t you update Bamboo docs in 2011?
A. At the time, our entire product and engineering team was focused on our new product, Cedar. Being so focused on the future meant that we slipped on stewardship of our existing product.
Cloud computing lives or dies on the trust a consumer can have in its provider, and for a company to make a mistake of this magnitude is unacceptable. Yes, documentation errors happen, but the negative effects they can have on customers can be much more pernicious in a service provider environment.
"Yes, we clearly messed up here, and we're paying for it in the form of community wrath and upset customers," Heroku CTO Adam Wiggins told us via email. "We're doing what we can to set it right with the community (full disclosure and open communication), and with our customers (working directly with them to get their app's performance to a good place, or anything else they want)."
Apologizing to affected customers and helping them move their applications to the new software systems is all well and good, but it doesn't deal with the lasting damage this sort of thing does to the perception of cloud computing.
When developers provision services from a cloud provider, they are outsourcing a logistics and/or business-to-business function. The proposition is that they no longer have to care (much) about the technology the provider uses to do the service, and they get to spend more time doing whatever it is their product derives value from.
In exchange the provider delivers the agreed upon service in a timely manner, while using its domain expertise to either improve performance or reduce prices.
In Heroku's case, the company offered a service early on in its life that it found it could not operate as it grew.** It didn't tell customers this and in fact moved to a less sophisticated technology (random routing) to deal with the influx of customers that came to it for its cloud services capabilities.
By doing this it misled customers and even exposed some of them to degraded application performance. For some, Heroku's cloud got worse over time, rather than better. It's like Intel producing 'reverse-Moore's Law' chips with capabilities whose efficiency halved and pricing doubled every 18 months for a certain subset of customers without telling them.
Cloud computing is meant to bring about greater efficiency through the pooling of resources and expertise. When it works, it's a distillation of the finer features of capitalism. When it doesn't work - as in this case - it makes fools of its consumers and damages other companies in its space.
If Heroku (founded: 2007), got so carried away with new products that it made a significant change in 2011 and then forgot for two years to tell anyone until a customer wrote a blogpost lambasting the company, then wouldn't it be fair to surmise that other companies could have done the same thing?
Heroku has made changes to make sure that it can never make a foul-up of this magnitude again. "Nowadays we have a whole team devoted to documentation via our Dev Center, and we treat docs as a first-class citizen," Wiggins said.
Trust is a valuable commodity, and all companies would do well to avoid cheapening it through sloppiness paired with brash ambition. ®
* The routing change was highlighted by disgruntled Heroku customer Rap Genius in February.
** There's nothing wrong with changing technologies when you run into unforeseen problems. Heroku is not alone in finding routing to be a difficult technology to scale - cloud giant Amazon also tried to shift from random routing to a more intelligent routing method in the mid-2000s but ran into huge problems and eventually reverted to using standard Cisco gear, according to a post on popular developer hangout Hacker News. The problem is not telling customers that you've made the change.