Scalr hosting hit with outage
Server records deletion causes website woes
Cloud management biz Scalr.com was yesterday hit by an outage which knocked customer websites offline.
In a post, the company said the problem had been caused by an update containing flawed logic, which deleted server records from Scalr's systems while leaving the underlying AWS and other cloud servers untouched.
One customer got in touch this morning to complain the company had yet to issue an apology or discuss compensation.
He said: "Yesterday, one of our customer websites stopped working as the database server was no longer contactable.
"It turns out that Scalr [had] accidentally removed all their DNS entries, resulting in another server being spawned for us in AWS, without our knowledge, even though there was no issue with the existing server.
"This obviously cost us, not just [in] lost revenue and downtime, but [also because] we were then paying for a second server we did not need," he claimed.
The Register has contacted Scalr for comment.
Other customers complained on Twitter:
A great big thank you to @scalr who tried to terminate all our servers yesterday. Great job!— Oliver Sale (@oliversale) July 19, 2016
Thank you @scalr for attempting to terminate all our servers. What a mess!— MailBigFile (@mailbigfile) July 18, 2016
Updated at 11:32 UTC, 19 July to add: Sebastian Stadil, Scalr's CEO, said the company had apologised privately to all its customers and intends to post a further statement on its website.
He said: "Users have every right to be angry and voice their frustration. I know I've been frustrated when services I use are down, so I can sympathise."
He added: "Operations professionals know that outages come with the territory. In fact, the 'blameless' devops culture comes from this: learn from mistakes, improve, share your experience. Tomorrow's statement will do just that. Explain what happened, how we handled it, what we're doing to improve, and what others can learn from this experience.
"Lastly, I want to point out that everyone in the team behaved admirably yesterday, and some worked 30 hours straight to get this resolved for every customer, with the first resolutions within three hours of detection. I myself have been up for 22 hours and am about to get some rest."