A failed storage controller caused a protracted outage at hosted desktop and cloud slinger Vesk - not that this factoid has made its way onto the company’s website, where it boasts of 100 per cent uptime for the past 1,583 days.
Vesk, acquired by London-listed Nasstar plc in October, has written to customers in a bid to explain service problems that first showed up at lunchtime on 26 August and lasted into the wee hours of the 27th.
“We suffered from a system failure which resulted in loss of access to emails and certain dedicated instances hosted on the same platform,” the company stated in a letter to customers, seen by El Reg.
The monitoring platform noted a rise in server resource consumption and began troubleshooting. The infrastructure team were then alerted to a “storage controller failure” as users reported Outlook and email wobbles.
Specifically, a failed hard disk in the Storage Area Network caused a “panic event” on the primary controller that triggered a failover between two storage controllers.
The storage fail led to a “split brain event” and subsequent “levels of corruption within each virtual desk as they were being served by independent controllers,” Vesk said in the letter.
To repair the corruption, the platform was taken down at the end of normal working hours. “There were however, too many clusters of corrupted bad blocks to repair, and the timeframes indicated that the process would take many days to complete,” the letter said.
The decision was taken at 11pm to invoke the DR plan and “failover to a secondary data centre site”.
All affected services were dragged online again early the next morning, though some of Vesk’s Exchange and dedicated SharePoint databases had “failed to start”.
“As a result of our documented disaster recovery plans and procedures, [we] were able to keep the downtime to a minimum for the majority of the affected environments,” the company claimed.
Vesk said it is reviewing the configuration applied on the storage controllers to figure out how to introduce further fail-safes, and is arranging a plan to switch back to the primary DC from the DR platform.
On its website, Vesk claimed it had had 100 per cent uptime for all of 2012, ’13, ’14, ’15 and even 2016, despite us still having a full quarter of the year to go. ®