Atlassian comes clean on what data-deleting script behind outage actually did
Day 10 of ongoing disruption: Have some sympathy for admins, it's a difficult read
Who, Us? Atlassian has published an account of what went wrong at the company to make the data of 400 customers vanish in a puff of cloudy vapor. And goodness, it makes for knuckle-chewing reading.
The restoration of customer data is still ongoing.
Atlassian CTO Sri Viswanath wrote that approximately 45 percent of those afflicted had had service restored, but repeated the fortnight estimate the company gave earlier this week for undoing the damage to the rest of the affected customers. At the time of writing, the proportion of customers with restored data had risen to 49 percent.
As for what actually happened… well, strap in. And no, you aren't reading another episode in our Who, Me? series of columns where readers confess to massive IT errors.
"One of our standalone apps for Jira Service Management and Jira Software, called 'Insight – Asset Management,' was fully integrated into our products as native functionality," explained Viswanath. "Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed."
Two bad things then happened. First, rather than providing the IDs of the app marked for deletion, the team making the deactivation request provided the IDs of the entire cloud site where the apps were to be deactivated.
The team doing the deactivation then took that incorrect list of IDs and ran the script that did the 'mark for deletion magic.' Except that script had another mode, one that would permanently delete data for compliance reasons.
You can probably see where this is going. "The script was executed with the wrong execution mode and the wrong list of IDs," said Viswanath, with commendable honesty. "The result was that sites for approximately 400 customers were improperly deleted."
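For the curious, here is a minimal sketch of the sort of guardrail that would refuse both slip-ups before anything is touched. It is purely illustrative: Atlassian has not published its script, and the names Mode, Target, and plan_deactivation, along with the "app" and "site" labels, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MARK_FOR_DELETION = "mark"      # recoverable: flag the item for later removal
    PERMANENT_DELETE = "permanent"  # irreversible: the compliance-style wipe


@dataclass(frozen=True)
class Target:
    kind: str   # hypothetical labels: "app" or "site"
    ident: str  # the ID handed over by the requesting team


def plan_deactivation(targets, mode, expected_kind="app", allow_permanent=False):
    """Return the planned (mode, id) actions, or raise before doing anything."""
    # Guard 1: reject the whole batch if any ID is of the wrong kind,
    # e.g. site IDs supplied where app IDs were expected.
    wrong_kind = [t for t in targets if t.kind != expected_kind]
    if wrong_kind:
        raise ValueError(
            f"{len(wrong_kind)} target(s) are not of kind {expected_kind!r}"
        )
    # Guard 2: the irreversible mode must be requested explicitly.
    if mode is Mode.PERMANENT_DELETE and not allow_permanent:
        raise PermissionError("permanent deletion requires explicit confirmation")
    return [(mode.value, t.ident) for t in targets]
```

Under these assumptions, a list of site IDs or an unconfirmed permanent-delete request fails loudly at the planning stage rather than at the point of no return.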
The good news is that there are backups, and Atlassian retains them for 30 days. The bad news is that while the company can restore all customers into a new environment or roll back individual customers that accidentally delete their own data, there is no automated system to restore "a large subset" of customers into an existing environment, meaning data has to be laboriously pieced together.
The company is moving to a more automated process to speed things up, but is currently restoring customers in batches of up to 60 tenants at a time, with four to five days required end-to-end before a site can be handed back to a customer.
"We know that incidents like this can erode trust," understated Viswanath.
Viswanath's missive did not mention compensation for businesses suffering a lengthy outage other than stating he and his team were committed to "doing what we can to make this right for you."
The Register contacted the company to clarify what this includes and will update should Atlassian respond.
Many other companies are not this transparent, especially while a problem is still ongoing, so it's commendable to get a proper explanation. ®