SourceForge staggers to feet after lengthy STORAGE FAIL outage
Warning: Dev services STILL COMATOSE until mid-week
Popular open source code-hosting repository SourceForge has been battling a significant outage for days and is sluggishly recovering from a lengthy Total Inability To Support Usual Performance (TITSUP) drama.
On Thursday, the site slipped into "disaster recovery mode" and since then it has been a torrid time for sysadmins scrabbling to restore the GitHub-like portal.
It shares infrastructure with Slashdot, which was also affected by the downtime. In a statement, SourceForge reiterated that the sites had gone titsup due to a storage fault:
The Slashdot Media sites experienced an outage commencing last Thursday. We responded immediately and confirmed the issue was related to filesystem corruption on our storage platform.
This incident impacted all block devices on our Ceph cluster. We consulted with our storage vendor when forming our next steps. We have since been working 24×7 on data restoration, data validation, and service recovery.
Our response to date has been methodical and focused on safe restoration of data and service.
To enable this response we split our team in half, with one portion of the team working to expedite service restoration, and one portion of the team working on data validation and restoration.
While most of the site's projects have now been restored, SourceForge has yet to bring its developer data fully back to life.
"#SourceForge directory, download and project summary pages are back online; dev services (SCM, uploads, ML's, project web) pending restoral" (SF.net Operations, @sfnet_ops, July 18, 2015)
"We’ll be bringing services back online as the validation of backing data is completed, and anticipate bringing additional services online through mid-week," SourceForge added.
"The data involved in our developer services is among the largest we house, and it takes time to perform filesystem checks and to restore data from backups. Using separate mounts, both steps are occurring concurrently to minimise the timetable for restore."