Tumblr tumbles down, stays there for 24 hours
Maintenance was planned, cluster f*ck less so
Tumblr has blamed a database problem for an outage that left the popular microblogging service largely unavailable for more than 24 hours.
The service was restored on Monday afternoon, allowing users to resume their postings of pictures of cats and musings on life, as is the local custom.
In an update, Tumblr founder David Karp apologised for the outage, which he blamed on a technical glitch.
"Yesterday afternoon, during planned maintenance that was not intended to interrupt service, an issue arose that took down a critical database cluster. This brought down our entire network while our engineers worked feverishly to restore these databases and bring your blogs back online."
Karp admitted the Sunday/Monday outage was just the most serious glitch in a larger series of service problems that the microblogging platform has experienced of late. He said that the site had quadrupled its engineering team and was in the process of rolling out a more distributed architecture in a bid to make the service more robust.
Website availability problems in the absence of a denial of service attack can usually be traced back to one of three problems or some combination thereof: insufficient bandwidth, poor code or not enough server horsepower to cope with demand. Failure to run a distributed system with well-designed failover and backup invites trouble, especially for high-demand sites, as Tumblr discovered this week.
Last month, 4chan and Tumblr users engaged in an online spat that escalated to involve denial of service attacks against both sides. Tumblr came off worse in the rumpus, which was apparently triggered by accusations that Tumblr users were stealing jokes from 4chan without crediting the source. ®