BBC techies have no idea why the load on its database "went through the roof" last weekend, when Auntie was struck by a huge, two-pronged outage that caused its iPlayer service and website to go titsup.
During the downtime, the Beeb was pretty reticent on social media about what had gone wrong, preferring instead to simply post occasional tweets apologising for the disruption and promising to restore access soon.
The silence infuriated some iPlayer fans who demanded more information about when the system would be fixed.
On Tuesday, the BBC's digital distribution controller, Richard Cooper, tried to explain why the Corporation's popular catch-up TV and radio player and its main website had frozen Brits out of accessing the services.
He said in a blog post that the cause of the outage remained a bit of mystery.
@kellyfiveash Basically they still don't know why the database load went through the roof. Worrying.— Ken Tindell (@kentindell) July 23, 2014
The BBC has a system made up of 58 application servers and 10 database servers providing programme and clip metadata, Cooper said.
"This data powers various BBC iPlayer applications for the devices that we support (which is over 1200 and counting) as well as modules of programme information and clips on many sites across BBC Online," he added. "This system is split across two data centres in a "hot-hot" configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres."
He said that the "load on the database went through the roof" on Saturday morning (19 July), at which point requests for metadata to the application servers started to drop off.
The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.
At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.
Our first priority was to restore the caching layer. The failure was a complex one (we’re still doing the forensics on it), and it has repeated a number of times. It was this failure that resulted in us switching the homepage to its emergency mode (“Due to technical problems, we are displaying a simplified version of the BBC Homepage”). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.
The Beeb's techies struggled to restore the metadata service and Cooper added that isolating the source of the additional load on its database had proved to be "far from straightforward". He confessed that "restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems)."
Cooper said that the system remained wobbly throughout the weekend, with the BBC deciding not to further disrupt its service until Monday when fewer people would be accessing the iPlayer.
It finally returned the iPlayer and BBC Online to normal service more than 48 hours after cracks in the system appeared.
Cooper admitted that viewers and listeners may have missed out on certain programmes as a result of the tech blunder.
"I’m afraid we can’t simply turn back the clock, and as such the availability for you to watch some programmes in the normal seven day catch-up window was reduced," he said.
Meanwhile, the Beeb is yet to determine exactly what went wrong.
The timing of the outage came just days after the BBC's Internet Blog ran a post from its senior product manager Kiran Patel.
He celebrated the fact that it had been nearly a year since the Corporation announced that its internal Video Factory product - which "moved live processing into the cloud" - was taking over the production of all vid content for the iPlayer.
Ominously, he said: "I cannot promise we will be this fast all the time. There are times when things go wrong and delivery can be delayed. We have built Video Factory with resilience, as its primary goal. So problems may delay delivery, but we ensure we never miss any content."
But it would seem that last weekend's caching and database failures scuppered any chance of viewers and listeners catching up on some of their favourite programmes. ®