Facebook warehousing 180 PETABYTES of data a year

The Social Network open-sources ‘Corona’ tool used to manage the deluge


Facebook’s data warehouses grow by “Over half a petabyte … every 24 hours”, according to an explanatory note The Social Network’s Engineering team has issued to explain a new release of open source code.
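As a quick check on the headline figure, the quoted daily growth can be annualised. This is a back-of-the-envelope sketch that assumes a flat rate of exactly half a petabyte a day over a 365-day year; the note says "over half a petabyte", so the result is a lower bound, and the constant names are illustrative only.

```python
# Assumption: a flat 0.5 PB/day, the lower bound implied by "over half a petabyte".
PB_PER_DAY = 0.5
DAYS_PER_YEAR = 365

pb_per_year = PB_PER_DAY * DAYS_PER_YEAR
print(f"at least {pb_per_year} PB a year")  # at least 182.5 PB a year
```

That works out to a little over the 180 petabytes quoted in the headline.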

The note says "ad-hoc queries, data pipelines, and custom MapReduce jobs process this raw data around the clock to generate more meaningful features and aggregations."

But vanilla-flavoured Apache Hadoop can't do that job, so Facebook has created the code in question, dubbed Corona, to extend the big data darling's capabilities so that it can manage the deluge of data The Social Network collects each day.

The note explains “We initially employed the MapReduce implementation from Apache Hadoop as the foundation of this infrastructure, and that served us well for several years. But by early 2011, we started reaching the limits of that system.”

Those limits saw compute clusters clogged, due to scheduling issues with MapReduce, while resource management struggled to meet Facebook’s enormous demands.

Facebook characterises MapReduce, Hadoop-style, with the following illustration.

Facebook's depiction of Hadoop at work

Corona, by contrast, offers the configuration depicted below.

Facebook's Corona tool

Facebook says Corona rocks for the following reasons:

“Corona introduces a cluster manager whose only purpose is to track the nodes in the cluster and the amount of free resources. A dedicated job tracker is created for each job, and can run either in the same process as the client (for small jobs) or as a separate process in the cluster (for large jobs). One major difference from our previous Hadoop MapReduce implementation is that Corona uses push-based, rather than pull-based, scheduling. After the cluster manager receives resource requests from the job tracker, it pushes the resource grants back to the job tracker. Also, once the job tracker gets resource grants, it creates tasks and then pushes these tasks to the task trackers for running. There is no periodic heartbeat involved in this scheduling, so the scheduling latency is minimized.”
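The push-based flow Facebook describes can be illustrated with a toy model. The sketch below is not Corona's code or API: the class and method names (`ClusterManager.request`, `JobTracker.on_grant`, `TaskTracker.run`) are hypothetical, slots are never released, and everything runs in a single process. It only shows the shape of the idea: the cluster manager tracks free slots, pushes grants to a per-job tracker, and that tracker pushes tasks to workers with no heartbeat polling.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """A worker node (hypothetical model): runs whatever is pushed to it."""
    name: str
    ran: list = field(default_factory=list)

    def run(self, task: str) -> None:
        self.ran.append(task)


class JobTracker:
    """One per job: holds the job's pending tasks and reacts to grants."""

    def __init__(self, job_id: str, tasks: list):
        self.job_id = job_id
        self.pending = deque(tasks)

    def on_grant(self, node: TaskTracker) -> None:
        # The grant is pushed to us; we push a task straight on to the node.
        if self.pending:
            node.run(f"{self.job_id}:{self.pending.popleft()}")


class ClusterManager:
    """Tracks nodes and free slots only; knows nothing about job progress."""

    def __init__(self, nodes: list, slots_per_node: int):
        # One queue entry per free slot, e.g. [a, a, b, b] for two slots each.
        self.free = deque(n for n in nodes for _ in range(slots_per_node))

    def request(self, job_tracker: JobTracker, count: int) -> int:
        """Push up to `count` grants to the job tracker; return how many."""
        granted = 0
        while granted < count and self.free:
            job_tracker.on_grant(self.free.popleft())
            granted += 1
        return granted


nodes = [TaskTracker("node-a"), TaskTracker("node-b")]
manager = ClusterManager(nodes, slots_per_node=2)
job = JobTracker("job-1", ["map-0", "map-1", "map-2"])
granted = manager.request(job, count=3)
print(granted, nodes[0].ran, nodes[1].ran)
```

In a pull-based design the workers would instead ask for work on a periodic heartbeat, so a free slot could sit idle until the next heartbeat arrived. Here each grant triggers the next step immediately, which is the latency saving the note describes.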

The post also details how Facebook introduced the new tool and, along the way, gives some insight into the scale of the company’s infrastructure: the rollout started with a modestly-sized cluster of 500 nodes, to “get feedback from early adopters.”

A 1000-node trial yielded the first scaling problem, before the tool was introduced to all of the company’s servers.

The company has now made Corona available on GitHub. By doing so it has played by the right open source rules, given that the Engineering note suggests the company believes Corona will be a crucial tool “for years to come”.

Given the note says Facebook’s data warehouse “has grown by 2500x in the past four years”, Corona looks to have serious data-handling grunt. And that’s just the warehouse: how much other data Facebook holds is not disclosed. Nor is just what Corona will deliver, in terms of products or data analysis.
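To put the 2500x figure in perspective, it can be converted to an average yearly multiplier. The sketch below assumes steady compound growth across the four years, which the note does not claim, so treat the result as an average rather than a description of any single year.

```python
# Assumption: growth compounded evenly over the four years quoted in the note.
GROWTH_FACTOR = 2500
YEARS = 4

annual_factor = GROWTH_FACTOR ** (1 / YEARS)
print(f"about {annual_factor:.1f}x per year")  # about 7.1x per year
```

On that assumption the warehouse grew roughly sevenfold every year.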

It may therefore be sensible, if one were to relax and partake of Corona’s namesake beverage, to admire the technical achievements described here, but to reserve judgement on what they may enable. ®

Biting the hand that feeds IT © 1998–2022