Facebook tops up Apache Project graph database with fresh code

'You know what's cooler than a billion edges? A trillion edges!'


Facebook has shoved code back into the trunk branch of Giraph, an open source graph-processing Apache project that mimics Google's advanced "Pregel" system.

The upgrades let Giraph process graphs with trillions of edges – the connections between entities in a graph database – and were announced in a company blog post on Wednesday. In it, engineers explained why they chose to bring Giraph into the social network's software ecosystem, and how they modified it to handle larger graphs using less memory.

Giraph is an implementation of Google's Pregel graph-processing system, which the Chocolate Factory built to let it mine its vast array of datapoints and spot valuable interconnections. The company published information on Pregel in June 2009.

Facebook uses Giraph to help it analyze its massive social network, and decided to upgrade its technology in the summer of 2012. By analyzing the data contained in the connections between its users, brands, and groups, Facebook can almost certainly develop better tools to offer its advertisers.

"Analyzing these real world graphs at the scale of hundreds of billions or even a trillion (10^12) edges with available software was impossible last year. We needed a programming framework to express a wide range of graph algorithms in a simple way and scale them to massive datasets. After the improvements described in this article, Apache Giraph provided the solution to our requirements," the engineers wrote.

The company evaluated Apache Hive, GraphLab, and Apache Giraph, and plumped for Giraph because it runs as a MapReduce job and is written in Java, so it can interface well with Facebook's Java stack.

The main contribution Facebook made to the technology was the implementation of multi-threading, which improves the performance of Giraph.

"When Giraph takes all the task slots on a machine in a homogenous cluster, it can mitigate issues of different resource availabilities for different workers (slowest worker problem)," the company wrote. "For these reasons, we added multithreading to loading the graph, computation (GIRAPH-374), and storing the computed results (GIRAPH-615)."

With multithreading in place, the company has seen linear speed-up in some CPU-bound applications.
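To make the idea concrete, here is a minimal sketch of per-partition work spread across a fixed thread pool. It is written in plain Java, not against Giraph's own classes: the names `PartitionedCompute`, `computeAll`, and `sumOf` are hypothetical, and summing an array stands in for whatever per-vertex computation a real job would run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; these names are not part of Giraph's API.
class PartitionedCompute {

    // Stand-in for CPU-bound per-vertex work on one partition.
    static long sumOf(long[] partition) {
        long sum = 0;
        for (long value : partition) {
            sum += value;
        }
        return sum;
    }

    // Submits one task per partition to a fixed-size pool and returns
    // the per-partition results in partition order.
    static long[] computeAll(List<long[]> partitions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (long[] partition : partitions) {
                Callable<Long> task = () -> sumOf(partition);
                futures.add(pool.submit(task));
            }
            long[] results = new long[partitions.size()];
            for (int i = 0; i < results.length; i++) {
                results[i] = futures.get(i).get(); // blocks until partition i is done
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partition computation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<long[]> partitions = Arrays.asList(
                new long[] {1, 2, 3},
                new long[] {10, 20},
                new long[] {100});
        System.out.println(Arrays.toString(computeAll(partitions, 2)));
    }
}
```

Because the partitions share no mutable state in this sketch, adding threads adds throughput until the cores run out, which is the kind of workload where linear speed-up is plausible.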

The company has also reduced the overall memory footprint of the system, which in earlier iterations was a "memory behemoth".

It achieves this by serializing vertices into a byte array rather than a Java object, and serializing messages on the server. By doing this the company also gained a predictable memory model for vertices, which lets it better estimate the resources the tech consumes.
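A rough sketch of the byte-array approach follows, assuming each edge is stored as an 8-byte target ID plus a 4-byte weight. That layout and the names `ByteArrayEdges`, `serialize`, and `sumWeights` are assumptions made for illustration, not Giraph's actual classes or format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration; the layout and names are assumptions,
// not Giraph's actual classes or wire format.
class ByteArrayEdges {

    // 8 bytes for the target vertex ID plus 4 bytes for the edge weight.
    static final int BYTES_PER_EDGE = Long.BYTES + Float.BYTES;

    // Packs parallel arrays of edge targets and weights into one byte array.
    static byte[] serialize(long[] targets, float[] weights) {
        ByteArrayOutputStream bytes =
                new ByteArrayOutputStream(targets.length * BYTES_PER_EDGE);
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < targets.length; i++) {
                out.writeLong(targets[i]);
                out.writeFloat(weights[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Streams over the packed edges and returns the sum of their weights,
    // without creating one object per edge.
    static float sumWeights(byte[] edges) {
        float total = 0f;
        int count = edges.length / BYTES_PER_EDGE;
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(edges))) {
            for (int i = 0; i < count; i++) {
                in.readLong(); // target ID, unused in this example
                total += in.readFloat();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] packed = serialize(new long[] {7L, 9L}, new float[] {0.5f, 1.5f});
        System.out.println(packed.length + " bytes, weight sum " + sumWeights(packed));
    }
}
```

The point of the design is that the array's size is exactly the edge count times a fixed width, with no per-object headers or references, which is what makes the memory use predictable.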

"Given that there are typically many more edges than vertices, we can roughly estimate the required memory usage for loading the graph based entirely on the edges. We simply count the number of bytes per edge, multiply by the total number of edges in the graph, and then multiply by around 1.5x to take into account memory fragmentation and inexact byte array sizes."
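The rule of thumb in that quote is simple enough to write down. The sketch below applies it with a hypothetical figure of 12 bytes per edge, which is an assumption for illustration and not a number from Facebook's post; only the 1.5x factor comes from the engineers.

```java
// Hypothetical illustration of the engineers' rule of thumb.
class EdgeMemoryEstimate {

    // Overhead factor quoted by the engineers ("around 1.5x").
    static final double OVERHEAD_FACTOR = 1.5;

    // bytes per edge * number of edges * overhead factor
    static long estimateBytes(long edgeCount, int bytesPerEdge) {
        return (long) (edgeCount * (double) bytesPerEdge * OVERHEAD_FACTOR);
    }

    public static void main(String[] args) {
        long edges = 1_000_000_000_000L; // one trillion edges
        int bytesPerEdge = 12;           // assumed value, not from the source
        System.out.println(estimateBytes(edges, bytesPerEdge) + " bytes");
    }
}
```

Under that assumed edge size, a trillion edges works out to 18,000,000,000,000 bytes, or roughly 18TB, to be spread across the cluster's workers.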

The company also made enhancements to the aggregator architecture of the technology to remove bottlenecks that had formed when processing large amounts of data.

These changes have dramatically improved the performance of Giraph, Facebook says, allowing it to run an iteration of PageRank on a one-trillion-edge social graph – the largest test Giraph has ever undergone.
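For readers unfamiliar with the algorithm, here is a single-machine sketch of what one PageRank iteration does in the vertex-centric style Pregel popularized: every vertex splits its rank among its outgoing edges, then recomputes its rank from what it received. It uses plain arrays, not Giraph's API; the class name `PageRankIteration` is hypothetical, the 0.85 damping factor is the conventional choice and not a figure from Facebook, and vertices with no outgoing edges simply send nothing.

```java
import java.util.Arrays;

// Hypothetical illustration; not Giraph's API.
class PageRankIteration {

    // Conventional damping factor, assumed here.
    static final double DAMPING = 0.85;

    // outEdges[v] lists the vertices that v links to.
    // Returns the ranks after one iteration.
    static double[] iterate(int[][] outEdges, double[] ranks) {
        int n = ranks.length;
        double[] incoming = new double[n];

        // "Send" phase: each vertex splits its rank across its out-edges.
        for (int v = 0; v < n; v++) {
            if (outEdges[v].length == 0) {
                continue; // dangling vertex sends nothing in this sketch
            }
            double share = ranks[v] / outEdges[v].length;
            for (int target : outEdges[v]) {
                incoming[target] += share;
            }
        }

        // "Compute" phase: each vertex combines what it received.
        double[] next = new double[n];
        for (int v = 0; v < n; v++) {
            next[v] = (1.0 - DAMPING) / n + DAMPING * incoming[v];
        }
        return next;
    }

    public static void main(String[] args) {
        int[][] outEdges = { {1, 2}, {2}, {0} };
        double[] ranks = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
        System.out.println(Arrays.toString(iterate(outEdges, ranks)));
    }
}
```

In a distributed run the "send" phase becomes messages between workers, so the cost of an iteration scales with the number of edges, which is why the edge count is the headline figure.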

"The largest reported real-world benchmarked problem sizes to our knowledge are the Twitter graph with 1.5 billion edges... and the Yahoo! Altavista graph with 6.6 billion edges; our report of performance and scalability on a 1 trillion edge social graph is 2 orders of magnitude beyond that scale."

Few companies have to deal with graphs with trillions (or even billions) of edges for now, but as technologies like the internet of things are deployed widely and seas of sensors start beaming data into massive data stores, the tech will become increasingly relevant to organizations other than social networks, ad slingers (Google), and ecommerce shops (Amazon). ®

Biting the hand that feeds IT © 1998–2021