Yahoo! defies Facebook with Hadoop SQL dupe
Open-source disharmony in stuffed elephant land
Hadoop Summit Much to the chagrin of Facebook, Yahoo! is developing its own SQL-like language for Hadoop, the open-source distributed data-crunching platform that's well on its way to conquering the planet.
Facebook has already developed and open-sourced its own Hadoop SQL, known as Hive. But Yahoo! says it needs a Hive alternative that's better suited to moving its own back-end data onto Hadoop.
"We looked at [Hive], and for the type of problems we're solving, it didn't work quite as well for us," Yahoo! senior vice president for engineering, cloud computing, and data infrastructure told The Reg today at the annual Hadoop Summit in Santa Clara, California.
"One of the internal platforms we're moving to Hadoop uses a version of SQL, and in order to make the migration bit easier. It made more sense to build our own [Hadoop SQL]."
Facebook is bemused. "We've tried to convince them to use [Hive]," Facebook engineering manager Ashish Thusoo told us. "I don't know why they're doing this."
Hadoop mimics Google's MapReduce framework, which maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. You can write straight to the framework in Java, but Hive and other languages let you code at a higher level. Less experienced developers can build apps in a fraction of the time - and with a fraction of the code.
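For readers who want to see the map-then-reduce pattern in miniature, here is a rough single-machine sketch in Python. It is not Hadoop code and uses none of Hadoop's APIs. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are invented for illustration, and the "distributed machines" are simulated by a plain list of text chunks.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of input into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the mapped values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce over chunks that stand in for separate machines."""
    mapped = []
    for chunk in chunks:  # on a real cluster, each chunk is handled in parallel
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two chunks, as if split across two machines.
counts = word_count(["pig hive pig", "hive hadoop"])
print(counts)
```

A higher-level language such as Hive or Pig aims to generate this kind of plumbing from a short query, which is where the savings in code come from.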
Yahoo! has already built and open-sourced a Hadoop language known affectionately as Pig, which sits somewhere between low-level MapReduce code and the much higher level of Hive. Now, it wants to provide its internal developers with a Hive-like option - but not Hive itself.
Hadoop already offers a second SQL-like language, known as CloudBase, but this option never quite made it in the real world. Meanwhile, Hive is widely used by Facebook itself and countless other outfits, including ad-obsessed operations like AdMob and Adknowledge. At Facebook, it's used to crunch data for everything from the site's Google Trends-like Lexicon tool to, yes, its ad placement system.
"We've been doing this for about a year and a half or two years, and we're open source," Facebook software engineer Joydeep Sen Sarma told us. "We're much further out in terms of developing a classic developing environment. They're playing catch-up. It will take them time to support all our functionality that we have built-in."
Basically, Yahoo! is taking SQL and placing it on top of Pig. And on some level, the Facebookers understand why Yahoo! would do so, but in the end, they vote for Hive harmony. "Hive solves exactly the problems they are trying to solve," Sen Sarma said. "And it's being used by large customers of SQL software...This was a brilliant opportunity to collaborate, and we would have embraced it."
Yes, Pig predates Hive. So, in developing its Hadoop SQL, Facebook took its own path as well. But, Thusoo told us, it was trying to do something very different from Pig. For one thing, Pig operates at a lower level. Plus, when first developed, it couldn't handle scripts written in other languages. Hive was designed for embedded scripts from the beginning.
"Pig is both an imperative and a declarative language," Thusoo told us. "But our philosophy was: If you're going to do declarative, why not use SQL? And why not let people embed scripts in the language of their choice in the SQL?" Since then, Pig has embraced such scripting.
Yahoo! has not open-sourced its Hadoop SQL, but plans to. "It will somewhat suit our needs for migration, but it's relevant to anyone - so folks will have a choice," Yahoo!'s Shugar said.
No, it doesn't have a name. But in classic Hadoop fashion, it will undoubtedly evoke some sort of fauna. Famously, Hadoop is named for a yellow stuffed elephant. ®