RDBMS grows up
The roots of competition with RDBMS are already there: Hadoop Pig is a data analysis program and Hive a data warehousing system. Cutting says parts of data warehousing and ETL might well be subsumed as these grow.
You laugh? Well, it’s not like RDBMS owns data analysis, data warehousing or ETL; these have only become synonymous with RDBMS because the software giants – Microsoft, Oracle and IBM – have poured vast sums of money into developing the tools that added value to their software by allowing them to take more workloads off the mainframe and midrange systems of the day. “There might be some battles where some existing technologies get incorporated into the big data sack,” Cutting says.
Co-existence, though, is the watchword as Cutting reckons Hadoop can find a new niche working with RDBMS. “The sweet spot for the growth is not to try to displace things like that but rather to attack problems people are having when they use those things. Over time the technologies could creep upstream and erode neighbouring technologies but, right now, there’s enough new stuff to keep us busy.”
The RDBMS giants, it seems, agree with Cutting's view that Hadoop won't compete for enterprise transactional loads such as payroll and inventory.
The other big reason Microsoft and Oracle have lowered their defences is that Hadoop will actually reinforce their positions and the position of RDBMS. Hadoop will bring larger numbers of developers to their databases. These devs will build applications for big data and the web that – in part – will use information held in RDBMS. Hadoop is built using open-source Java while the Avro project allows compiling in Java, C, C++, C#, Python and Ruby.
The big companies might be buying into Hadoop right now, but they could still pose problems and could help contribute to some kind of fragmentation down the road. That's because the RDBMS giants have taken sides: Microsoft has chosen to work with Hortonworks, formed in June 2011 with the engineering team who’d worked with Cutting on Hadoop but who’d remained at Yahoo!. IBM and Oracle, meanwhile, have gone with Cloudera, where Cutting is architect.
Cloudera and Hortonworks implement different modules of Hadoop in their distributions. You can compare the full list for Cloudera CDH here and Hortonworks' Data Platform here (warning PDF). Hortonworks has also added third-party software to its module mix: Talend's Open Studio.
There has already been tension between Cutting’s company and Hortonworks. They got into a bloggy spat last year over who contributed which code to Apache (fixes versus new features).
At the time of the spat, Baldeschwieler told The Reg this was business as usual for open source but he reckoned Hortonworks and Cloudera are united on the common cause of improving Hadoop. “It’s getting better,” Baldeschwieler said of Hortonworks’ relationship with Cloudera, adding: “There’s name-calling in all open-source projects...
“At this point,” he continued, “there’s an obvious consensus that Cloudera and Hortonworks are equally focused on making Hadoop better.”
Baldeschwieler called Hortonworks' partnership with Microsoft “an example of building a strong relationship to take Hadoop to more customers.”
Cutting hopes Hadoop can stay united using BigTop, which acts as a kind of reference model. BigTop is an Apache project that integrates core Hadoop with the Zookeeper, HBase, Hive, Pig, Mahout, Oozie, Sqoop, Flume and Whirr modules and with versions of Fedora, CentOS, Red Hat Enterprise Linux, SuSE Linux and Ubuntu. The basis of BigTop is Cloudera’s CDH and the idea is that all future versions of CDH will come from BigTop.
Cutting reckons BigTop will align CDH to the official Apache Hadoop project. While Cloudera has tried to align CDH to Apache releases, gaps have sometimes emerged because bug fixes and some features are back-ported to CDH versions that correspond to earlier Apache releases, Cutting said.
Cutting doesn’t see fragmentation as a big problem but did note that fewer versions of Hadoop would be a good thing – it would make things easier for developers and, presumably, better for firms like Cloudera, which hopes to build a Hadoop business based upon the distro which the market likes the most. “Fewer distributions would make it easier for developers, since they'd have fewer combinations of versions to support,” Cutting said.
Coding inside the BigTop
To mean anything, however, BigTop will need everybody – not just Cloudera – to support its development and to swallow up the code base into their distros. Currently, BigTop is rather Cloudera-centric. According to this blog post, and based on the Hortonworks HDP data sheet (warning PDF), Hortonworks includes only “parts” of BigTop.
“We're hoping that other distributors will join us in collaborating on BigTop to further reduce any such fragmentation issues,” Cutting told The Reg. “We currently have lots of folks collaborating well on these projects who don't share a distribution. Ideally we'll all start collaborating through BigTop and the various distributions will interoperate easily.”
BigTop comes as Cutting reckons further changes are needed to help Hadoop hit its potential – and establish that footing it aims for in mainstream IT. Changes won't be big, he says. Instead, they will be refinements.