It comes as no surprise that VMware wants companies to run everything virtually rather than on bare metal, and for several years it has pushed the idea of virtualizing the Hadoop stack to make it run better and easier to manage. The tool it created to do that, called Project Serengeti, now has some feature tweaks to try to entice more big data cluster builders give it a whirl.
With Serengeti 0.8.0, released Tuesday, the open source tool for virtualizing Hadoop now supports a number of new Hadoop releases plus adds features to make it easier to set up HBase data warehouses on top of Hadoop.
The update to Seregenti was announced in a blog post by Richard McDougall, principal engineer in the office of the CTO at the virtualization giant. "Most big-data environments consist of a mix of workloads," McDougall explains. "Serengeti's mission is to enable as many of the big-data family of workloads into the same theme park, all running on a common shared platform."
By virtualizing clusters you can run various parts of the big-data munching tools on shared hardware, dialing up virtual machines running each workload as needed, and dialing them back so other workloads can play.
It's all about elastic scaling, for which you pay a virtualization performance tax. For many workloads, as servers have been crammed to the gills with cores, this overhead has been acceptable.
VMware wants to layer big data tools on top of its ESXi server virtualization
Most companies probably don't think about their Hadoop clusters in this manner, and very likely do think about them as performing very specific functions. They're more worried about the turnaround time for batch jobs and queries and how other applications are dependent on the results of that work, and they don't want to pay a performance overhead for virtualization.
But VMware is going to keep plugging away at the idea that virtualization will allow for mixed-mode use of server clusters for all kinds of big-data jobs. So will the Pivotal group once Serengeti passes along with the Cloud Foundry platform cloud and EMC's Greenplum data warehouse and Hadoop distribution over to the Pivotal spinoff sometime later this year.
With the Serengeti 0.8.0 release, Cloudera's CDH4 and MapR Technologies' M5 Hadoop distributions are now supported running inside of virtual machine containers. The open source Apache 1.0 distribution was already supported, as was EMC's Greenplum HD 1.2., Cloudera CDH3, and Hortonworks Data Platform 1.0.
With the CHD4 release, Serengeti is aware that you can use the HDFS1 or HDFS2 file systems, and is also aware of the federated NameNode support that Cloudera has built into its Hadoop distro and knows how to configure these options.
And with MapR distros, Serengeti is similarly aware of the container location database (CLDB) used in the NFS-alike file system that MapR uses instead of HDFS, and is also in the know about the FileServer, JobTracker, and TaskTracker elements of the MapR stack, and how to package these up into virty machines and scale out their performance by replicating copies.
If you are looking to set up an HBase data warehouse, as you can see in the Serengeti 0.8.0 release notes, the VMware tool can create an HBase cluster, with an underlying HDFS file system and linked to the MapReduce data-muncher and the Thrift and RESTful APIs that are used to control HBase.
Serengeti also knows how to configure active and hot standby replicants of the HMaster nodes for the data warehouse, and can scale out HBase RegionalServers once the data warehouse is set up atop HDFS. HBase can be deployed in a virtualized manner by Serengeti on top of the Apache Hadoop. Cloudera, Hortonworks, or Greenplum distros – but not MapR distros, for some reason.
You can download the virtual machine appliance stuffed with Serengeti 0.8.0 here at the VMware site, and it doesn't cost anything to use. ®