Hadoop might be a popular tool for munching on unstructured data, but setting up and tuning the software requires a lot more expertise than many people have and it takes a lot of time, too. That makes it a perfect piece of software to put on a cloud, provided you can either generate your data there to begin with or pipe it over there once you gin it up.
If you want to run Hadoop on a cloud, you can either buy raw server and storage capacity on Amazon's EC2 cloud and set it up yourself or you can use the Elastic MapReduce service on Amazon's cloud. Another alternative is to spin up Hadoop services on Microsoft's Azure cloud, which is in tech preview and which is expected to be commercially available soon.
And starting today, if you are feeling lazy or just experimenting, you have another option. Skytap, the other cloud backed by Amazon founder Jeff Bezos, is preconfiguring Hadoop to run on its test and development cloud.
Skytap was founded in 2006 (the same year that AWS debuted) under the name Illumita, and launched its cloud and new name in 2008. The company runs its Skytap Cloud in a co-location facility managed by Savvis outside of Seattle. The reason Skytap built its own cloud rather than just running on Amazon Web Services is that, as the company has explained in the past, the snapshotting and storage features on the Amazon cloud are too slow for test and dev environments, even if they are suitable for production environments.
The company's homegrown cloud control freak is not commercially available for you to run in your own data center, but it is OEMed by services giant CSC, which slaps its brand on it and runs a test and development cloud service for its customers out of its own Chicago data center. The Skytap Cloud uses VMware's ESXi hypervisor to partition capacity.
Skytap has several hundred customers and does not disclose the size of its cloud, but does say that since it went live, over 1.9 million virtual machines have been launched on the service, up from 1 million last April.
In the August 2011 release of the Skytap service, the company added self-service cloud orchestration and hub and spoke network configurations, and in the April 2012 release it added a new management interface - and also made it puke out reports to get beancounters off your back about the cost of services.
With today's announcement, Skytap has created server templates for the CDH4 Hadoop distribution from Cloudera, which debuted back in June 2012. Specifically, Skytap is running the Cloudera Enterprise Free edition of CDH4, which is enabled to scale up to 50 nodes (physical or virtual), plus the Cloudera Manager graphical Hadoop management tool.
The Cloudera CDH4 setup on the Skytap Cloud puts three virtualized servers together into a baby cluster. One template sets up the NameNode and other Hadoop management tools (Hive, Oozie, and ZooKeeper) on a virtual machine with two virtual CPUs and 2GB of memory and 40GB of virtual storage.
The other template creates a base compute node for the Hadoop cluster and its underlying Hadoop Distributed File System (HDFS). This Hadoop compute image has one virtual CPU, 1GB of virtual memory, and 40GB of disk capacity. The base cluster has one management node and two compute nodes, and you can add up to 48 more virtual nodes without invoking a license fee and support contract from Cloudera. You can also scale up the CPU, memory, and storage capacity on these images as your workload requires.
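For a rough sense of what those template sizes add up to, here is a minimal Python sketch that totals the virtual CPUs, memory, and disk for one management node plus a given number of compute nodes. The per-node figures are the ones quoted above; the `Template` class and `cluster_totals` function are illustrative names made up for this example, not anything Skytap or Cloudera ships, and the sums assume you have not scaled up the default images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Resources for one virtual machine image (or a sum of several)."""
    vcpus: int
    memory_gb: int
    disk_gb: int


# Default sizes quoted in the article for the two Skytap templates.
MANAGEMENT = Template(vcpus=2, memory_gb=2, disk_gb=40)
COMPUTE = Template(vcpus=1, memory_gb=1, disk_gb=40)


def cluster_totals(compute_nodes: int) -> Template:
    """Sum resources for one management node plus `compute_nodes` compute nodes."""
    if compute_nodes < 0:
        raise ValueError("compute_nodes must be non-negative")
    return Template(
        vcpus=MANAGEMENT.vcpus + compute_nodes * COMPUTE.vcpus,
        memory_gb=MANAGEMENT.memory_gb + compute_nodes * COMPUTE.memory_gb,
        disk_gb=MANAGEMENT.disk_gb + compute_nodes * COMPUTE.disk_gb,
    )


if __name__ == "__main__":
    # Base cluster (two compute nodes), then the base plus 48 added nodes.
    for label, nodes in (("base cluster", 2), ("base plus 48 nodes", 50)):
        print(label, cluster_totals(nodes))
```

On these assumptions the base cluster comes to four virtual CPUs, 4GB of memory, and 120GB of disk, which is why it is best thought of as a sandbox rather than a place to crunch real data.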
All of the virtual nodes in the Hadoop cluster run Canonical's Ubuntu Server 12.04 LTS release, and Skytap's own SmartClient provides root access with a command line to all of the virtual Hadoop nodes.
If you want to develop and test applications on a Hadoop cluster that is larger than 50 nodes, or if you want to run workloads in production on Skytap, you have to get your own support contract from Cloudera. And if you don't need all of the nodes up and running all the time, you can turn off the server nodes and just keep their data on storage (which you have to pay for, of course) and then fire them back up when you need them. You can't do this with a real Hadoop cluster, of course.
Single cloud from a single vendor? That's so yesterday...
The biggest crime you can commit with a physical cluster is to not have work for it to do, and even if you turn it off, it is still unutilized capital. That's why Amazon doesn't believe in private clouds and only believes in shared public clouds. (Well, except when it comes to its own data centers for running its online retail business, of course.)
Brett Goodwin, vice president of marketing at Skytap, says that the company is not making a commitment to offer templates for Hadoop distributions from MapR Technologies or Hortonworks, the two other big commercial disties that sell supported versions of the open source Apache Hadoop stack.
"Cloudera was the obvious choice because they are the number one distribution for Hadoop," says Goodwin. "We're going to roll this out and get a sense from customers [about] what else they might want."
Microsoft Azure has tapped Hortonworks for its Hadoop variant, and Amazon peddles the Elastic MapReduce service - running either kosher open source Apache Hadoop or the M3 or M5 variants of Hadoop from MapR. The latter, M5, can make HDFS look like a Network File System share and makes it mountable by NFS clients.
The obvious thing for Skytap to do is to help programmers create dev and test environments for Hadoop and then do some kind of conversion of their apps so they can run either on raw EC2 clouds or on the Elastic MapReduce service over at Amazon Web Services. When El Reg brought this up, Goodwin did a little hemming and hawing, and while not making a commitment to officially run production workloads on Skytap, the company admits that some customers already do that and in the long run, this will be officially sanctioned for Hadoop and other workloads.
"Today, where we are seeing demand is for test and development, proof of concepts, and training," explains Goodwin. "We do believe that customers will want to move applications into production, and it will be important for Skytap to provide production environments. But make no mistake. We expect a multi-cloud, federated world. We don't think an application has to be run on a single cloud from a single vendor."
Skytap does anticipate that companies will create and test their Hadoop applications on its eponymous cloud and then run the real workloads on internal physical Hadoop clusters, and in some cases, says Goodwin, customers will want to burst from their internal Hadoop clusters out to the Skytap cloud if they are running shy on internal capacity. This cloudbursting is enabled through Skytap's AutoNetworks multi-VPN networking software.
Of course, Bezos could always have Amazon Web Services buy Skytap and keep all of its goodies for itself. This is not just possible but probable if Skytap has better technology for DevOps than Amazon has created in-house for its AWS cloud. Such an acquisition could be relatively expensive, with Skytap having raised $23m in three rounds of venture funding, including cash from Bezos Expeditions as well as Ignition Partners, Madrona Venture Group, Washington Research Group, and OpenView Venture Partners. No matter what, you can bet that Bezos doesn't want Skytap to fall into enemy hands – and AWS has a lot of enemies out there these days. ®