Updated Cloudera might have been the first company to try becoming the Red Hat for stuffed elephants, but with MapR, Hortonworks, IBM, Oracle, DataStax, and EMC all trying to commercialize Hadoop, Cloudera has to keep on its toes and perhaps even balance on a ball.
That's because the underlying Hadoop data muncher is an open source project, so any company that wants to make money on the big data wave has to add value above and beyond the core Hadoop stack. Cloudera thinks it has an edge in managing Hadoop clusters, and believes further that it will extend its lead with a new control freak for its Hadoop distro, called, appropriately enough, Cloudera Manager.
Like other open source companies founded in recent years, Cloudera embraces an "open core" distribution model. That means it wraps up the key open source elements of a particular project – in this case the Hadoop MapReduce application, its underlying Hadoop Distributed File System (HDFS), and a bunch of other things – distributes this for free, and offers commercial-grade support for the stack. But those engaging in open core distribution also peddle closed source add-ons for the open source tools, usually under a perpetual or subscription license with an annual support contract, and usually offering finer-grained management, more scalability, connectors to third party products, and such.
With Cloudera, the open core is the Cloudera Distribution of Apache Hadoop (abbreviated CDH with a version number) and the extended product is called Cloudera Enterprise (also with a version number). The current open source distro from Cloudera is CDH3, which debuted in April, was updated in June, and is due for another update next year, Charles Zedlewski, vice president of products at Cloudera, tells El Reg. Tracking alongside this open source distro is Cloudera Enterprise, which is being upgraded to the 3.7 release level with the addition of a slew of proactive management tools.
In addition, with today's announcement, Cloudera is breaking its Hadoop control freak free of the Cloudera Enterprise stack and offering a freebie version of the tool alongside the full Cloudera Manager, which includes functionality that used to be part of the Cloudera Management Suite console that was bundled only with Cloudera Enterprise.
The upshot is that Cloudera now has companies that are doing proofs of concept covered with the combination of its CDH3 distro and Cloudera Manager Free Edition, and production customers who want the full-on Cloudera Manager linking into Cloudera Enterprise with the 3.7 release.
Cloudera Manager can gather and scan Hadoop logs from the servers in the cluster to look for weird stuff and can even do proactive checking for HDFS and its increasingly popular column-oriented database overlay, HBase. The Hadoop control freak can also send alerts to cluster managers when nodes or services are running slowly or starting to fail; this alerting system has hooks into popular IT management frameworks for consolidating alerts to sysadmins.
Cloudera Manager also has a feature called global time control, which correlates logs, system changes, configuration, running jobs, and other aspects of the Hadoop cluster to help admins figure out what went wrong when something inevitably does (as is the case with all complex systems). All of this information is stored in a MySQL database with near-realtime access.
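To make the idea of correlating by time a little more concrete, here is a minimal Python sketch. It is not Cloudera's implementation and says nothing about how the product actually stores or queries its data; the event records, field names, and the `events_near` helper are all hypothetical, invented purely for illustration. It simply shows how events from different sources can be lined up around the moment a problem was noticed.

```python
from datetime import datetime, timedelta

# Hypothetical events from different sources in a cluster (made-up data).
EVENTS = [
    {"time": datetime(2011, 12, 1, 10, 0), "source": "config",
     "detail": "heap size changed on node07"},
    {"time": datetime(2011, 12, 1, 10, 4), "source": "job",
     "detail": "job_0042 started"},
    {"time": datetime(2011, 12, 1, 10, 6), "source": "log",
     "detail": "DataNode timeout on node07"},
    {"time": datetime(2011, 12, 1, 13, 30), "source": "log",
     "detail": "routine block report"},
]


def events_near(events, moment, window_minutes=10):
    """Return events within +/- window_minutes of moment, oldest first."""
    window = timedelta(minutes=window_minutes)
    nearby = [e for e in events if abs(e["time"] - moment) <= window]
    return sorted(nearby, key=lambda e: e["time"])


# Assume the timeout in the log is the failure an admin is investigating.
failure_time = datetime(2011, 12, 1, 10, 6)
related = events_near(EVENTS, failure_time)

for event in related:
    print(event["time"].isoformat(), event["source"], event["detail"])
```

In this toy data set, the configuration change and the job start both land inside the ten-minute window around the timeout, while the unrelated afternoon log entry drops out, which is the sort of narrowing-down an admin wants when chasing a fault.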
For more sophisticated diagnoses, Cloudera Manager now has a snapshotting feature that can do a core dump on the system state of nodes in the cluster on a scale of minutes to an hour. The snapshot captures versions of systems and software stacks, settings, logs, and any changes occurring on the system, packages all this data up into a file, and sends it off to a sysadmin or Cloudera for debugging and tuning. The time scale on the snapshot is adjustable, but the intent is to keep the file size down in the megabytes so it can be tagged to a specific event in the cluster that needs some work.
Cloudera Manager has all the bells and whistles and is intended for production Hadoop clusters, while Cloudera Manager Free Edition is intended for customers who do not yet need alerting, roll-backs, log search, event management, or proactive health management on their clusters. The free edition, available as a download, is not open source, and only scales up to 50 nodes in a Hadoop cluster.
By the way, the full-on Cloudera Manager 3.7 will work on either the CDH3 or Cloudera Enterprise versions of the stack available from Cloudera, since Cloudera Enterprise is based on the exact same code-set as CDH3. Zedlewski says that the new management tool has been tested on clusters with more than 1,000 nodes and running 10,000 to 15,000 processes.
Cloudera did not originally provide pricing for its Cloudera Enterprise stack or support contracts, but said that the stack is priced on a per-node annual subscription with Cloudera Manager having a per-user annual subscription. But the day after this story ran, the company reconsidered this position and said that it charges $4,000 per node per year for a subscription to Cloudera Enterprise. This is, by the way, precisely the same fee that MapR Technologies is charging for its M5 Hadoop stack and what reseller EMC/Greenplum is charging for its Greenplum HD rebranding of M5.
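For a back-of-envelope feel for what that per-node fee adds up to, here is a small Python sketch. The cluster sizes are hypothetical, and the calculation covers only the quoted $4,000 per node per year; it leaves out the per-user Cloudera Manager subscription and any volume discounts, since the company gave no figures for either.

```python
PRICE_PER_NODE_PER_YEAR = 4_000  # USD, the Cloudera Enterprise figure quoted above


def annual_subscription(nodes):
    """List-price annual cost in USD for a cluster with the given node count."""
    if nodes < 0:
        raise ValueError("node count cannot be negative")
    return nodes * PRICE_PER_NODE_PER_YEAR


# Hypothetical cluster sizes, chosen only to show how the cost scales.
for nodes in (50, 200, 1000):
    print(f"{nodes:>5} nodes: ${annual_subscription(nodes):,} per year")
```

On those assumptions, a 50-node cluster comes to $200,000 a year at list price and a 1,000-node cluster to $4m.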
Cloudera also does not divulge how many customers it has, but Zedlewski tells El Reg that it has more than 100 customers who in turn have many hundreds of clusters with tens of thousands of nodes running its commercial-grade Cloudera Enterprise. The company is not willing to guess publicly how many CDH3-based clusters there might be out there in the world. ®