Hadoop World There are a lot of gripes about the Hadoop Big Data muncher, and one of them is that it was designed as a batch-oriented system. A number of different add-ons to Hadoop have been created to try to make it more like more familiar relational databases (such as HBase) and their SQL query languages (Hive). But even these do not offer the kind of real-time performance that businesses expect. And so, for the last two years, Cloudera has been working on distributed parallel SQL query engine that will slide into the Hadoop stack and turn it into a real-time query engine.
The parallel query engine is known as Project Impala, and it is being launched on Wednesday at the Strata Hadoop World extravaganza in New York. The Impala query engine can run atop the raw Hadoop Distributed File System (HDFS) or the HBase tabular overlay for HDFS that makes it look somewhat like a relational database. And it supports the same SQL syntax as the Hive query tool, which is according to Charles Zedlewski, vice president of products at Cloudera, which is roughly compliant with the ANSI-92 SQL standard.
Like other parts of the Apache Hadoop project for which Cloudera, MapR Technologies, Hortonworks, IBM, and EMC are significant contributors, the Impala SQL query engine will be an open source project and therefore shared with the rest of the community. And it will join a number of other technologies that other players are coming up with to help make Hadoop more like a data warehouse and less like a batch data muncher. And, Zedlewski wants to be clear, it will not replace the MapReduce data chunking and chewing batch work that Hadoop clusters currently do for jobs where this is not only sufficient, but sensible based on economics and the not-so-real-time requirements of particular applications, such as search engine indexing or other kinds of bulk processing.
"MapReduce is by definition a batch system," Zedlewski explains. "There's no getting around that. And despite that Hadoop made a lot of hard things easy. But it also made a lot of easy things hard. We want to make the easy things easy again." This would mean, for example, querying vast sums of data stored in a "cheap and deep storage system" such as Hadoop using normal SQL, and to do so in a way that is more intuitive and that offers fast response time.
The Impala effort is being spearheaded by Marcel Kornacker, who was one of the architects of the query engine in Google's F1 fault tolerant, distributed relational database that now underpins its AdWords ad serving engine. Just like Hadoop MapReduce and HDFS were what happened when Doug Cutting (formerly of Yahoo! and now at Cloudera) read a paper on Google's MapReduce method of building a search engine index based on its Google File System, Impala is an homage of sorts to the F1 effort. (You can see a presentation on Google F1 here (PDF).)
By giving real-time SQL querying capabilities to Hadoop, Cloudera (and soon its peers with their own approaches), the essential message to customers is that rather than having to use two or three different platforms for Big Data, they can instead begin using just one. But companies like IBM and Teradata have some installed bases to protect, and it is not surprising that they tell customers to have a data warehouse storing operational data designed for business managers to ask lots of little questions (InfoSphere at IBM and Teradata at the company of that same name); an analytics platform with a faster parallel database so analysts can ask a few very complex queries and get fast answers (Netezza at IBM and Aster at Teradata); and then Hadoop for processing large amounts of web telemetry and other customer data to eventually be mashed up against operational data in the warehouse. Cloudera, meanwhile, says that it is time to stop moving data around between systems, creating massive amounts of extract/test/and load (ETL) busywork, and to start using the Hadoop elephant and his new Impala sidekick to do the whole shebang. (Which is interesting considering that Hadoop itself was created because Yahoo! wanted to avoid doing ETL busywork and just work on raw clickstream data.)
Cloudera says that 100TB of capacity on a traditional data warehouse costs somewhere between $2m and $10m, but adding 100TB of HDFS capacity to a Hadoop cluster costs something like $200,000. Cloudera is not loopy, and it realizes that companies will keep their operational analytics, business analytics, and reporting jobs in an enterprise data warehouse, but suggests that historical data, processing of raw data, ad hoc exploratory querying, and any transformational batch work can be moved to Cloudera CDH4 running Impala. CDH4 then becomes the "data hub" of legend – which data warehouse suppliers have been trying to position their gear as being for the past decade – with data marts spoking out from them.
Details of the Impala distributed SQL query engine
Generally, says Zedlewski, customers store data in HBase tables in text or sequence file formats, with a smattering of Avro and columnar data structures. Impala will support the text and sequence file formats as well as text, JSON, XML, JPGs, and all the other junk that typically gets dumped into HDFS.
With Impala, Cloudera is basically replacing chunks of the Hive query tool and HBase tabular data store for HDFS with its own code while keeping the good bits of these two tools and maintaining compatibility with the Hive and HBase APIs. And the underlying SQL engine in Impala is not reliant on MapReduce as Hive is, but rather on a fully massively parallel distributed database structure that has a new query planner, a new query coordinator, and a new query execution engine.
The metadata store for Hive is still used, so are JDBC and ODBC drivers for Hive, and so are any tables you have defined in HBase. And any business intelligence tool you created to ride on Hive and HBase that accesses the Hive metastore will still work. So you won't really know you are using Impala instead of Hive – and that is precisely the point and why it took two years for Cloudera to get Project Impala out the door.
"Small queries take under a second with Impala, and even large queries take a lot less time than with Hive and HBase before," says Zedlewski. SQL queries are dispatched in parallel to the chunks of data in the Hadoop cluster, which as you know keeps three copies of every data set for redundancy. Instead of dispatching a MapReduce job to a chunk of data on the cluster, Impala distributes SQL queries in parallel to chunks of data stored in either HBase or HDFS. This SQL processing works a lot faster than batch-oriented MapReduce and also does not require data format conversion.
How Impala and RTQ fit into the Cloudera Hadoop stack
The Impala code is open-sourced under an Apache license, like Hadoop and many of its related projects. It has been in private beta since May 2012 (which is coincidentally when Google first talked about the F1 database, which was already in production). Starting today, Impala goes into public beta and will be shipped independently of Cloudera's CDH4 release until it stabilizes after the beta program is over. Cloudera is recommending that beta customers be at the CDH4.1 release level if they want to play with Impala, and it seems likely that when CDH5 is ready sometime in late Q2 or early Q3 next year, Impala will be rolled into the release as a fully supported project.
While that may be a goal, Cloudera is not making any guarantees. What it is doing is giving customers ample warning as to what it is up to, and helping defend itself against other SQL approaches for Hadoop that are coming to market. Impala itself is expected to be stable and generally available in the first quarter of next year, says Zedlewski.
Along with the launch of Impala, Cloudera is also going to be repacking and repricing its Cloudera Manager, the closed-source tool it peddles along with support contracts for CDH to create a commercial distribution. Cloudera Manager plus tech support costs for $2,600 per node today.
At some point in the not-too-distant future, Cloudera is going to break Cloudera Manager into three pieces, with the management functions and tech support for the core Hadoop stack costing less than today's $2,600 per node price. At that stage, if you want Real Time Data (RTD) functions in Cloudera Manager that hook into HBase, then you will pay more. And if you want to hook into Impala to do Real Time Query (RTQ), then you will pay more on top of that. Cloudera has not settled on the price yet, but it will probably look something like $2,000 for the base function and then an additional grand or two each for the extra RTD and RTQ functions, if El Reg had to guess. ®