Cloudera's Project Impala rides herd with Hadoop elephant in real-time

Life's no longer a batch


Hadoop World There are a lot of gripes about the Hadoop Big Data muncher, and one of them is that it was designed as a batch-oriented system. A number of add-ons have been created to try to make Hadoop look more like the familiar relational databases (HBase provides tables, for instance) and their SQL query languages (Hive). But even these do not offer the kind of real-time performance that businesses expect. And so, for the last two years, Cloudera has been working on a distributed parallel SQL query engine that will slide into the Hadoop stack and turn it into a real-time query engine.

The parallel query engine is known as Project Impala, and it is being launched on Wednesday at the Strata Hadoop World extravaganza in New York. The Impala query engine can run atop the raw Hadoop Distributed File System (HDFS) or the HBase tabular overlay for HDFS that makes it look somewhat like a relational database. And it supports the same SQL syntax as the Hive query tool, which, according to Charles Zedlewski, vice president of products at Cloudera, is roughly compliant with the ANSI-92 SQL standard.
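For a sense of what that compatibility means in practice, here is a minimal sketch of a client firing ordinary ANSI-92-style SQL at an Impala daemon. It uses the Python impyla client, which postdates this beta, and the hostname, port, tables, and columns are all hypothetical, not anything Cloudera has published:

```python
# A minimal sketch, not Cloudera's own tooling: submit plain
# ANSI-92-style SQL to an Impala daemon. The impyla client is a
# later arrival; host, table, and column names are hypothetical.
from impala.dbapi import connect

conn = connect(host='impala-node.example.com', port=21050)
cur = conn.cursor()

# The same statement Hive would accept -- an ordinary join and aggregate.
cur.execute("""
    SELECT u.region, COUNT(*) AS hits
    FROM clicks c
    JOIN users u ON c.user_id = u.user_id
    WHERE c.ts >= '2012-10-01'
    GROUP BY u.region
    ORDER BY hits DESC
""")

for region, hits in cur.fetchall():
    print(region, hits)
```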

Like other parts of the Apache Hadoop project, to which Cloudera, MapR Technologies, Hortonworks, IBM, and EMC are significant contributors, the Impala SQL query engine will be an open source project, shared with the rest of the community. It will join a number of other technologies that other players are coming up with to help make Hadoop more like a data warehouse and less like a batch data muncher. And, Zedlewski wants to be clear, it will not replace the MapReduce data chunking and chewing that Hadoop clusters currently do for jobs where batch is not only sufficient but sensible, given the economics and the not-so-real-time requirements of particular applications, such as search engine indexing and other kinds of bulk processing.

"MapReduce is by definition a batch system," Zedlewski explains. "There's no getting around that. And despite that Hadoop made a lot of hard things easy. But it also made a lot of easy things hard. We want to make the easy things easy again." This would mean, for example, querying vast sums of data stored in a "cheap and deep storage system" such as Hadoop using normal SQL, and to do so in a way that is more intuitive and that offers fast response time.

The Impala effort is being spearheaded by Marcel Kornacker, who was one of the architects of the query engine in Google's F1 fault tolerant, distributed relational database that now underpins its AdWords ad serving engine. Just like Hadoop MapReduce and HDFS were what happened when Doug Cutting (formerly of Yahoo! and now at Cloudera) read a paper on Google's MapReduce method of building a search engine index based on its Google File System, Impala is an homage of sorts to the F1 effort. (You can see a presentation on Google F1 here (PDF).)

By giving Hadoop real-time SQL querying capabilities, Cloudera (and soon its peers, with their own approaches) is sending customers an essential message: rather than having to use two or three different platforms for Big Data, they can begin using just one. But companies like IBM and Teradata have installed bases to protect, and it is not surprising that they tell customers to have a data warehouse storing operational data, designed for business managers to ask lots of little questions (InfoSphere at IBM and Teradata at the company of the same name); an analytics platform with a faster parallel database, so analysts can ask a few very complex queries and get fast answers (Netezza at IBM and Aster at Teradata); and then Hadoop for processing large amounts of web telemetry and other customer data to eventually be mashed up against operational data in the warehouse. Cloudera, meanwhile, says that it is time to stop moving data around between systems, creating massive amounts of extract/transform/load (ETL) busywork, and to start using the Hadoop elephant and his new Impala sidekick to do the whole shebang. (Which is interesting, considering that Hadoop itself was created because Yahoo! wanted to avoid ETL busywork and just work on raw clickstream data.)

Cloudera says that 100TB of capacity on a traditional data warehouse costs somewhere between $2m and $10m, while adding 100TB of HDFS capacity to a Hadoop cluster costs something like $200,000 – a tenth to a fiftieth of the price. Cloudera is not loopy, and it realizes that companies will keep their operational analytics, business analytics, and reporting jobs in an enterprise data warehouse, but it suggests that historical data, processing of raw data, ad hoc exploratory querying, and any transformational batch work can be moved to Cloudera CDH4 running Impala. CDH4 then becomes the "data hub" of legend – which data warehouse suppliers have been trying to position their gear as for the past decade – with data marts spoking out from it.

Details of the Impala distributed SQL query engine

Generally, says Zedlewski, customers store data in Hive tables in text or sequence file formats, with a smattering of Avro and columnar data structures. Impala will support the text and sequence file formats, as well as JSON, XML, JPGs, and all the other junk that typically gets dumped into HDFS.
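As a rough illustration of what querying that junk in place looks like, the sketch below declares an external table over raw tab-delimited text files already sitting in HDFS and then queries them without any load or conversion step. The DDL is standard Hive-flavored syntax, but the HDFS path, table name, and schema are invented for the example, and the impyla client is again a later arrival:

```python
# Sketch only: declare a table over raw text files already in HDFS
# and query them in place -- no ETL, no format conversion. The path,
# table name, and columns are hypothetical.
from impala.dbapi import connect

conn = connect(host='impala-node.example.com', port=21050)
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_clicks (
        ts      STRING,
        user_id BIGINT,
        url     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    STORED AS TEXTFILE
    LOCATION '/data/raw/clicks'
""")

# The text files are now queryable exactly where they landed.
cur.execute("SELECT COUNT(*) FROM raw_clicks")
print(cur.fetchone()[0])
```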

With Impala, Cloudera is basically replacing chunks of the Hive query tool and the HBase tabular data store for HDFS with its own code, while keeping the good bits of those two tools and maintaining compatibility with the Hive and HBase APIs. The underlying SQL engine in Impala does not rely on MapReduce, as Hive does, but rather on a massively parallel distributed database architecture with a new query planner, a new query coordinator, and a new query execution engine.

The metadata store for Hive is still used, as are the JDBC and ODBC drivers for Hive, and any tables you have defined in HBase. Any business intelligence tool you created to ride on Hive and HBase that accesses the Hive metastore will still work. So you won't really know you are using Impala instead of Hive – and that is precisely the point, and why it took Cloudera two years to get Project Impala out the door.
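To make the "you won't really know" claim concrete, here is a hedged sketch of the swap from a client's point of view: the same SQL against the same metastore-defined table, with only the connection endpoint differing. The hostnames are hypothetical, and the ports (10000 for HiveServer2, 21050 for an Impala daemon) are the conventional defaults rather than anything specific to this beta:

```python
# Sketch: one query, one metastore-defined table, two engines.
# Only the endpoint differs; hostnames are hypothetical and the
# ports are the conventional HiveServer2 and impalad defaults.
from impala.dbapi import connect

QUERY = "SELECT url, COUNT(*) AS hits FROM raw_clicks GROUP BY url"

# Batch path: HiveServer2, which plans the query as MapReduce jobs.
hive = connect(host='gateway.example.com', port=10000,
               auth_mechanism='PLAIN').cursor()

# Interactive path: an Impala daemon reading the same tables, no MapReduce.
impala = connect(host='gateway.example.com', port=21050).cursor()

for cur in (hive, impala):
    cur.execute(QUERY)
    print(cur.fetchall()[:5])  # same rows either way; Impala just answers faster
```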

"Small queries take under a second with Impala, and even large queries take a lot less time than with Hive and HBase before," says Zedlewski. SQL queries are dispatched in parallel to the chunks of data in the Hadoop cluster, which as you know keeps three copies of every data set for redundancy. Instead of dispatching a MapReduce job to a chunk of data on the cluster, Impala distributes SQL queries in parallel to chunks of data stored in either HBase or HDFS. This SQL processing works a lot faster than batch-oriented MapReduce and also does not require data format conversion.

How Impala and RTQ fit into the Cloudera Hadoop stack

The Impala code is open-sourced under an Apache license, like Hadoop and many of its related projects. It has been in private beta since May 2012 (coincidentally the same month Google first talked about the F1 database, which was already in production by then). Starting today, Impala goes into public beta and will be shipped independently of Cloudera's CDH4 release until it stabilizes after the beta program is over. Cloudera is recommending that beta customers be at the CDH4.1 release level if they want to play with Impala, and it seems likely that when CDH5 is ready sometime in late Q2 or early Q3 next year, Impala will be rolled into the release as a fully supported project.

While that may be a goal, Cloudera is not making any guarantees. What it is doing is giving customers ample warning as to what it is up to, and helping defend itself against other SQL approaches for Hadoop that are coming to market. Impala itself is expected to be stable and generally available in the first quarter of next year, says Zedlewski.

Along with the launch of Impala, Cloudera is also going to be repackaging and repricing its Cloudera Manager, the closed-source tool it peddles along with support contracts for CDH to create a commercial distribution. Cloudera Manager plus tech support costs $2,600 per node today.

At some point in the not-too-distant future, Cloudera is going to break Cloudera Manager into three pieces, with the management functions and tech support for the core Hadoop stack costing less than today's $2,600 per node price. At that stage, if you want Real Time Data (RTD) functions in Cloudera Manager that hook into HBase, then you will pay more. And if you want to hook into Impala to do Real Time Query (RTQ), then you will pay more on top of that. Cloudera has not settled on the price yet, but it will probably look something like $2,000 for the base function and then an additional grand or two each for the extra RTD and RTQ functions, if El Reg had to guess. ®
