Big data is not just a problem because it is big, but because it keeps swelling. That goes as much for traditional data warehouses as it does for more modern Hadoop MapReduce data munchers. And with the latest update of its eponymous database, the Greenplum division of IT conglomerate EMC has made some tweaks to its homegrown database to make wrestling with big data a bit easier.
The Greenplum Database is available in two forms, just like its predecessor. One runs on Greenplum's own hardware appliance (which is based on hardware from an unspecified server OEM partner), and the other is a software-only distribution that customers can run on any x86 server machine that supports Red Hat Enterprise Linux, Oracle Solaris, or Apple's OS X.
The Greenplum Database is a parallelized and heavily customized version of the open source PostgreSQL database that has been optimized for ad hoc queries instead of transaction processing. It is a massively parallel, shared-nothing database, and its "polymorphic data storage" lets database administrators carve up a range of data in a database table and choose the row or column orientation a query should use, as well as the storage, execution, and compression settings that should apply to that segment of data.
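To make the idea of per-segment orientation concrete, here is a toy Python sketch. It is not Greenplum's implementation or API: the `Partition` and `Table` classes, the `sales` table, and the year ranges are all hypothetical names invented for illustration. It only shows how one table can keep older data in column orientation and recent data in row orientation while a query scans both transparently.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """One range of a table, stored in row or column orientation (toy model)."""
    name: str
    low: int           # inclusive lower bound of the partition key
    high: int          # exclusive upper bound of the partition key
    orientation: str   # "row" or "column"
    columns: tuple
    rows: list = field(default_factory=list)   # used when orientation == "row"
    cols: dict = field(default_factory=dict)   # used when orientation == "column"

    def insert(self, record):
        if self.orientation == "row":
            # Row orientation: keep each record together as one tuple.
            self.rows.append(tuple(record[c] for c in self.columns))
        else:
            # Column orientation: append each value to its own column list.
            for c in self.columns:
                self.cols.setdefault(c, []).append(record[c])

    def scan_column(self, column):
        if self.orientation == "row":
            i = self.columns.index(column)
            return [r[i] for r in self.rows]
        return list(self.cols.get(column, []))


class Table:
    """A table split into key ranges, each with its own storage orientation."""

    def __init__(self, key, partitions):
        self.key = key
        self.partitions = partitions

    def insert(self, record):
        for p in self.partitions:
            if p.low <= record[self.key] < p.high:
                p.insert(record)
                return
        raise ValueError("no partition covers this key")

    def total(self, column):
        # The query does not care how each partition is laid out.
        return sum(sum(p.scan_column(column)) for p in self.partitions)


COLUMNS = ("year", "customer", "amount")
sales = Table("year", [
    Partition("history", 2000, 2011, "column", COLUMNS),  # hypothetical ranges
    Partition("current", 2011, 2013, "row", COLUMNS),
])
sales.insert({"year": 2009, "customer": "acme", "amount": 40})
sales.insert({"year": 2012, "customer": "acme", "amount": 2})
print(sales.total("amount"))
```

In a real warehouse the column-oriented segments would typically also be the ones that compress best, since values of the same type sit next to each other.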
Like other data warehouse engines, the Greenplum Database is a heavy user of data compression to speed up queries and reduce disk storage capacity needs.
Greenplum's Hadoop distributions are similarly available on the same hardware appliances – with some tweaks – as well as a software-only product that can run on any Linux-based x86 server.
Back in December, Greenplum unveiled its long-range plan to mash up its data warehousing and Hadoop stacks to create a giant data muncher called the Unified Analytics Platform.
The building blocks of the Greenplum Database
With Greenplum Database 4.2, the EMC unit is doing a few different things. First, as it promised back in December, Greenplum has tweaked its parallel data warehouse loading technology, called gNET, so it can import and export data in parallel between a warehouse and a Hadoop cluster.
Equally significant, the gNET feature in the 4.2 release of the relational database can reach into the Hadoop cluster and query data right where it sits, using some of the Hadoop cluster's resources instead of burdening the iron running the data warehouse.
"This used to be a read-only tool," explains Mike Maxey, senior director of product marketing at Greenplum. "Now it leaves more data in Hadoop and does more processing inside Hadoop."
Greenplum Database 4.2 also includes a new management console called Command Center, which replaces an older tool called PerfMon that database admins have been using up until now. Maxey says Command Center, unlike PerfMon, is a Web-based tool and has more functions that database admins have been looking for, such as the ability to start, stop and initialize databases on the fly, recover and rebalance database mirrors, and search, prioritize, or cancel any query on the system.
Command Center is also able to reach out across the network into a Greenplum HD or MR Hadoop cluster and check the state of the cluster from inside this console. "Over time, Command Center will evolve to have broader and deeper coverage of the database and Hadoop platforms," says Maxey.
The first release of Command Center is available with the Data Computing Appliance 1.2 system, and will eventually be available in the software-only distribution.
The 4.2 release of the database has the requisite performance tweaks, including dynamic partition elimination and query memory optimization. The database also has a new package manager that does automatic installation and updating of extensions to the database on a running system with multiple nodes and different features running hither and yon.
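The announcement does not describe how dynamic partition elimination works inside Greenplum, so the following is only a sketch of the general technique as it is commonly understood: when the keys a query needs are known only at run time (for example, after scanning a small dimension table), the engine can use them to skip partitions that cannot contain a match. The function name `partitions_to_scan` and the range layout are assumptions made for this example.

```python
from bisect import bisect_right


def partitions_to_scan(lower_bounds, runtime_keys):
    """Return indexes of the range partitions that could hold any runtime key.

    lower_bounds is a sorted list of inclusive lower bounds; partition i
    covers [lower_bounds[i], lower_bounds[i + 1]), and the last partition
    is open-ended. Keys below the first bound match no partition.
    """
    selected = set()
    for key in runtime_keys:
        i = bisect_right(lower_bounds, key) - 1
        if i >= 0:
            selected.add(i)
    return sorted(selected)


# Four hypothetical partitions: [0,100), [100,200), [200,300), [300, ...)
bounds = [0, 100, 200, 300]

# Keys discovered at run time, e.g. from a join against a small table.
keys_from_join = [5, 250, 260]

print(partitions_to_scan(bounds, keys_from_join))
```

With these inputs only two of the four partitions need to be read; the others are eliminated without touching their data.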
Finally, EMC has integrated its Data Domain Boost data de-duplication backup software with Greenplum Database 4.2. In benchmark tests, EMC was able to back up a 173TB data warehouse in under eight hours. This was achieved by spreading parts of the Data Domain de-duping operations over the data warehouse nodes in an appliance, thus parallelizing the massive job and making the backup run faster because the de-duping was faster.
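As a rough illustration of why spreading the de-duping work helps, the Python sketch below has each "node" fingerprint its own chunks in parallel, after which only chunks not already stored are kept. This is a toy model, not Data Domain Boost: the function names, the use of SHA-256, and the whole-chunk granularity are assumptions chosen for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fingerprint_chunks(chunks):
    """Work done on a warehouse node: hash each local chunk."""
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]


def backup(nodes_chunks):
    """Fingerprint every node's chunks in parallel, store each unique chunk once.

    Returns the chunk store and the number of bytes actually stored.
    """
    store = {}
    stored_bytes = 0
    workers = max(1, len(nodes_chunks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves node order, so the result is deterministic.
        for result in pool.map(fingerprint_chunks, nodes_chunks):
            for digest, chunk in result:
                if digest not in store:
                    store[digest] = chunk
                    stored_bytes += len(chunk)
    return store, stored_bytes


# Three hypothetical nodes holding 20 bytes in total, with duplicates.
nodes = [[b"aaaa", b"bbbb"], [b"aaaa", b"cccc"], [b"bbbb"]]
store, stored_bytes = backup(nodes)
print(len(store), stored_bytes)
```

For scale, and assuming decimal terabytes, 173TB in eight hours is 173e12 / (8 × 3600) ≈ 6.0e9 bytes per second, so finishing in under eight hours implies a sustained logical rate of at least about 6GB/sec across the appliance.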
At the Strata Conference today in Santa Clara, California, in addition to launching the new database release, Greenplum is also talking up its ability to run Greenplum MR Hadoop atop Cisco Systems' C-Series rack-based servers. El Reg already told you all about that two weeks ago. ®