Cloudera launches SaaS platform for the lakehouse crowd
And it IS a crowd – marketplace is busy, so it's hoping open approach sets it apart
Former Hadoop stalwart Cloudera has announced a fully managed software as a service (SaaS) version of its data platform which it claims is more open than rivals in the over-crowded market.
With the product Cloudera Data Platform (CDP) One — initially available only on AWS — Cloudera promises analytics and data exploration in a single platform.
Adopting the term "lakehouse" (coined by Databricks to bring together the messy world of data lakes with the ordered approach of data warehouses), Cloudera is also claiming the new product offers a set of low-code data engineering and exploration tools to improve efficiency for expert business users.
Cloudera merged with Hortonworks in 2018 in a deal worth $5 billion after both firms had ridden the wave of the big-data-on-Hadoop hype.
The merger coincided with the emergence of cloud-based object storage technologies such as AWS S3, Azure Blob Storage and Google Cloud Storage, which solve many of the same problems as the Hadoop Distributed File System (HDFS).
In September 2019, the company launched its Cloudera Data Platform (CDP) designed to produce an integrated approach to how organizations deploy, manage and consume data across on-premises, hybrid cloud and private cloud infrastructure.
While the cloud version of CDP was available in AWS, Google Cloud and Azure, CTO Ram Venkatesh told The Register it was a platform-as-a-service offering that Cloudera operated jointly with customers. CDP One, by contrast, is a fully managed service.
It does, however, enter a crowded market. Snowflake has been trying to bring together structured and unstructured data in its SaaS data platform, while Databricks — which shares Cloudera's Hadoop heritage — has brought SQL analytics to its data lake.
But one difference, Venkatesh said, is Cloudera's openness to giving customers choice over the tools they use to manage and analyse their data.
"The cardinal sin that was in previous attempts [at combining data lakes and data warehouses] the mapping was always tied to one engine. If it was built on Hive, then Spark would be a second-class citizen. If Spark came up with it, which is [Databricks'] Delta, it is not so great for Impala," he said.
But Venkatesh said Cloudera had eschewed this approach with the adoption of Apache Software Foundation’s Iceberg, which offers an open table format designed for high performance on big data workloads while supporting query engines including Spark, Trino, Flink, Presto, Hive and Impala.
"The middle layer, if it's independent, it's not tied to one master. It's been designed from the ground up to work with cloud storage, not just HDFS, on the bottom end, and on the top end, it is Spark, Hive, Impala and Presto, things Cloudera may not even support.
"When you have so much data under management, it's just hubris to think that one engine can solve it all," Venkatesh said.
CDP One is now available to customers that sign up and will be widely available later this year. ®