Apache Iceberg promises to change the economics of cloud-based data analytics
Adopted by Snowflake, Google and Cloudera, we look at why the Netflix-developed table format is important
Feature By 2015, Netflix had completed its move from an on-premises data warehouse and analytics stack to one based around AWS S3 object storage. But the environment soon began to hit some snags.
"Let me tell you a little bit about Hive tables and our love/hate relationship with them," said Ted Gooch, former database architect at the streaming service.
While there were some good things about Hive, there were also some performance-based issues and "some very surprising behaviors."
"Because it's not a heterogeneous format or a format that's well defined, different engines supported things in different ways," Gooch – now a software engineer at Stripe and an Iceberg committer – said in an online video posted by data lake company Dremio.
Out of these performance and usability challenges inherent in Apache Hive tables in large and demanding data lake environments, the Netflix data team developed a specification for Iceberg, a table format for slow-moving data or slow-evolving data, as Gooch put it. The project was developed at Netflix by Ryan Blue and Dan Weeks, now co-founders of Iceberg company Tabular, and was donated to the Apache Software Foundation as an open source project in November 2018.
Apache Iceberg is an open table format designed for large-scale analytical workloads while supporting query engines including Spark, Trino, Flink, Presto, Hive and Impala. The move promises to help organizations bring their analytics engine of choice to their data without going through the expense and inconvenience of moving it to a new data store. It has also won support from data warehouse and data lake big hitters including Google, Snowflake and Cloudera.
Iceberg in the data lake
Cloud-based blob storage like AWS S3 does not have a way of showing the relationships between files or between a file and a table. As well as making life tough for query engines, it makes changing schemas and time travel difficult. Iceberg sits in the middle of what is a big and growing market. Data lakes alone were estimated to be worth $11.7 billion in 2021, forecast to grow to $61.07 billion by 2029.
"If you're looking at Iceberg from a data lake background, its features are impressive: queries can time travel, transactions are safe so queries never lie, partitioning (data layout) is automatic and can be updated, schema evolution is reliable – no more zombie data! – and a lot more," Blue explained in a blog.
But it also has implications for data warehouses, he said. "Iceberg was built on the assumption that there is no single query layer. Instead, many different processes all use the same underlying data and coordinate through the table format along with a very lightweight catalog. Iceberg enables direct data access needed by all of these use cases and, uniquely, does it without compromising the SQL behavior of data warehouses."
In October, BigLake, Google Cloud's data lake storage engine, began support for Apache Iceberg, with Databricks format Delta and Hudi streaming set to come soon.
Speaking to The Register, Sudhir Hasbe, senior director of product management at Google Cloud, said: "If you're doing fine-grained access control, you need to have a real table format, [analytics engine] Spark is not enough for that. We had some discussion around whether we are going with Iceberg, Delta or Hudi, and our prioritization was based on customer feedback. Some of our largest customers were basically deciding in the same realm and they wanted to have something that was really open, driven by the community and so on. Snap [social media company] is one of our early customers, all their analytics is [on Google Cloud] and they wanted to push us towards Iceberg over other formats."
Iceberg, Hudi and Delta
He said Iceberg was becoming the "primary format," although Google is committed to supporting Hudi and Delta in the future. He noted Cloudera and Snowflake were now supporting Iceberg while Google has a partnership with Salesforce over the Iceberg table format.
Cloudera started in 2008 as a data lake company based on Hadoop, which in its early days was run on distributed commodity systems on-premises, with a gradual shift to cloud hosting coming later.
Today, Cloudera sees itself as a multi-cloud data lake platform, and in July it announced its adoption of the Iceberg open table format.
Chris Royles, Cloudera's Field CTO, told The Register that since it was first developed, Iceberg had seen steady adoption as the contributions grew from a number of different organizations, but vendor interest has begun to ramp up over the last year.
"It has lots of capability, but it's very simple," he said. "It's a client library: you can integrate it with any number of client applications, and they can become capable of managing Iceberg table format. It enables us to think in terms of how different clients both within the Cloudera ecosystem, and outside it – the likes of Google or Snowflake – could interact with the same data. Your data is open. It's in a standard format. You can determine how to manage, secure and own it. You can also bring whichever tools you choose to bear on that data."
The result is a reduction in the cost of moving data, and improved throughput and performance, Royles said. "The sheer volume of data you can manage, the number of data objects you can manage and the complexity of the partitioning: it's a multiplication factor. You're talking five or 10 times more capable by using Iceberg as a table format."
Snowflake kicked off as a data warehouse, wowing investors with its so-called cloud-native approach to separating storage and compute, allowing a more elastic method than on-prem-based data warehousing. Since its 2020 IPO – which briefly saw it hit a value of $120 billion – the company has diversified as a cloud-based data platform, supporting unstructured data, machine learning language Python, transactional data and most recently Apache Iceberg.
The Snowflake effect
James Malone, Snowflake senior product manager, told El Reg that cloud blob storage such as that offered by AWS, Google and Azure is durable and inexpensive, but could present challenges when it comes to performance analytics.
"The canonical example is if you have 1,000 Apache Parquet files, if you have an engine that's operating on those files, you have to go tell it if these are 1,000 tables with one Parquet file apiece or if it is two tables with 500 Parquet files … it doesn't know," he said. "The problem is even more complex when you have multiple engines operating on the same set of data and then you want things like ACID-compliance and like safe data types. It becomes a huge, complicated mess. As cheap durable cloud storage has proliferated it has also put downward pressure on the problem of figuring out how to do high-performance analytics on top of that. People like the durability and the cost-effectiveness of storage, but there's also a set of expectations and a set of desires in terms of how engines can work and how you can derive value from that data."
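Malone's example can be made concrete with a toy model. The Python sketch below is not Iceberg's implementation or API; the names ToyTable, Snapshot, commit and files are hypothetical, and the object paths are made up. It only illustrates the general idea that a metadata layer, rather than the object store, records which files belong to which table, and that keeping old snapshots around is what makes time travel possible.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable list of the data files that make up a table at one point in time."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical table metadata: an ordered history of snapshots."""
    name: str
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, data_files: List[str]) -> Snapshot:
        """Record a new snapshot; earlier snapshots are left untouched."""
        snapshot = Snapshot(len(self.snapshots) + 1, tuple(data_files))
        self.snapshots.append(snapshot)
        return snapshot

    def files(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Return files for the latest snapshot, or for an older one (time travel)."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.data_files
        raise KeyError(f"no snapshot {snapshot_id} in table {self.name}")


# 1,000 object names which, on their own, say nothing about tables.
objects = [f"data/part-{i:04d}.parquet" for i in range(1000)]

# The metadata, not the storage layer, says this is two tables of 500 files each.
orders = ToyTable("orders")
customers = ToyTable("customers")
orders.commit(objects[:500])
customers.commit(objects[500:])

# A later commit drops one file from "orders"; the first snapshot stays readable.
orders.commit(objects[1:500])

print(len(orders.files()))               # files in the current snapshot
print(len(orders.files(snapshot_id=1)))  # files as of the first commit
print(len(customers.files()))
```

Any engine that reads the same metadata would see the same two tables, which is the coordination problem the quote describes.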
Snowflake supports the idea that Iceberg is agnostic both in terms of the file format and analytics engine. For a cloud-based data platform with a steadily expanding user base, this represents a significant shift in how customers will interact with and, crucially, pay for Snowflake.
The first and smallest move is the idea of external tables. When files are imported into an external table, metadata about the files is saved and a schema is applied on read when a query is run on a table. "That allows you to project a table on top of a set of data that's managed by some other system, so maybe I do have a Hadoop cluster, I have a meta store, and that system owns the security, it owns the updates, it owns the transactional safety," Malone said. "External tables are really good for situations like that, because it allows you to not only query the data in Snowflake, but you can also use our data sharing and governance tools."
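For readers unfamiliar with schema on read, here is a minimal Python sketch of the general idea, not of Snowflake's external tables. The data, the SCHEMA list and the read_with_schema function are all made up for illustration: the raw file stays as it is, and column names and types are applied only when a query reads it.

```python
import csv
import io

# Raw file content as it might sit in storage owned by another system (made-up data).
RAW_FILE = "1,alice,19.99\n2,bob,5.00\n"

# Hypothetical schema kept as metadata alongside the file, not inside it.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Parse raw rows and cast each field only at query time (schema on read)."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows


# The "query": types are applied here, when the data is read, not when it was stored.
rows = read_with_schema(RAW_FILE, SCHEMA)
total = sum(row["amount"] for row in rows)
print(rows[0]["name"], total)
```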
But the bigger move from Snowflake, currently only available in preview, is its plan to build a brand-new table type inside of Snowflake. It is set to have parity in terms of features and performance with a standard Snowflake table, but uses Parquet as the data format, and Iceberg as the metadata format. Crucially, it allows customers to bring their own storage to Snowflake instead of Snowflake managing the storage for them, perhaps a significant cost in the analytics setup. "Traditionally with the standard Snowflake table, Snowflake provides the cloud storage. With an Iceberg table, it's the customer that provides the cloud storage and that's a huge shift," Malone said.
The move promises to give customers the option of taking advantage of volume discounts negotiated with blob storage providers across all their storage, or negotiating new deals based on demand, and paying Snowflake only for the technology it provides in terms of analytics, governance, security and so on.
"The reality is, customers have a lot of data storage and telling people to go move and load data into your system creates friction for them to actually go use your product and is not generally a value add for the customer," Malone said. "So we've built Iceberg tables in a way where our platform benefits work, without customers having to go through the process of loading data into Snowflake. It meets the customer where they are and still provides all of the benefits."
But Iceberg does not only affect the data warehouse market, it also has an impact on data lakes and the emerging lakehouse category, which claims to be a useful combination of the data warehouse and lake concepts. Founded in 2015, Dremio places itself in the lakehouse category also espoused by Databricks and tiny Californian startup Onehouse.
Dremio was the first tech vendor to really start evangelizing Iceberg, according to co-founder and chief product officer Tomer Shiran. Unlike Snowflake and other data warehouse vendors, Dremio has always advocated an open data architecture, using Iceberg to bring analytics to the data, rather than the other way around, he said. "The world is moving in our direction. All the big tech companies have been built on an open data architecture and now the leading banks are moving with them."
Shiran said the difference with Dremio's incorporation of Iceberg is that the company has used the table format to design a platform to support concurrent production workloads, in the same way as traditional data warehouses, while offering users the flexibility to access data where they have it, based on a business-level UI, rather than the approach of Databricks, for example, which is more designed with data scientists in mind.
Open source project
While Databricks supports both its own Delta table standard and Iceberg, Shiran argues that Iceberg's breadth of support will help it win out in the long run.
"Neither is going away," Shiran said. "Our own query engine supports both table formats, but Iceberg is vendor agnostic and Apache marshals contributions from dozens of companies including Netflix, Apple and Amazon. You can see how diverse it is but with Delta, although it is technically open source, Databricks is the sole contributor."
However, Databricks disputes this line. Speaking to The Register in November, CEO and co-founder Ali Ghodsi said there were multiple ways to justify Delta Lake as an open source project. "It's a Linux Foundation. We contribute a lot to it, but its governance structure is in Linux Foundation. And then there's the Iceberg and Hudi, which are both Apache projects."
Ghodsi argued the three table formats – Iceberg, Hudi and Delta – were similar and all were likely to be adopted across the board by the majority of vendors. But the lakehouse concept distinguishes Databricks from the data warehouse vendors even as they make efforts to adopt these formats.
"The data warehousing engines all say they support Iceberg, Hudi and Delta, but they're not optimized really for this," he said. "They're not incentivized to do it well either because if they do that well, then their own revenue will be cannibalized: you don't need to pay any more for storing the data inside the data warehouse. A lot of this is, frankly speaking, marketing by a lot of vendors to check a box. We're excited that the lakehouse actually is taking off. And we believe that the future will be lakehouse-first. Vendors like Databricks, like Starburst, like Dremio will be the way people want to use this."
Nonetheless, database vendor Teradata has eschewed the lakehouse concept. Speaking to The Register in October, CTO Stephen Brobst argued that a data lake and data warehouse should be discrete concepts within a coherent data architecture. The argument plays to the vendor's historic strengths in query optimization and supporting thousands of concurrent users in analytics implementations which include some of the world's largest banks and retailers.
Hyoun Park, CEO and chief analyst at Amalgam Insights, said most vendors are likely to support all three table formats – Iceberg, Delta and Hudi – in some form or other, but Snowflake's move with Iceberg is the most significant because it represents a departure for the data warehouse firm in terms of its cost model, but also how it can be deployed.
"It's going to continue to be a three-platform race, at least for the next couple of years, because Hudi benchmarks as being slower than the other two platforms but provides more flexibility in how you can use the data, how you can read the data, how you can ingest the data. Delta Lake versus Iceberg tends to be more of a commercial decision because of the way that the vendors have supported this: basically, Databricks on one side and everybody else on the other," he said.
But when it comes to Snowflake, the argument takes a new dimension. Although Iceberg promises to extend the application of the data warehouse vendor's analytics engine beyond its environment – potentially reducing the cost inherent in moving data – that will come at a price: the very qualities that made Snowflake so appealing in the first place, Park said.
"You're now managing two technologies rather than simply managing your data warehouse, which is the appeal of Snowflake," he said. "Snowflake is very easy to get started as a data warehouse. And that ease of use is the kind of that first hit, that drug-like experience, that gets Snowflake started within the enterprise. And then because Snowflake's pricing is so linked to data use, companies quickly find that as their data grows 50, 60, 70, or 100 percent per year, their Snowflake bills increase just as quickly. Using Iceberg tables is going to be a way to cut some of those costs, but it comes at the price of losing the ease of use that Snowflake has provided."
Apache Iceberg surfaced in 2022 as a technology to watch to help solve problems in data integration, management and costs. Deniz Parmaksız, machine learning engineer with customer experience platform Insider, recently claimed it cut their Amazon S3 cost by 90 percent.
While major players – including Google, Snowflake, Databricks, Dremio and Cloudera – have set out their stall on Iceberg, AWS and Azure have been more cautious. With Amazon Athena, the serverless analytics service, users can query Iceberg data. But on Azure, ingestion from data storage systems that provide ACID functionality on top of regular Parquet format files – such as Iceberg, Hudi and Delta Lake – is not supported. Microsoft has been contacted for clarity on its approach. Nonetheless, in 2023, expect to see more news on the emerging data format which promises to shake up the burgeoning market for cloud data analytics. ®
Updated to add on 13 January 2023:
A Microsoft spokesperson told The Reg: "The Spark services in Azure available in Azure Databricks and Azure Synapse Analytics support analyzing data from Delta, Iceberg and Hudi sources using the standard libraries."