The force is strong in Iceberg: Are the table format wars entering the final chapter?

Former Apple engineer and Apache PMC member Russell Spitzer describes efforts to unite around a single format

Interview In June, Databricks shelled out $1 billion for Tabular, a startup backer of the open source Apache Iceberg table format, signalling just how important the rather niche topic had become. It was a move which shocked the Iceberg community.

There were two reasons for this. First, Databricks, nominally worth $43 billion after $4 billion in VC funding, had been promoting its own table format, Delta, an open source project managed by the Linux Foundation. Second, Tabular was founded by the original authors of the rival format, Iceberg.

Iceberg had already become the basis of data engineering and analytics strategies in global tech and media companies, including Netflix and Apple, where Iceberg is said to be wall-to-wall.

Speaking to The Register, former Apple software engineering manager Russell Spitzer says the acquisition was very surprising.

"The valuation itself is so large. [Tabular] is such a small company. But it was also surprising in that Apache Iceberg is free. If you want to become a part of the Apache Iceberg community and drive its development, all you have to do is contribute code and engineering time. Spending that much money to contribute to Apache Iceberg is… well, it's a decision," says Spitzer, who joined Snowflake as principal engineer in June and remains an Apache Iceberg committer and PMC member.

The surprise acquisition also caused consternation in the Iceberg community. "Definitely some folks in the community were a little worried about it. I wasn't because the people I know who have gone to Databricks are true believers in the Apache way of doing things; true believers in open source. And it's one of those things where I just don't think they can be bought. There are people who believe in this project and want it to succeed. To me, it just feels like they're now getting paid an enormous amount of money to continue doing what they were doing anyway. So, well, we'll see," he says.

The Iceberg project was developed at Netflix from around 2015 by Ryan Blue and Dan Weeks, who went on to become co-founders of billion-dollar Iceberg company Tabular. It was donated to the Apache Software Foundation as an open source project in November 2018.

Apache Iceberg is an open table format designed for large-scale analytical workloads while supporting query engines including Spark, Trino, Flink, Presto, Hive and Impala. The idea is organizations can bring their analytics engine of choice to their data without going through the expense and inconvenience of moving it to a new data store. In 2022, Apache Iceberg won support from data warehouse and data lake big hitters including Google, Snowflake and Cloudera.

A doctor of bioinformatics, Spitzer started working on Iceberg not long after joining Apple from DataStax, the company behind the wide-column Cassandra database, where he had helped develop its Apache Spark connector. He had expected to continue his work with Spark, first developed by Databricks CTO and co-founder Matei Zaharia, but that changed quickly.

"I had a lot of friends who had gone to Apple, and they told me that Apple has a Spark team and Apple has a Cassandra team. So, I started applying to join them. Right as I was joining, the people who were hiring me told me about this Apache Iceberg thing that's just starting up and they needed someone who is into open source to join with them and give it a kick. I said yes, and the rest is pretty much history," he says.

Now, Iceberg is "huge" within Apple, Spitzer says. "When I joined the company, it was almost nowhere. Back then, no one knew what Iceberg was. You had to explain why it was something you wanted. Now, everyone knows what Iceberg is. Within Apple, a large majority of people are saying that's the table format we're going to choose to run with."

But Iceberg was not the only game in town. Databricks had developed its own table format, Delta Lake, with a similar aim to Iceberg. There is also Apache Hudi, which its backers say offers more than just a table format: ingest tools, a very different concurrency model, and indexes are also part of the package.

Spitzer says he first became aware of Delta at Spark Summit, when Databricks first announced it. "I was sitting there and thinking, 'This solves a lot of problems.' And then they said it's a closed-source product that's available with Databricks. I'm like, 'Oh, well, I guess we won't be working with that'."

Delta Lake 2.0 was donated to the Linux Foundation in the middle of 2022, but critics have argued it is too closely aligned with Databricks, despite the vendor protesting that it is controlled by the Linux Foundation.

It is a view once shared by Ryan Blue, who first helped develop Iceberg. Speaking to The Register in September 2023, he said Databricks had done a good job building Delta, but there were concerns about it "in terms of the neutrality of the format, and the ability for other players to really invest and get the most out of it, because it is so tightly controlled by Databricks."

However, following the $1 billion acquisition of the company he helped found, he has taken a more nuanced view.

Speaking on a recent Databricks webinar, Blue said the long-term plan was to converge Iceberg and Delta. "I'm actually excited about taking the ideas from Delta, the ideas from Iceberg, and converging to something that is actually better than both of them today, and I'm very happy to get to work on that. We know that's going to take a few years, but that's [a] long term vision."

In the meantime, it would be up to Databricks UniForm, designed to allow data stored in Delta to be read as if it were Apache Iceberg or Apache Hudi, to help interoperation between the two formats.

UniForm recently introduced the ability to read Delta tables with Iceberg reader clients. Data catalogs would also play a role, Blue said on the webinar. Databricks' Unity Catalog became open source under the wing of the Linux Foundation earlier this year.

On the webinar, Blue admitted the schism in table formats had created a problem for developers. "You have to choose which one to use. Engines and other frameworks have chosen to support one or not the other, and now we have just a bigger problem of inconsistencies, and that is preventing people from choosing a modern format at all, which is absolutely the worst thing [to do]."

Rather than converging the two formats, Snowflake's Spitzer says he hopes Iceberg will become the de facto standard. "What I hope happens, is that we're all using Apache Iceberg as a format underneath the hood, so we basically eliminate table format as a design point."

Spitzer says Snowflake's plan is to offer users an integrated data warehouse and analytics stack as a service, while developers wanting to create a more heterogeneous environment could bring Snowflake's analytics engine to their data wherever it is, using the Iceberg format. "That will be beneficial for Snowflake as well. That's why they hired me. Being able to operate with anybody's data without changing it is also a huge value to Snowflake," he says.

Although both sides of the so-called Table Format Wars are playing nicely, concerns about corporate influence remain.

Spitzer says vendors are lining up projects to contribute to Iceberg, increasing its chances of unifying the field.

"We'll probably very soon see some of the other groups starting their own Iceberg investments. Pretty much everyone is seeing that this is the future you can connect with. If you bid on Iceberg, you aren't going to get double-crossed sometime in the future. It's something that you can be a part of, that you can control. That gives a lot of people security in the future," he says. ®
