Industry reacts to DuckDB's radical rethink of Lakehouse architecture

Excitement over DuckLake, but momentum is with Iceberg as players at AWS, Snowflake weigh in

It's been a year since Databricks bought Tabular for $1 billion, livening up the sleepy world of table formats.

The data lake company, with its origins in Apache Spark, had created the Delta Lake table format to let users bring query engines to data outside its systems. With the Tabular acquisition, it bought the creators of rival format Iceberg, which was developed at Netflix by Ryan Blue and Dan Weeks, Tabular's co-founders, and donated to the Apache Software Foundation as an open source project in November 2018.

But six years on, just as the Databricks-Tabular deal was prompting signs of convergence between the two formats, fledgling database DuckDB proposed an alternative architecture, drawing excitement, curiosity, and wariness from other members of the community.

Speaking to The Register, AWS veep and distinguished engineer Andy Warfield said the cloud giant's engineering team was "super excited" about the announcement. "It was passed around broadly across the teams, and people have been playing with it. It's captured people's imaginations for sure," he said.

As El Reg explained last week, DuckDB, which launched an in-process analytics database in 2022, has proposed its own table format, DuckLake, and an extension to DuckDB that allows it to act as a client-server data warehouse or data lake system on a single set of data — in S3 or other blob storage. It also proposed a database to manage and store metadata, as opposed to Delta Lake and Iceberg, which don't employ such a database.
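For readers who want to see the shape of it, here is a minimal sketch using DuckDB's Python client and the ducklake extension. The file and directory names are illustrative, and the exact ATTACH options may vary between releases; pointing DATA_PATH at S3 additionally requires the httpfs extension and credentials.

```python
import duckdb

con = duckdb.connect()

# Load the DuckLake extension.
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Attach a DuckLake catalog: table metadata lives in a database file,
# while table data is written as Parquet files under DATA_PATH
# (a local directory here; blob storage such as S3 also works).
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")

# From here it behaves like an ordinary SQL catalog.
con.sql("CREATE TABLE lake.events (id INTEGER, payload VARCHAR)")
con.sql("INSERT INTO lake.events VALUES (1, 'hello')")
print(con.sql("SELECT * FROM lake.events").fetchall())
```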

There has been a groundswell of interest from the data engineering and analytics community in DuckDB as a single-host notebook client with a "really nice query API and data visualizer," for example, Warfield said.

DuckLake shows the DuckDB team — the format is open source but supported by the company DuckDB Labs — has correctly observed some weaknesses in current implementations of Open Table Formats (OTFs) like Delta Lake and Iceberg.

"When you start with that serialized, persistent format, it leaves out a lot of the performance focus that would exist in an I/O layer for a database… they've just observed that all of those things are kind of absent, and with the existing data path for Iceberg and the other OFTs, you end up doing a lot of round trips to storage, which is potentially quite expensive from a performance perspective. What DuckLake does is replace that full metadata management layer with a database schema and a database back end," he said.

However, there have already been moves within the Iceberg, S3, and most likely other OTF communities to address the same problems, AWS's Warfield added.

"They're really focused on addressing the performance challenges at that layer, basically moving from the sort of persistent on disk definition of those tables to something that's like much higher performance. A lot of the things that the Duck folks mentioned in terms of performance gaps are addressable with some of the proposed APIs in Iceberg, like the scan API, and they're also addressable through either really aggressive client-side caching, which DuckDB is [already] doing for Iceberg.

"We'll see the other OFTs move to faster and better performance by growing a mid-layer, probably evolving their APIs. DuckDB jumped ahead in terms of providing a really interesting demonstration of what's possible there," he said.

Others were more skeptical. Jake Ye, an AWS veteran and software engineer at AI database company LanceDB, blogged that while DuckDB proposes a SQL database for metadata, the industry has been "increasingly consolidated around JSON-based protocols as the foundation for interoperability."

"This is evident not just in the catalog space with standards like Iceberg REST Catalog (IRC), Polaris, Gravitino, Unity, but also in the AI domain with MCP and A2A. That was also a key design decision behind the Lance Namespace spec. While defining a spec in SQL is an interesting idea, it poses real adoption challenges without good structured extensibility, versioning and transport-layer separation," he wrote in a LinkedIn post.

Baked into projects

Russell Spitzer, principal engineer at Snowflake, told us many projects were "pretty far along the road with Iceberg," which the cloud data warehouse, data lake, and analytics company has been backing since the table format's early days.

He also said DuckDB's proposals address problems the Iceberg community is already working on. "The actual storage of the metadata, to me, is an implementation detail, and whether you store it in the file system, or you store it in a catalog or something like that, or a relational data store, is not as important as the APIs you use to interact with it."

Important here is the REST spec, which "behind the scenes" can keep information about the metadata and where the files are. "All that stuff can be held in whatever system you want or cached in-memory. It doesn't even have to be part of some relational system. You can have a cache layer that's completely independent of relational semantics," he said.

Meanwhile, Spitzer, a former software engineer at Apple working on Iceberg projects, said he was concerned about the generality of SQL for the purpose of handling metadata.

"It's basically allowing folks direct access to the underlying persistence layer. In Iceberg, if you had built a system where in order to do a commit, you would basically go in and write your own files by hand, rather than going through an established SDK. In SQL, you can do anything. You can go in and modify any cell of any row at any time, but doing so might not actually be transactional in the semantics of your lakehouse. You want to stop users from being able to do just about anything," he said.

Iceberg, meanwhile, is not standing still. Released earlier this month, v3 promises new variant type support, which has involved collaboration between major players in the Iceberg community, said Spitzer. "We're going to have users who no longer really have full schema restrictions. This is really great for IoT users or people who have data sources they can't control. You can imagine you're working with a sensor data system, and the sensors got upgraded outside of your control, and now, all of a sudden, there's a different field. With a variant data type you can do that without changing the schema of your table," he said.
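A sketch of that sensor scenario, using DuckDB's JSON type as a stand-in for Iceberg v3's variant type (engine syntax for variant columns is still settling): the table schema stays fixed while one column absorbs fields that appear without warning.

```python
import duckdb

con = duckdb.connect()

# JSON stands in for VARIANT here: a fixed schema with one column
# that absorbs fields you don't control.
con.sql("CREATE TABLE sensor_events (sensor_id VARCHAR, payload JSON)")

con.sql("""
    INSERT INTO sensor_events VALUES
        ('a1', '{"temp_c": 21.4}'),
        -- a firmware upgrade added a field; no ALTER TABLE needed:
        ('a1', '{"temp_c": 21.7, "humidity_pct": 43.0}')
""")

# Path expressions pull out fields that may or may not exist.
print(con.sql("""
    SELECT sensor_id, payload->>'humidity_pct' AS humidity
    FROM sensor_events
""").fetchall())
```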

With multi-billion-dollar revenue organizations such as AWS, Snowflake, and even Databricks trying to steer the future of Iceberg, DuckDB and its DuckLake are going to be paddling furiously to build momentum of their own. ®
