Lance takes aim at Parquet in file format joust

Challenger seeks to unseat incumbent for machine learning workloads

A fledgling file format that aims to address limitations in the widely used Parquet is under review for adoption by an open source foundation.

Lance is built on the idea that Parquet – widely used in AWS, Azure, and Google data lakes – shows its age when it comes to machine learning and AI, and an additional, complementary format better suits those requirements.

Behind the format is Chang She, one of the original contributors to the pandas software library for data manipulation and analysis, who is now CEO and co-founder of LanceDB, which supports and develops the format.

"In 2022, we had our first Lance 0.01 release, we were widely seen as a little bit crazy for suggesting that there was a better alternative to Parquet. Certainly, the world has changed since then," She said.

The turning point, according to She, came when AI and machine learning started driving more data use than traditional analytics.

With everyone now able to tap into models from OpenAI or Anthropic, the real advantage lies in how quickly those systems can get at your data.

"The core question that we wanted to ask is, what is the relationship between AI and data. Advanced agent techniques get spread out pretty fast. If you're an enterprise, what really differentiates your AI from that of your competitors is data," She said.

However, the challenges of accessing data for machine learning inference differ from those of using it for analytics.

"The velocity is much faster, because now a lot of this data is just being generated by the model, and you're looking at hundreds of tokens per second of automatic data generation. Then there is variety: instead of just numbers and timestamps, now you have long text prompts, images, audio waves, and, the [vector] embeddings themselves," She said.

The incumbent file format was not designed to meet these requirements, he argued.

"Parquet is terrible for storing larger data types," She said. "If you have multimodal data, anything from like long text to embeddings to images and videos, Parquet is not optimized at all for this new kind of data. This is because it has row groups and because of the way the data is laid out. You'll run out of memory when you're trying to write large scale."

AI also introduces new workloads such as vector search and retrieval. She said Parquet is "really terrible for search and retrieval" because those workloads require random access, unlike analytics, "where you're reading continuous ranges of that data."
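
To put rough numbers on those two complaints, here is a back-of-the-envelope sketch. It is a toy model, not a description of how Parquet or Lance actually work: it assumes a writer that buffers a whole row group in memory before flushing, and a reader that must decode an entire chunk to return any row inside it. The function names, the row group size, the chunk size, and the 1,536-dimension embedding are all hypothetical values chosen for illustration.

```python
# Toy model only: these numbers are illustrative assumptions, not measurements
# of Parquet, Lance, or any real reader or writer.

def row_group_buffer_bytes(rows_per_group, avg_value_bytes):
    """Bytes held for one column if a writer buffers a full row group before flushing."""
    return rows_per_group * avg_value_bytes

def values_read_chunked(row_ids, chunk_rows):
    """Values read if fetching any row means decoding the whole chunk that holds it."""
    chunks = {row_id // chunk_rows for row_id in row_ids}
    return len(chunks) * chunk_rows

ROWS_PER_GROUP = 1_000_000   # hypothetical row group size, in rows
CHUNK_ROWS = 10_000          # hypothetical smallest decodable unit, in rows

# Write side: the same row count costs far more memory once values get large.
int64_buffer = row_group_buffer_bytes(ROWS_PER_GROUP, 8)
embedding_buffer = row_group_buffer_bytes(ROWS_PER_GROUP, 1536 * 4)  # 1,536 float32 values

# Read side: scattered lookups versus one contiguous range of rows.
lookups = [7, 123_456, 500_000, 999_999]
scan = range(200_000, 300_000)
lookup_amplification = values_read_chunked(lookups, CHUNK_ROWS) / len(lookups)
scan_amplification = values_read_chunked(scan, CHUNK_ROWS) / len(scan)

print(f"int64 column buffer:     {int64_buffer / 1e6:.0f} MB")
print(f"embedding column buffer: {embedding_buffer / 1e9:.1f} GB")
print(f"read amplification, 4 random lookups: {lookup_amplification:,.0f}x")
print(f"read amplification, contiguous scan:  {scan_amplification:.0f}x")
```

Under these assumptions the embedding column needs roughly 6.1 GB of buffer against 8 MB for plain integers, and the four scattered lookups read 10,000 times more values than they return, while the contiguous scan reads nothing it does not need. Real readers and writers have ways to soften both effects, so treat the figures as a sense of scale rather than a benchmark.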

Version 2.1 of the Lance file format was announced in March, and LanceDB said earlier this month that it was now stable.

The Lance format is better adapted to the challenges of storing data for machine learning and AI because it includes a file format, a table format, and secondary indexes, its authors argue.

"The data is laid out differently, and the access patterns are changed, so that we guarantee both faster scans than Parquet, and we also guarantee really fast random access," She said.

Lance was open sourced in August 2022 and the company is in the process of donating it to a foundation, with an announcement expected by the end of the year.

Parquet has its own table format partners. Apache Iceberg, Delta Lake (a Linux Foundation project), and Apache Hudi are all used to bring analytics engines to data without having to move it. There have been recent moves to bring Iceberg and Delta closer together.

She argued that Lance was not an effort to replace these formats, but to work alongside them.

"Our motto is Lance for AI and Iceberg for BI. Analytical workloads, we would still expect that to be stored in Iceberg, but with Lance we would expect the use cases and data sets where it's AI heavy: search, training and AI inference would be in Lance," She said.

Projects like Iceberg and Parquet have the advantage of momentum, though, as Matthew Mullins, CTO of data operations platform vendor Coginiti, points out.

"Parquet and Iceberg have the advantage of incumbency and broad support. Apache Iceberg is really just about two years into really breaking out after a decade of development. A big key to that was Snowflake and Databricks both going all in on Iceberg, now every vendor has an Apache Iceberg story and it's on every enterprise roadmap. LanceDB will have a long road, maybe accelerated by AI, but it's going to need more community support to be successful."

Iceberg was itself once a fledgling project, before it won backing from vendors such as Snowflake, Google, Cloudera, and AWS (which has incorporated the table format in its S3 Tables storage bucket) and became widespread among users including Apple and Netflix. Lance will have to wait to see whether its arguments hit home in the same way. ®
