Databricks promises cheap cloud data warehousing at an eighth of the cost of rivals
Inertia of embedded BI and analytics a limiting factor, however
Databricks, the company born out of the Apache Spark boom, has let loose a raft of updates at its San Francisco conference, including an elastic compute option for analytics.
Databricks SQL Serverless, available in preview on AWS, has been designed to improve query performance and concurrency of BI and analytics workloads on messy data lake repositories.
The move is part of the company's plan to bring data lakes and data warehouses together on one system: the proverbial “lakehouse”, the coinage du jour achieving currency among vendors and commentators alike.
Databricks is also announcing an update to Photon, its query engine for lakehouse systems, making it available in Databricks Workspaces, the environment where users view their Databricks assets. It is releasing open source connectors for Go, Node.js, and Python to simplify access to a lakehouse from operational applications. Meanwhile, query federation in Databricks SQL is set to let users query data in PostgreSQL, MySQL, and AWS Redshift from Databricks.
Announced last year, Databricks SQL Serverless is designed to provide instant compute to users for their BI and SQL workloads. The company promised “minimal management” and capacity optimizations to lower overall cost by an average of 40 per cent.
Joel Minnick, Databricks marketing VP, told The Register that SQL Serverless will allow users coming into Databricks to “go from start up to query in three seconds”.
It also makes economic sense in the cloud, he said. “You truly only pay for what you use on a data warehouse and workloads. That is a real game-changer in terms of the cost of getting these workloads done.”
Minnick said the nearest cloud data warehouse rival would be eight times more expensive than Databricks for these types of workloads.
The serverless option would also help address the issue of user concurrency on analytics queries, where data lakes have attracted criticism.
Minnick said Databricks had already made progress on this issue, and that the vast majority of enterprise data warehousing concurrency needs would be met by Databricks SQL.
Hyoun Park, CEO and chief analyst at Amalgam Insights, says Databricks' Serverless SQL makes it easier to support large amounts of distributed data in a cost-effective manner. “It is a response to other vendors providing serverless SQL offerings such as Azure SQL or CockroachDB, but this should also allow Databricks customers to more easily support multi-region and hybrid multi-cloud environments. From a practical perspective, this move makes it easier for potential Databricks customers to use a lakehouse without the significant challenges of manual resource management that can potentially occur as data gets bigger and faster from many different sources to multiple different destinations.”
Park says Serverless SQL could also help users solve the concurrency challenge, but more work would be required to address it in the future.
“Realistically, this concurrency issue will probably require a bit of compromise: Databricks customers should ideally partition and structure data across multiple instances to avoid concurrency issues and Databricks will need to continue working on methods to accelerate queries, duplicate resources, cache results, and provide other resource and analytic workarounds as Databricks is used on smaller amounts of data.”
However, he cautions that while Databricks' claims on price-performance compared with rivals could stand up, the outcome would depend on a lot of different variables. “It's hard to tell if they're making an apples-to-apples comparison without license and hardware and labor specs. I'd advise users to do their own total cost of ownership analysis to support this claim.”
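To illustrate the kind of comparison Park is describing, here is a minimal Python sketch of a monthly total cost of ownership calculation. Every figure, field name, and platform label in it is a hypothetical assumption made for illustration; none comes from Databricks, its rivals, or Amalgam Insights.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Hypothetical monthly cost inputs for one warehouse option."""
    name: str
    compute_rate_per_hour: float  # assumed price per compute hour
    billed_hours: float           # hours actually billed per month
    license_per_month: float      # assumed flat licence or subscription fee
    admin_hours: float            # staff hours spent managing the platform
    labor_rate_per_hour: float    # assumed loaded staff cost per hour


def monthly_tco(p: Platform) -> float:
    """Sum compute, licence, and labor costs for one month."""
    compute = p.compute_rate_per_hour * p.billed_hours
    labor = p.admin_hours * p.labor_rate_per_hour
    return compute + p.license_per_month + labor


def cost_ratio(a: Platform, b: Platform) -> float:
    """How many times more expensive option a is than option b."""
    return monthly_tco(a) / monthly_tco(b)


# Illustrative inputs only: a pay-per-use option billed for 120 busy hours,
# and an always-on option billed for all 720 hours in a 30-day month.
serverless = Platform("serverless", 10.0, 120, 0.0, 10, 80.0)
provisioned = Platform("provisioned", 8.0, 720, 1040.0, 40, 80.0)

for option in (serverless, provisioned):
    print(f"{option.name}: {monthly_tco(option):,.2f} per month")
print(f"ratio: {cost_ratio(provisioned, serverless):.1f}x")
```

With these made-up inputs the always-on option comes out five times more expensive, but changing the billed hours or the admin effort moves the ratio substantially, which is why a headline multiple only holds for the specific workload it was measured on.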
Since it introduced the lakehouse concept in 2020, Databricks has seen some competition spring up.
In April, Google announced a preview on Google Cloud of BigLake, a data lake storage service that it claims can remove data limits by combining data lakes and data warehouses.
In February this year, tiny Californian startup Onehouse won $8m in seed funding in the hope of growing a business worthy of taking on the giants of data engineering. Its other aim is to make data lake projects faster, cheaper and easier than before. And not to be outdone, Snowflake too has announced support for unstructured data in its data warehouse platform.
Park said the challenge for Databricks is scaling quickly in areas where there is already massive competition. “For instance, shifting to build apps on Databricks when embedded BI and analytics have been around for decades is a significant process shift,” he said.
“Although the new generation of analytics is often reduced to a ‘Databricks vs. Snowflake’ matchup, this simplistic view ignores [that] the practical use cases for each vendor differ based on their historical approaches to data, open-source, analytic processing, and semi-structured data.
“Although enterprise data demands are pushing Databricks and Snowflake product maps closer together, Databricks is a platform fundamentally better suited to the future of real-time analytic data across multiple varieties of data structures and formats while Snowflake is well structured for fast adoption based on the current era of rapid use of structured data for data marts and warehouses,” he said.
Park said the market for software vendors addressing analytic data problems was in a “Cambrian explosion” phase. “As these shifts occur, it would not be surprising to see new vendors arise to take on both Databricks and Snowflake by solving problems associated with hybrid cloud, multi-modal data, networking, storage, compute, low-code application development, or administration,” he says. ®