Databricks, the commercial company founded around the popular Apache Spark analytics engine, is making a play for new classes of workload and enterprise data management jobs in its make-or-break IPO year.
Hawking technology news from the company’s Data + AI Summit, CEO Ali Ghodsi spoke to The Register about the new technologies. Ghodsi said the firm's efforts to combine the order and SQL queries familiar from data warehousing with the schema-less architecture of data lakes would be pushed more aggressively against established vendors in the data management, analytics and data warehousing markets.
In 2018, when launching the unified analytics concept, Databricks promoted it to customers looking at the machine learning lifecycle, but this approach was not aggressive enough, Ghodsi said.
"We tiptoed around it to not upset the big vendors and the data warehouses. We knew we were sitting on kryptonite, and we were hiding it because we thought that it would be too upsetting for people, that it would be too competitive to everybody," he claimed.
Databricks was co-founded in 2013 by a team of academics who met at Berkeley, including computer scientist Matei Zaharia, who developed Spark as a PhD thesis in 2009 and later co-created the Apache Mesos cluster manager. Ghodsi remains an adjunct assistant professor at UC Berkeley.
In 2019, Databricks introduced Delta Lake, an open-source project designed to address data lake reliability and addressability issues, which had caused the unflattering description "data swamp" to gain traction.
Then in February last year, Databricks introduced the term "lakehouse" to the unsuspecting technology lexicon. This - you guessed it - was an effort to put across the idea it would combine the best of the data warehouse and data lake approach.
Fast-forward to November, and Databricks launched SQL Analytics, built on Delta Lake, Databricks’ open-format storage layer intended to bring order and performance to existing data lakes. It also uses Delta Engine, a “polymorphic query execution engine” that re-implements Spark's Scala-based execution in C++ to take advantage of vectorisation. Within Delta Engine, Databricks introduced the proprietary Photon, a Spark-compatible execution engine designed to accelerate Spark SQL workloads.
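The performance argument for a native, vectorised engine can be shown in miniature: instead of interpreting one row at a time, a vectorised engine processes a whole column (or batch) per operation, amortising per-record overhead. The following toy Python sketch only illustrates the two execution styles; Photon itself is a C++ engine and none of these names are Databricks code:

```python
# Toy contrast between row-at-a-time and vectorised (columnar) execution.
# Illustrative only -- real vectorised engines work on typed column
# batches in native code, not Python lists.

rows = [{"price": p, "qty": q} for p, q in [(10.0, 2), (4.0, 5), (2.5, 4)]]

def revenue_row_at_a_time(rows):
    """Interpret one record per step: overhead is paid once per row."""
    total = 0.0
    for r in rows:
        total += r["price"] * r["qty"]
    return total

def revenue_vectorised(rows):
    """Operate on whole columns: overhead is paid once per batch."""
    prices = [r["price"] for r in rows]   # columnar layout
    qtys = [r["qty"] for r in rows]
    return sum(p * q for p, q in zip(prices, qtys))

assert revenue_row_at_a_time(rows) == revenue_vectorised(rows) == 50.0
```

In Python the two run at similar speed; the point is the columnar data layout, which a compiled engine can turn into SIMD instructions over contiguous memory.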
“We already had it from the beginning: we called it Unified Analytics. Basically, unify all your analytics: advanced analytics, all the way down to basic analytics: SQL. But it's very hidden and people didn't know it. Lakehouse is sort of the same thing but now it's in your face: data lake, plus data warehouses, combined together. Lake for AI; warehouse for BI, you get the best of the AI and BI in one platform, one copy of the data in an open platform,” Ghodsi said.
Into the mix, Databricks last week added Delta Live Tables, aimed at easing ETL, a common use for Spark, by “abstracting away the low-level instructions, removing many potential sources of error,” Databricks said.
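The declarative idea behind this, that you declare what each table should contain and let the system derive the execution order, can be sketched in a few lines of plain Python. This toy registry is not the Delta Live Tables API; all names here are hypothetical and it only illustrates the programming model:

```python
# Toy declarative pipeline: tables are declared as functions plus their
# dependencies, and a tiny framework works out the run order.
# Illustrative only -- not the Delta Live Tables API.

TABLES = {}  # name -> (dependency names, builder function)

def table(name, depends_on=()):
    """Register a table definition instead of running it imperatively."""
    def register(fn):
        TABLES[name] = (tuple(depends_on), fn)
        return fn
    return register

@table("raw_orders")
def raw_orders():
    return [{"order": 1, "amount": 20}, {"order": 2, "amount": 30}]

@table("daily_revenue", depends_on=["raw_orders"])
def daily_revenue(raw_orders):
    return sum(r["amount"] for r in raw_orders)

def run(name, cache=None):
    """Resolve dependencies recursively, building each table once."""
    cache = {} if cache is None else cache
    if name not in cache:
        deps, fn = TABLES[name]
        cache[name] = fn(*(run(d, cache) for d in deps))
    return cache[name]

assert run("daily_revenue") == 50
```

Because the user only declares tables and dependencies, the "underlying code" that sequences, retries, and parallelises the pipeline can be system-generated, which is the error-reduction argument Databricks is making.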
Meanwhile, Unity Catalog, which uses industry-standard ANSI SQL, is designed to offer one interface for accessing both structured and unstructured data across all cloud data lakes, in a bid to help users get a single view of their data on the Databricks Lakehouse Platform.
Databricks has launched an open-source project called Delta Sharing, which will be donated to the Linux Foundation. Databricks claims it is the world’s first open protocol for securely sharing data across organizations in real-time, completely independent of the platform on which the data resides. It is supported by AWS, Google Cloud, and BI and visualisation firm Tableau.
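The concept of an open sharing protocol is that a recipient fetches data over plain HTTP using short-lived links handed out by the provider, rather than needing an account on the provider's platform. The sketch below is a toy illustration of that flow with entirely hypothetical names; it is not the Delta Sharing wire format:

```python
# Toy illustration of platform-independent data sharing: the provider
# answers requests with expiring links to raw data files; the recipient
# needs only an HTTP client, not an account on the provider's platform.
# Hypothetical names -- not the real Delta Sharing protocol.
import time

DATA_FILES = {"sales.parquet": b"...parquet bytes..."}

def provider_list_files(share_token, now=None):
    """Return short-lived 'signed' URLs for the files in a share."""
    now = time.time() if now is None else now
    expiry = now + 300  # links valid for five minutes
    return [{"url": f"https://files.example/{name}?expires={int(expiry)}",
             "expires": expiry}
            for name in DATA_FILES]

def recipient_fetch(files, now=None):
    """Any client that can follow a URL can read the shared data."""
    now = time.time() if now is None else now
    return [f["url"] for f in files if f["expires"] > now]

links = recipient_fetch(provider_list_files("token-123"))
assert len(links) == 1 and links[0].startswith("https://files.example/")
```

Keeping the exchange to tokens and URLs is what makes the scheme independent of where the data physically resides, which is the platform-neutrality claim being made here.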
Sanjeev Mohan, Gartner veep and analyst, said Delta Live Tables was the “crowning jewel” for Databricks. “It makes the process of creating reliable data pipelines a declarative one – like SQL. You specify the destination and don’t worry about the underlying code which is system generated,” he said.
Unity Catalog was also a good move from Databricks, as “data catalogues have been all the rage for many years.” The vendor “has a very ambitious roadmap to enhance its functionality,” Mohan said.
Delta Sharing was interesting, he added, because most existing data sharing technologies require the user to have an account on the platform before they can partake in its capabilities. “Databricks’ Delta Share removes that requirement,” the Gartner veep said.
While products from Microsoft and Google are also aimed at unifying the world of data lakes and warehouses, there are differences in emphasis, he said. “Some products are aimed at the data analyst persona but Databricks is aimed at helping data engineers deliver faster and more reliably,” he said.
IDC: Not so easy to peel users away from other vendors
But Philip Carnelley, associate veep, software research at IDC Europe, said that by sharing performance data on features like concurrency, an area where established data warehousing firms play well, Databricks was trying to convince the market it was a serious player. But there was still inertia in favour of incumbent vendors.
“If you’ve been using something like Teradata for 30 years and you know it works, then it is important, you’re not going to move off it lightly,” the analyst said.
While Databricks said users could add capacity in the cloud, that always comes at a cost. “It is cost-performance, not just performance, that is interesting here. I think that the Teradatas of this world can give assurances there because there is a lot of experience [in optimisation] that they can draw on,” he said.
Ghodsi told The Register it was the company’s aim to be “IPO-ready” this year. In the build-up to the big day, the company took a $1bn investment round in February, with AWS, Microsoft, Google, Andreessen Horowitz (Netscape founder Marc Andreessen's VC firm), and Salesforce Ventures chipping in. The splurge took the nominal value of the fledgling firm to $28bn.
Databricks is obviously hoping the narrative which sees it spreading its wings from its data lake home to more general analytics and BI technology is a story that will fly with the market. ®