Databricks: Ugh, just look at that messy data lake environment. Squints. You know... we could sort that out with a sweet shot of SQL
Data-wrangler previews another lakehouse concept tool
Data management and machine learning framework biz Databricks is launching a tool it has claimed will bring SQL-style analytics to the messy world of data lakes.
SQL Analytics, the company claimed, expands the traditional scope of the data lake from data science and machine learning to include all data workloads including business intelligence and SQL. It is available for preview this week.
The tool is a manifestation of the company’s lakehouse concept, which, you’ve guessed it, is an attempt to bring some of the governance, performance and order from the data warehouse world to the wild and messy world of data lakes, which have the advantage of being able to ingest unstructured data quickly.
Speaking to The Register, Joel Minnick, Databricks product marketing veep, said: “Despite it being a little bit of a whimsical name for an architecture, lakehouse is probably the best way to articulate what the architecture is.”
SQL Analytics is built on Delta Lake, Databricks’ open-format data engine, which is supposed to help bring order and performance to existing data lakes. It also uses Delta Engine, a “polymorphic query execution engine,” which rewrites Spark into C++ to take advantage of vectorisation, Minnick said. Apache Spark is written in Scala.
The idea, said Minnick, is that it allows users to auto-scale clusters that are structured to be high-performance SQL analytics clusters, which in turn is supposed to allow organisations to handle high user concurrency (many logged-in users) “behind the scenes”.
Databricks had also “done some engineering” to govern how queries were trafficked and executed to keep back and forth communication to a minimum, thereby reducing latency, he said.
Those familiar with SQL analytics or data engineering can explore the schema of their Delta Lake tables, to be able to “run SQL queries, and visualize the results,” Minnick said.
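For readers who want a feel for that explore, query and visualise loop, here is a minimal sketch. It is an assumption-laden stand-in, not Databricks' product or API: it uses Python's built-in sqlite3 module in place of a Delta Lake table, and the `events` table, its columns and the `render_bars` helper are all hypothetical names invented for illustration.

```python
import sqlite3

# Stand-in for a lake table. sqlite3 is used only so the sketch runs anywhere;
# it is NOT Delta Lake or the Databricks SQL Analytics API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("emea", 120.0), ("emea", 80.0), ("apac", 50.0)],
)

# Step 1: explore the schema (column names and declared types).
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(events)")]

# Step 2: run a BI-style aggregate query.
totals = conn.execute(
    "SELECT region, SUM(amount) AS total FROM events "
    "GROUP BY region ORDER BY region"
).fetchall()


# Step 3: a crude text "visualisation" of the result (hypothetical helper).
def render_bars(rows, unit=10.0):
    """Return one line per row, with one '#' per `unit` of the total."""
    return [f"{name:<5} {'#' * int(total // unit)}" for name, total in rows]


bars = render_bars(totals)
for line in bars:
    print(line)
```

A real lakehouse deployment would swap the in-memory database for tables in the data lake and the text bars for a dashboard, but the shape of the work is the same: inspect the schema, aggregate with SQL, then plot.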
While the Databricks SQL Engine might help bring BI work to the data lake, and help users get value from that messy repository of data, it is unlikely to replace established enterprise data warehouses any time soon, opined Philip Carnelley, associate vice president of software research at IDC.
“The idea is to give you the best of both worlds, there is some merit to that. But this is a solution for companies with lots of technical resources. This will run alongside other enterprise data tools. It might be that people use data warehouse systems like Teradata a bit less, because they have these tools as well, but they are not going to switch off the data warehouse any time soon,” Carnelley said.
Databricks was one of the main vendors behind Spark, a data framework designed to help build queries for distributed file systems such as Hadoop. Matei Zaharia, Databricks' CTO and co-founder, was the initial author of Spark. ®