10+ users can lead to washout: Data lakes struggle with SQL concurrency, says Gartner

We're working on it, Spark backer Databricks declares

Mon 24 May 2021 // 12:10 UTC

Data lakes struggle to support more than 10 concurrent users running the kind of SQL queries once seen as fit only for data warehouse technologies, according to Gartner.

Apache Spark is the most widely used processing engine for data lakes because it is a single framework that can handle batch processing, real-time processing, machine learning, and graph processing. However, Spark is not suited to many business users querying the data with SQL at the same time, said the analyst.

“Some of the challenges of data consumption from a data lake is the concurrency aspect. Heavy concurrency, even into double figures, can often bring down data lakes in terms of latency,” said Sumit Pal, Gartner analyst and senior director.

Data lake technologies have been working to make the freewheeling data they store more accessible to business users by supporting SQL. In November last year, for example, data management and machine learning framework biz Databricks previewed SQL Analytics for that very purpose. It is built on Delta Lake, Databricks' open format data engine, which aims to bring order and performance to existing data lakes.

Meanwhile, AWS's data lake – Elastic MapReduce – can handle SQL queries via SQL Workbench or Presto SQL. Azure supports SQL queries in its data lakes (HDInsight or Azure Databricks), while GCP uses a combination of Bigtable, Dataflow and BigQuery.

But these implementations are not able to handle the number of SQL queries supported by "traditional" data warehouses, some of which can scale to thousands of concurrent users.

Latency and concurrency an issue

Pal told the Gartner Data & Analytics Summit: "Data lakes are actually not being used for BI workloads, especially in large organizations that need high concurrency, as well as low latency. The SQL engines that have been developed on the data lakes have never really been able to keep up with the concurrency and latency requirements."

Speaking to The Register, Databricks CEO Ali Ghodsi said the concurrency issues were something the company had been aware of and was working hard to improve. "Concurrency is where things like Spark don't do well. And it's been a focus area for us.

"We were already world-class at really large warehouses: we can handle lots and lots of data and we can do it faster and better than anyone else, but when it's small and you have lots of different concurrency such as 32 users on the same warehouse, that was not necessarily our sweet spot," he said.

Ghodsi said SQL Analytics, which was first built in July last year, was not initially able to support 32 concurrent users, but results from February show it was able to handle 19,000 queries per hour from that number of users on a single SQL endpoint. To support more users, a customer could spin up more endpoints in the cloud, he said. ®
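For a sense of scale, the figures Ghodsi quotes can be worked through on the back of an envelope. The Python sketch below is purely illustrative: the function names are made up for this example, and it assumes queries are spread evenly across users and that capacity grows linearly with each added endpoint. Neither assumption comes from Databricks or Gartner.

```python
import math

# Figures quoted in the article for a single SQL endpoint.
QUERIES_PER_HOUR = 19_000
USERS_PER_ENDPOINT = 32


def per_user_queries_per_hour(queries_per_hour=QUERIES_PER_HOUR,
                              users=USERS_PER_ENDPOINT):
    """Average queries per user per hour, assuming an even spread (assumption)."""
    return queries_per_hour / users


def queries_per_second(queries_per_hour=QUERIES_PER_HOUR):
    """Aggregate query rate for one endpoint, in queries per second."""
    return queries_per_hour / 3600


def endpoints_needed(total_users, users_per_endpoint=USERS_PER_ENDPOINT):
    """Endpoints required for total_users, assuming linear scaling (assumption)."""
    if total_users <= 0:
        return 0
    return math.ceil(total_users / users_per_endpoint)


if __name__ == "__main__":
    print(f"Per user: {per_user_queries_per_hour():.2f} queries/hour")
    print(f"Per endpoint: {queries_per_second():.2f} queries/second")
    print(f"Endpoints for 1,000 users: {endpoints_needed(1000)}")
```

On those assumptions, each of the 32 users averages just under 600 queries an hour, the endpoint handles a little over five queries a second, and matching the "thousands of concurrent users" claimed for traditional warehouses would take dozens of endpoints.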
