10+ users can lead to washout: Data lakes struggle with SQL concurrency, says Gartner
We're working on it, Spark backer Databricks declares
Data lakes are struggling to support more than 10 users when they try to perform the SQL queries that were once seen as only fitting for data warehouse technologies, according to Gartner.
Apache Spark is the most widely used processing engine when working with data lakes, because it's a single framework that can do batch processing, real-time processing, machine learning, and graph processing. However, Spark is not well suited to having many business users query the data using SQL at the same time, said the analyst.
“Some of the challenges of data consumption from a data lake is the concurrency aspect. Heavy concurrency, even into double figures, can often bring down data lakes in terms of latency,” said Sumit Pal, Gartner analyst and senior director.
Data lake technologies have been working to make the freewheeling data they store more accessible to business users by supporting SQL. In November last year, for example, data management and machine learning framework biz Databricks previewed SQL Analytics for that very purpose. Built on Delta Lake, Databricks' open format data engine aims to bring order and performance to existing data lakes.
Meanwhile, AWS's data lake – Elastic MapReduce (EMR) – can handle SQL queries via SQL Workbench or Presto SQL. Azure supports SQL queries in its data lakes (HDInsight or Azure Databricks) while GCP uses a combination of Bigtable, Dataflow and BigQuery.
But these implementations are not able to handle the number of SQL queries supported by "traditional" data warehouses, some of which can scale to thousands of concurrent users.
Latency and concurrency an issue
Pal told the Gartner Data & Analytics Summit: "Data lakes are actually not being used for BI workloads, especially in large organizations that need high concurrency, as well as low latency. The SQL engines that have been developed on the data lakes have never really been able to keep up with the concurrency and latency requirements."
Speaking to The Register, Databricks CEO Ali Ghodsi said the concurrency issues were something the company had been aware of and was working hard to improve. "Concurrency is where things like Spark don't do well. And it's been a focus area for us.
"We were already world-class at really large warehouses: we can handle lots and lots of data and we can do it faster and better than anyone else, but when it's small and you have lots of different concurrency such as 32 users on the same warehouse, that was not necessarily our sweet spot," he said.
Ghodsi said SQL Analytics, which was first built in July last year, was not initially able to support 32 concurrent users, but results from February show it was able to handle 19,000 queries per hour from that number of users on a single SQL endpoint. To support more users, a customer could spin up more endpoints in the cloud, he said. ®
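For a sense of scale, the figures Ghodsi quoted can be turned into per-second and per-user rates with some simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the endpoint estimate assumes that capacity scales linearly, with each additional endpoint serving another 32 users, which the article does not confirm.

```python
import math

# Figures quoted in the article for a single SQL endpoint.
QUERIES_PER_HOUR = 19_000
CONCURRENT_USERS = 32


def throughput_summary(queries_per_hour, users):
    """Return (queries per second, queries per user per hour)."""
    per_second = queries_per_hour / 3600
    per_user_per_hour = queries_per_hour / users
    return per_second, per_user_per_hour


def endpoints_needed(target_users, users_per_endpoint=CONCURRENT_USERS):
    """Estimate endpoints for a target user count.

    Assumption (not from the article): capacity scales linearly,
    so each extra endpoint serves another `users_per_endpoint` users.
    """
    return math.ceil(target_users / users_per_endpoint)


per_second, per_user = throughput_summary(QUERIES_PER_HOUR, CONCURRENT_USERS)
print(f"{per_second:.2f} queries/second across the endpoint")
print(f"{per_user:.2f} queries/hour per user")
print(f"{endpoints_needed(1000)} endpoints for 1,000 users (linear estimate)")
```

Under that linear assumption, matching the thousands of concurrent users that some data warehouses are said to support would take dozens of endpoints rather than one.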