Google flaunts concurrency, optimization as cloud rivals overhaul platforms
Details sub-CPU allotments, performant Iceberg tables after Microsoft, Databricks bring market noise
Feature Last year was a big one for data analytics and ML in the cloud. Two of the biggest players, Microsoft and Databricks, both overhauled their platforms, with the former also managing to launch products.
Google, which as you'd expect is a big player in the cloud data analytics market, has scored customer wins with Walmart, HSBC, Vodafone, and Home Depot among others in the last few years, in some cases displacing well-established on-prem enterprise data warehouse systems from companies such as Teradata.
In terms of new tech, Google made additions and tweaks to its line-up in 2023 rather than the major platform announcements we saw from Microsoft and Databricks. Google's data warehouse BigQuery got auto-scaling and compressed storage, together with more choice and flexibility in setting up features for various workload requirements. Customers could also mix Standard, Enterprise, and Enterprise Plus editions to achieve their preferred price performance by workload. BigQuery Data Clean Rooms allowed the sharing and matching of datasets across organizations while respecting user privacy and upholding data security.
In AlloyDB Omni, Google offers PostgreSQL-compatible database services which work across the other cloud hyperscalers, on-prem and developer laptops. It includes a bunch of automation tools to help with migration from older, well-established database systems such as Oracle or IBM Db2.
But in terms of the data platform, where the main players serve up structured and unstructured workloads for BI, analytics and machine learning from a single place, adopting the suspect "lakehouse" terminology, Google already has what it needs to compete, Gerrit Kazmaier, veep and general manager of Google data analytics, tells The Register.
"You have the large analytical systems building these wide data records. It's very important to have them not only intertwined but, actually seamlessly integrated for instance, where you're not even replicating data right from one system to another: BigQuery is talking to the same data in the same location as a database writes it to. There is zero latency, there is zero overhead, there was no mirroring or replication required because basically you have access everywhere," Kazmaier says.
In Google's architecture, a unified access layer for security and governance links applications such as BI, data warehousing and ML to a backend served by BigQuery Managed Storage and Google Cloud Storage, as well as multi-cloud storage from AWS S3 and Microsoft's Azure Storage.
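As a rough illustration of that layering, and not of Google's implementation, the sketch below puts a single governance check in front of several storage backends. The `AccessLayer` class, the table names, the principals, and the backend labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AccessLayer:
    """Toy access layer: one permission check in front of every backend."""

    # table name -> storage backend that holds it (labels are hypothetical)
    catalog: dict
    # principal (an application or user) -> set of tables it may read
    grants: dict = field(default_factory=dict)

    def resolve(self, principal, table):
        """Return the backend holding `table` if `principal` may read it."""
        if table not in self.grants.get(principal, set()):
            raise PermissionError(f"{principal} may not read {table}")
        return self.catalog[table]


layer = AccessLayer(
    catalog={"sales": "bigquery_managed_storage", "clicks": "aws_s3"},
    grants={"bi_dashboard": {"sales"}, "ml_training": {"sales", "clicks"}},
)
print(layer.resolve("ml_training", "clicks"))
```

The point of the design is that BI, warehousing and ML callers all go through the same check, whichever cloud the data sits in.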
The architecture, in concept at least, is similar to Microsoft’s offering. Announced in June and becoming generally available in November, Microsoft Fabric also promises to serve various applications and workloads from its OneLake technology, which stores everything in the open-source, Linux Foundation-governed Delta table format, which originated with Databricks.
Microsoft explains that the approach allows applications such as Power BI to execute workloads on the Synapse data warehouse without sending SQL queries. Instead, a virtual data warehouse is created in OneLake, which loads the data into memory. The Redmond giant claims the approach offers performance acceleration because there's no more SQL tier in the middle of executing SQL queries.
Kazmaier says: “We took decades of innovations in BigQuery, specifically in query performance, access times, query optimization, and delivered them by a BigLake in a way so customers can get performance as well as the richness of the development from the Iceberg community. Specifically we have many optimizations from how we access and understand metadata from how we access files, which lead to superior performance with Iceberg and BigQuery on GCP.”
While all the main vendors in the space say they do, or will, support all the table formats (Iceberg, Delta and Hudi) built on the Apache Parquet file format, each has its own emphasis on which it supports “natively”. The trend has led to a split in the industry, with Databricks, Microsoft, and SAP backing Delta, and Google, Cloudera, Snowflake, AWS and IBM’s Netezza emphasizing Iceberg.
Kazmaier says Google’s support for Iceberg was down to a strong commitment to open source. "Iceberg is an Apache project: it is very clearly governed, it's not linked to any vendor, and there is a broad contribution from the community."
He says Google was reacting to customer demand in picking Iceberg as the "primary data strategy format," but it also added support for Delta and Hudi as some customers have already built a Databricks-centric stack.
"The real answer lies in how flexible you want to be as a customer. If you choose to be the most flexible and open, Iceberg gives you the broadest of these qualities. If you're more concerned about having a lakehouse architecture from a Databricks-centric deployment, Delta is a fine choice. We see very fast and broad adoption of Iceberg," he says.
Last month, Databricks, the data platform company which grew out of Apache Spark data lakes, also announced a major overhaul of its stack. It promises a new "data intelligence" layer on top of the "lakehouse" concept, which it launched in early 2020 to combine structured BI and analytics workloads of data warehousing with the messy world of data lakes. In an announcement short on product details, the company said it is introducing the "data intelligence" layer, DatabricksIQ, to "fuel all parts of our platform."
While retaining the lakehouse's unified governance layer across data and AI and a single unified query engine to span ETL, SQL, machine learning and BI, the company wants to move on to exploit the technology gained in its $1.3 billion buy of MosaicML, a generative AI startup. The idea is to employ "AI models to deeply understand the semantics of enterprise data," Databricks says.
Although Databricks' lakehouse supports SQL queries, there has been some criticism of its ability to support BI workloads at enterprise scale. In 2021, Gartner pointed out that cloud-based data lakes might struggle with SQL queries from more than 10 concurrent users, although Databricks disputed the claim. Last month, Ventana Research analyst Matthew Aslett said more organizations are becoming aware of the difficulties as they attempt to scale data lakes and support enterprise BI workloads.
For example, Adidas has built a data platform around Databricks, but also created an acceleration layer with the in-memory database Exasol to improve performance on concurrent workloads.
Kazmaier explains that Google’s approach to concurrency avoids spinning up more virtual machines and instead tunes performance at the level of sub-CPU capacity units. “It moves these capacity units seamlessly around, so you may have a query which is finishing and freeing up resources, which can be moved immediately to another query which can benefit from acceleration. All of that micro-optimization takes place without the system sizing up. It's constantly giving you the ideal projection of the capacity you use on the workloads you run,” he says.
A Gartner paper published earlier last year approved of the approach. "A mix of on-demand and flat-rate pricing slot reservation models provides the means to allocate capacity across the organization. Based on the model used, slot resources are allocated to submitted queries. Where slot demand exceeds current availability, additional slots are queued and held for processing once capacity is available. This processing model allows for continued processing of concurrent large query workloads," it says.
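To make the slot model concrete, here is a toy simulation of a fixed pool of slots shared between queries. It is a sketch under stated assumptions, not BigQuery's scheduler: the `Query` fields, the `run` function, the one-tick time step, and the rule that queued queries are served before running ones get extra slots are all invented for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Query:
    name: str
    slots_wanted: int    # most slots the query can use at once
    work: int            # remaining work, measured in slot-ticks
    slots_held: int = 0


def run(queries, total_slots):
    """Simulate a fixed slot pool; return {query name: tick it finished}."""
    waiting = deque(queries)   # queries that hold no slots yet, FIFO order
    running = []
    finished = {}
    tick = 0
    while waiting or running:
        tick += 1
        free = total_slots - sum(q.slots_held for q in running)
        # Demand beyond capacity stays queued until slots are free.
        while waiting and free > 0:
            q = waiting.popleft()
            q.slots_held = min(q.slots_wanted, free)
            free -= q.slots_held
            running.append(q)
        # Slots left over go to running queries that can still use more.
        for q in running:
            extra = min(q.slots_wanted - q.slots_held, free)
            q.slots_held += extra
            free -= extra
        # Each held slot does one unit of work per tick.
        for q in running:
            q.work -= q.slots_held
        # Finished queries leave the pool, releasing their slots.
        for q in [q for q in running if q.work <= 0]:
            finished[q.name] = tick
            running.remove(q)
    return finished


jobs = [Query("a", 4, 4), Query("b", 4, 12), Query("c", 2, 4)]
print(run(jobs, total_slots=6))
```

In this run, query "c" waits in the queue until "a" finishes, and "b" picks up two of the slots "a" frees, all without the pool growing beyond six slots.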
While Microsoft and Databricks may have caught the market's eye with their 2023 data stack announcements, Ventana’s Aslett reckons there was little to choose between the main players, and any apparent technology lead can be down to release cadence.
Looking ahead to the coming year, Google might hope to steal some of the recent limelight back from its rivals. ®