Microsoft touts mirroring over moving in data warehouse gambit
Fabric update cuts against the grain, and may have more to do with Databricks partnerships
Ignite Microsoft is advising customers using its Fabric platform to copy data from other data warehouses and analytics systems in a move against the prevailing industry trend.
Fabric – which encompasses data warehouse, data lake, analytics, BI, and machine learning – was launched earlier this year, promising to address "every aspect of an organization's analytics needs."
At the Redmond software giant's Ignite conference this week, Microsoft announced its general availability, as well as a few new features.
Among them is Mirroring, a way to add and manage existing cloud data warehouses and databases in Fabric's Synapse Data Warehouse system. Microsoft said Mirroring replicates a snapshot of the external database to OneLake in Delta Parquet tables and keeps the replica synced in "near real time."
From there, users can create shortcuts to allow other Fabric workloads – connectors, data engineering, building AI models, data warehousing – to use the data without moving it again. Microsoft promised Azure Cosmos DB and Azure SQL DB customers would be able to use Mirroring to access data in OneLake, while customers of cloud data platform Snowflake and NoSQL database MongoDB would be able to do the same.
The move goes some way to executing a trend seen in the data warehouse and analytics space over the last couple of years. By supporting the Delta table format, other compatible analytics engines will be able to access and use the data in OneLake without moving it.
Delta is supported by application giant SAP and Databricks.
But others have adopted a different table format – Apache Iceberg – for a similar objective. They include Snowflake, Cloudera, and Google's BigLake.
Iceberg and Delta are effectively metadata layers on the Apache Parquet data storage format.
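The metadata-layer idea can be sketched with a toy transaction log in Python. This is an illustration of the general concept only, not the real Delta or Iceberg implementations: immutable data files (Parquet in the real formats, plain text here) plus an append-only log that defines the table's current snapshot.

```python
import json
import os
import tempfile

# Toy illustration (NOT real Delta/Iceberg code): both formats layer a
# transaction log of metadata over immutable data files. Plain text files
# stand in for Parquet here, and a JSON log records which files make up
# the current table snapshot.

root = tempfile.mkdtemp()
log_dir = os.path.join(root, "_log")  # Delta uses a "_delta_log" directory
os.makedirs(log_dir)

def add_file(name, rows, version):
    """Write a data file and append a log entry committing it."""
    with open(os.path.join(root, name), "w") as f:
        f.write("\n".join(rows))
    entry = {"version": version, "add": name}
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        json.dump(entry, f)

add_file("part-000.data", ["alice,1", "bob,2"], version=0)
add_file("part-001.data", ["carol,3"], version=1)

# A reader reconstructs the table by replaying the log in order, never by
# scanning the directory -- that is what makes snapshots consistent.
entries = sorted(os.listdir(log_dir))
files = []
for e in entries:
    with open(os.path.join(log_dir, e)) as f:
        files.append(json.load(f)["add"])

print(files)  # -> ['part-000.data', 'part-001.data']
```

Because the log, not the directory listing, defines the table, any engine that understands the log format can read the same files consistently – which is why format support, rather than data location, is the battleground here.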
Although both formats – as well as Apache Hudi – are designed to help bring analytics engines to the data, avoiding the cost of moving it, Microsoft argues that copying data from other sources is necessary to get better performance.
Microsoft Ignite at a glance
- The Register: You get a Copilot, and you get a Copilot – Microsoft now the Copilot company
- The Next Platform: Microsoft holds chip giants' feet to the fire with homegrown CPU, AI silicon
- The Register: Microsoft's Swiss army knife app aims to cut through cloud clutter
- The Register: Microsoft takes aim at on-desk, non-cloudy developers with Windows AI Studio
- The Register: Databricks' lakehouse becomes foundation under fresh layer of AI dreams
- Microsoft: All of Redmond's announcements – and event homepage and sessions.
Speaking to The Register, Arun Ulag, corporate vice president of Azure Data, said the idea behind Mirroring was to allow customers who have data sitting in proprietary databases and data warehouses – Snowflake, for example – to create and maintain a replica in OneLake.
Although it might require storing the data in two places, Ulag argued there would be performance advantages.
"The majority of the Snowflake data is not sitting in Iceberg," he said, "but in their own proprietary database. Like other data in a proprietary data format, the only way to touch the data is to go through the SQL interface, which drives up costs for customers. It also means that there's another tier of execution which slows down performance."
Once the data is copied to Fabric, Power BI, for example, doesn't even have to send SQL queries to Snowflake because the data is sitting in Apache Parquet and Delta Lake, OneLake's native format. "It will simply go to OneLake and paste it into memory when queries come in," Ulag said. "It gives you a significant performance acceleration because you know you're eliminating that whole SQL execution."
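Ulag's extra-execution-tier argument can be illustrated with a minimal, self-contained sketch. SQLite stands in for any SQL engine here, and a plain Python list stands in for open-format files – none of this is Fabric or Snowflake code – but it shows the two paths to the same answer:

```python
import sqlite3

# Toy sketch of the argument: querying through a SQL engine adds an
# execution tier that a direct read of open-format files avoids.

rows = [("alice", 1), ("bob", 2), ("carol", 3)]

# Path 1: the "proprietary engine" route -- every access goes through SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (name TEXT, n INTEGER)")
db.executemany("INSERT INTO t VALUES (?, ?)", rows)
via_sql = db.execute("SELECT SUM(n) FROM t").fetchone()[0]

# Path 2: the "open format" route -- the engine reads the data directly
# (here, a plain list stands in for Parquet/Delta files in OneLake).
direct = sum(n for _, n in rows)

assert via_sql == direct == 6  # same answer, one fewer execution tier
```

Whether skipping the SQL tier wins in practice depends on the engines involved, which is exactly the point Snowflake disputes below.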
James Malone, Snowflake director of product management, told The Reg: "At Snowflake we believe in eliminating copies of data in order to simplify governance and have greater efficiencies. The needs of our customers vary greatly, so our approach is to give customers options that align with their needs.
"Many customers find great value in simplicity, security, and performance by loading data into Snowflake with our fully managed format. And some use cases prioritize interoperability, in which case we're supporting Iceberg so that it's fully open and just works in customers' storage across any of the clouds Snowflake supports, including Azure," Malone added.
- Microsoft Fabric promises to tear into the enterprise analytics patchwork
- Tabular's Iceberg vision goes from Netflix and chill to database thrill
- Databricks shakes VC money tree and $500M falls out
- Apache Iceberg promises to change the economics of cloud-based data analytics
One industry expert said Microsoft will need to copy the data to get better query performance until it natively supports Iceberg, which Microsoft has said it will do in the future. They added that Microsoft may also believe it can achieve better query performance than Snowflake through the way it controls data clustering.
Hyoun Park, CEO and chief analyst with Amalgam Insights, said: "Microsoft would be glad to take any Parquet files and put them into a Microsoft data lake and would be glad to take any Snowflake data that it can get in the process."
But behind the scenes, there may be reasons Microsoft is focusing on Delta rather than Iceberg for the time being.
"We know that there is only one major company that has focused on the Delta Lake format so far, and that is the powerhouse startup Databricks," Park said. "There is an Azure Databricks product as well, and it has been doing very well. In fact, it may be the most successful product on Microsoft Azure. Our data shows it is currently a multibillion-dollar business when considering the data lake and associated analytic and machine learning workloads.
"Microsoft has made no secret of the fact that it is staking a lot of its near-term growth on AI. This means that Microsoft wants to be able to support a Delta Lake format, and do as much of the work themselves on their own infrastructure and resources."
Park said Microsoft also has a lot of Azure cloud business that is directly reliant on Databricks and would want to make sure it does everything possible to not lose that business. "Although Iceberg is a more prevalent data lake standard, when looking across the IT vendor landscape, Databricks has been very successful in providing machine learning infrastructure at the data level," he said.
However, he said Microsoft would eventually be a significant Iceberg contributor as well.
At Ignite, Microsoft said it would extend its Copilot chatbot to Fabric. Now in public preview, the move promises to allow data scientists to use natural language to create dataflows and pipelines, write SQL statements, build reports, and develop machine learning models. ®