Microsoft Fabric promises to tear into the enterprise analytics patchwork
Meanwhile, users are left to figure out how to cut their cloth
A relative newcomer to the enterprise data and analytics world, Microsoft didn't hold back when it launched its Fabric platform last month.
Against companies like SAS and Teradata – with more than 95 years of history between them – the Redmond software giant, which only launched its Synapse data warehouse in 2019, promised to address "every aspect of an organization's analytics needs."
It is a bold claim to make to organizations whose needs may already be served by complex layers of vendors, technologies, and architectures, each addressing different business needs or user populations.
Microsoft's decision to jump in with both feet was foreshadowed by moves from other big hitters in cloud-based data lakes, warehouses, and analytics.
In January last year, cloud-based data warehouse company Snowflake announced external table support for Apache Iceberg in private preview, followed by general availability in the summer. Cloudera followed suit in July, while Google announced its support for the open source table format in October last year.
All this matters because it promises to change the economics of analytics, allowing users to bring analytics to the data rather than spend money and effort moving data into a specific repository.
Now Microsoft is doing something similar, in a slightly different way. The company has announced its support for the Delta table format, which is open source under the Linux Foundation but gets the majority of its contributions from Databricks, the AI and analytics company best known for commercializing the unified analytics engine Apache Spark. SAP has also backed Delta through its partnership with Databricks, although both companies said they would support Iceberg and Hudi, another table format, in the fullness of time.
But Microsoft went with Delta owing to market demand, Arun Ulag, corporate vice president of Azure Data, told The Register.
"If you bring data into the data warehouse, it's putting data in its own proprietary format, which from a customer perspective is not great because they feel locked-in: each time they touch their own data, they have to pay somebody to be able to do that. So, in Fabric that goes away. The native format for Fabric is the open source data format, which from a customer perspective has been really exciting because it liberates the data, it allows them to use the entire ecosystem of open source tools against the data," he said.
Although support for Iceberg and Hudi will be coming externally, Ulag explained that, by default, Microsoft Fabric would favor Delta and Apache Parquet, the column-oriented data file format.
"We have introduced in Fabric our native format, which by default is Delta and Parquet," he said. "It is a big deal because it's not an external table. It's not something that, if the data exists, you link to Fabric. You build a data warehouse and by default, the data is in Delta-Parquet. That's a huge step forward because we've had to do a lot of performance optimizations to make sure that the kind of performance we can deliver on Delta-Parquet is industry-leading."
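Ulag's lock-in point is concrete: a Delta table is just a directory of Parquet data files plus a `_delta_log/` folder of ordered, plain-JSON commit files, so any tool that can read JSON and Parquet can reconstruct the table's state without a proprietary engine. The sketch below (Python standard library only; a deliberately simplified take on the public Delta transaction log protocol, with toy file names) builds a two-commit log and replays it to list the table's live data files.

```python
import json
import os
import tempfile

# A Delta table is a directory of Parquet files plus a _delta_log/
# folder of ordered JSON commits. Each commit holds "actions":
# "add" registers a data file, "remove" tombstones one.
# Simplified sketch of the public protocol for illustration.

def write_commit(log_dir, version, actions):
    # Commit files are named by zero-padded version, e.g.
    # 00000000000000000000.json, with one JSON action per line.
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def live_files(log_dir):
    # Replay the commits in version order to compute the current
    # snapshot: the set of data files that are added but not removed.
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return sorted(files)

table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_delta_log")
os.makedirs(log_dir)

# Version 0 adds two Parquet files; version 1 compacts one away.
write_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}},
                          {"add": {"path": "part-0001.parquet"}}])
write_commit(log_dir, 1, [{"remove": {"path": "part-0000.parquet"}},
                          {"add": {"path": "part-0002.parquet"}}])

print(live_files(log_dir))  # → ['part-0001.parquet', 'part-0002.parquet']
```

Because the log is append-only JSON and the data is Parquet, the same directory can be read by Spark, Power BI, or any other engine that implements the protocol, which is the substance of the "no lock-in" argument.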
While Fabric will be able to link to and access data held in Delta-Parquet – and eventually other formats – elsewhere, there were cost and performance advantages in doing it all in Fabric.
Microsoft Fabric uses a virtualized data lake called OneLake, which is built on the existing Azure Data Lake Storage Gen 2 but adds shortcuts to data in AWS S3 and, soon, Google Cloud Storage. There are seven core workloads in Microsoft Fabric: Data Factory (connectors), Synapse Data Engineering (authoring for Apache Spark), Synapse Data Science (build AI models), Synapse Data Warehousing, Synapse Real Time Analytics, Power BI, and Data Activator (monitoring data and triggering notifications and events).
The advantages of working in Delta come with combining these workloads, Ulag claimed.
"You use Power BI against a Synapse data warehouse and Power BI does not even send SQL queries to Synapse anymore in Fabric," he said. "It simply goes to OneLake and pages the data into memory, which then gives customers massive performance acceleration because there's no more SQL tier in the middle executing SQL queries. Power BI is simply working with the data in OneLake, because that's its native format. It's also a huge cost reduction for customers, because there are no SQL queries to be paid for."
Microsoft calling its product Fabric is bound to introduce some confusion because – for good or for ill – the industry has coalesced around the concept of a data fabric independent of vendor products.
Robert Thanaraj, Gartner director for data management, explained that organizations which find too many copies of data and too many siloed stores, with too little information about the nature of that data shared in a consistent manner, might find the data fabric concept appealing.
"It's the human-centric approach to data analytics and AI. With a data fabric, organizations are looking at getting an enterprise view of what exactly is happening, within my systems, within my business processes and within the different teams," he said.
Gartner has estimated that by 2025, chief data and analytics officers will have adopted data fabric as a "driving factor in successfully addressing data management complexity, thereby enabling them to focus on value-adding digital business priorities."
While it was true that Microsoft's Fabric products could deliver performance and cost advantages by using shortcuts to data, rather than moving it, those advantages would not be retained when accessing data outside the Fabric environment.
Users already working with Iceberg or Hudi would need to move to gain the cost and performance advantages of Fabric.
"You may be able to create shortcuts, but for performance reasons, you will need to migrate. It's one thing to make sure you connected all the plugs, but it's another to go live at scale for my enterprise. It's a whole new ballgame. Can it work? Yes, it can. Will that be enough? I don't think so," Thanaraj told The Register.
Suffice to say, Microsoft is not the only vendor with a desire to become the locus of control in an enterprise data strategy that contains many moving parts. Snowflake, Cloudera, and Google have already staked their claims.
As the dominant cloud platform, AWS has its own approach. Ganapathy Krishnamoorthy, AWS vice president of analytics services, said taking a one-size-fits-all approach to analytics eventually leads to compromises.
As an alternative, "Amazon S3 offers integration with all AWS services, delivering proven stability and security at any scale."
Krishnamoorthy said Amazon S3 customers could use the open data format of their choice, including Apache Iceberg, Hudi, and Delta Lake. "AWS supports all three major table formats and provides guidance to help customers select an open table format based on their unique needs," he said.
He claimed Redshift offered five times better price-performance than other cloud data warehouses.
Google declined the opportunity to put forward an interviewee.
While Microsoft threatens to shake up the market in enterprise data products, it is too early to judge whether Fabric, currently only available in preview, will meet customer expectations, Gartner's Thanaraj said.
"It will take another 12 months before this product could be GA. You need to see proof of the level of maturity of this product, with system integrators, not just depending upon Microsoft. Just be aware of it. If possible, do a prototype, explore and test. Take a first-hand view, but don't jump yet," he said.
Ian Cowley, head of data engineering at consultancy Ensono, said Microsoft's decision to pick Delta over Iceberg was simply a sign of customer preference and the maturity of the format. Other formats would be supported in time, he said.
But the vendor's plan to support a disparate set of technologies with open formats could ultimately see market consolidation based on users' primary cloud providers, he said.
"It does look arrow-shaped, because they were very fragmented five years ago. But if you think about it, all of these platforms have some sort of Spark equivalent, we are using more common open source file types like Iceberg and Parquet.
"They are separate, but more and more that they're all headed in the same direction. There will be some kind of unification eventually."
In the end, the fabric that was designed to knit together different data sources and analytics environments may be the thread that leads to greater consolidation in the market. ®