Need to build a Big Data app but can't be bothered to learn Python or Scala? Good news: .NET for Apache Spark is here
Stay safe and warm in your C# cocoon
Good news landed today for data dabblers with a taste for .NET - Version 1.0 of .NET for Apache Spark has been released into the wild.
The release was a few years in the making, with a team pulled from Azure Data engineering, the previous Mobius project, and .NET toiling away on the open-source platform. The activity was driven by demand from the .NET community for a way to build big data applications without having to learn Scala or Python (although we'd contend the latter at least would be worth picking up if Stack Overflow's surveys are anything to go by).
The project, operated under the .NET Foundation, first went public last year at Microsoft Build 2019 and the Databricks Spark+AI Summit 2019. Twelve pre-release editions later and here we are.
Version 1.0 includes the ability to write Apache Spark applications using .NET user-defined functions (UDFs) and .NET apps targeting .NET Standard 2.0 (.NET Core 3.1 or later is recommended). There is also support for the Apache Spark 2.4 and 3.0 DataFrame APIs (including the ability to write Spark SQL) and an API extension framework to add support for additional Spark libraries.
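For anyone wondering what the UDF, DataFrame, and Spark SQL combination looks like in practice, here is a minimal sketch in PySpark, the Python route the .NET binding is meant to spare you. The function name `age_summary`, the sample rows, and the `people` view are all made up for illustration, and nothing here is taken from the .NET for Apache Spark codebase; treat it as a picture of the pattern rather than the project's own API.

```python
def age_summary(age, name):
    """Plain Python function that Spark calls once per row when used as a UDF."""
    if age is None:
        # Spark passes SQL NULL to a Python UDF as None.
        return f"{name} (age unknown)"
    return f"{name} will be {age + 10} in ten years"


def run_with_spark():
    """Wire age_summary into Spark. Needs the pyspark package and a Java runtime."""
    # Imported lazily so the rest of the file loads without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    # Hypothetical sample rows standing in for a real data source.
    people = spark.createDataFrame(
        [("Ada", 36), ("Linus", None)],
        "name string, age int",
    )

    # DataFrame API: wrap the function as a UDF and apply it to two columns.
    summary_udf = udf(age_summary, StringType())
    people.select(
        summary_udf(people["age"], people["name"]).alias("summary")
    ).show(truncate=False)

    # Spark SQL: register the same function by name and call it from a query.
    spark.udf.register("age_summary", age_summary, StringType())
    people.createOrReplaceTempView("people")
    spark.sql(
        "SELECT age_summary(age, name) AS summary FROM people"
    ).show(truncate=False)

    spark.stop()
```

Calling `run_with_spark()` needs the `pyspark` package and a Java runtime on the machine; `age_summary` on its own is ordinary Python and can be tried without either.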
Back in 2019, at the time of the first preview, a doubling of performance over Python on the TPC-H benchmark was claimed for some operations. Microsoft said today that little has changed in version 1.0 and users can expect things to be at least as fast as PySpark programs for apps that use UDFs.
"Often faster," noted the team modestly.
Going forward, getting the likes of Language Integrated Query (LINQ) working is important for C# programmers, particularly as C# faces increasing competition from languages such as Python. Adding some extra deployment options is also a priority, as is integration with DevOps pipelines.
The project is built into Azure Synapse and Azure HDInsight, and version 1.0 will creep into the next major release. It can also be used in AWS EMR Spark and, as seems to be increasingly the norm for Microsoft nowadays, straddles Windows, macOS, and Linux if on-premises deployments are your thing. ®