Microsoft tries to Spark relationship with cluster lusters: Promises 5-min big data bang on Azure

Aims to have Apache Spark running in time it takes to make cuppa

First apps on Windows, then Linuxes in Hyper-V and on Azure, now big data via Spark. In another effort to win over the open source crowd, Microsoft has made the speedy big data engine Apache Spark easier to set up and use on Azure, giving devs a dedicated tool to help provision clusters.

The open-source "Azure Distributed Data Engineering toolkit", which integrates with Docker containers, enables devs to submit jobs and provision on-demand Spark clusters from the command line.

Apache Spark boasts that it can "run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk". The tricky part can be configuration – companies such as MemSQL have approached this by offering a way to use Spark without writing code.

"Spinning up a Spark cluster, on-demand, can often be complicated and slow," Microsoft program manager JS Tan wrote in an Azure blog post. "Spark developers often share [static] pre-existing clusters managed by their company's IT team," which means "you're either out of capacity, or you're burning dollars on idle nodes."

The new toolkit, based on Azure Batch, can provision a cluster within three to five minutes, according to Microsoft. As usual, you'd still pay for the cores you use.
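To put rough numbers on the trade-off Tan describes, here is a minimal back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the node count, the made-up price per node-hour, and the idea that start-up minutes are billed are not taken from Microsoft's toolkit or from Azure's price list. Only the five-minute provisioning figure comes from the claim above.

```python
# Back-of-envelope comparison: always-on shared cluster vs. on-demand cluster.
# All prices and sizes are hypothetical, not real Azure pricing.


def static_cluster_cost(nodes, hours_reserved, price_per_node_hour):
    """Cost of a fixed cluster that stays up whether or not jobs are running."""
    return nodes * hours_reserved * price_per_node_hour


def on_demand_cost(nodes, job_hours, price_per_node_hour, provision_minutes=5.0):
    """Cost of a cluster created per job.

    Assumes (hypothetically) that the start-up minutes are billed too.
    """
    billed_hours = job_hours + provision_minutes / 60.0
    return nodes * billed_hours * price_per_node_hour


NODES = 10    # hypothetical cluster size
PRICE = 0.20  # hypothetical dollars per node-hour

# A shared cluster kept running all day, against one three-hour job on demand.
static = static_cluster_cost(NODES, hours_reserved=24, price_per_node_hour=PRICE)
on_demand = on_demand_cost(NODES, job_hours=3, price_per_node_hour=PRICE)

print(f"always-on: ${static:.2f}  on-demand: ${on_demand:.2f}")
```

With these invented figures the idle hours dominate the bill. The gap narrows as a shared cluster gets busier, which is the "out of capacity" half of Tan's complaint.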

For now, this toolkit is Spark-specific. But Tan added: "We plan to support other distributed data engineering frameworks in a similar vein." ®
