Google Cloud Platform is releasing an alpha of its Dataproc service running on Google Kubernetes Engine (GKE) clusters, initially for Apache Spark jobs only.
Apache Spark is a cluster computing framework used as a processing engine for ETL (Extract, Transform, Load) or data science applications. It is often used with Apache Hadoop, which provides YARN (Yet Another Resource Negotiator) for managing resources and HDFS for distributed data storage. Google's Dataproc service offers Hadoop and Spark on Google Cloud Platform. It is similar to the managed Hadoop services on AWS, which has Amazon EMR (Elastic MapReduce), and on Microsoft Azure, which has HDInsight.
Google, which created Kubernetes (K8s) for orchestrating containers on clusters, is now migrating Dataproc to run on K8s – though YARN will continue to be supported as an option.
What's wrong with YARN? "It's a fairly heavyweight stack," James Malone, Google Cloud product manager, told The Reg. "People have been trying to replace YARN for quite a while. YARN was initially designed to run on bare metal, it has been adapted to run on VMs, but the YARN management layer is moving at a slower pace than K8s, which has driven a lot of customer interest in K8s."
Malone says there are "a few customer pain points with YARN" and that "using containers in YARN right now is fine but not great. It wasn't designed from the ground up for containerized workloads."
Google's solution started with the development of an open-source Spark on Kubernetes operator.
"We did this as a first step to start moving the ecosystem to start running on Kubernetes. You can run Spark on K8s anywhere and that's OK with us," said Malone.
Running Spark on K8s will give "much easier resource management", according to Malone. This includes features like auto scaling and auto healing.
Moving to K8s does require containerization of your Spark code but Google reckons this is a good thing. "If I actually take my Spark code and containerize it, it gives you easier development, test and production lifecycles. It also means if a new version of Spark comes out, you don't necessarily have to wait for the entire distribution to be updated. You just update that component and use it," said Malone.
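To give a rough idea of what running containerized Spark code on K8s involves, here is a minimal sketch in Python that assembles a command line for upstream Spark's own spark-submit tool, which accepts a k8s:// master URL and a container image setting. This is not Dataproc's tooling, and the API server address, image name, application path and the build_spark_submit helper are all hypothetical placeholders.

```python
import shlex


def build_spark_submit(api_server, image, app_path, executors=2, name="example-job"):
    """Return a spark-submit argument list targeting a Kubernetes cluster.

    Hypothetical helper for illustration only; every argument value is a
    placeholder supplied by the caller.
    """
    if executors < 1:
        raise ValueError("executors must be at least 1")
    return [
        "spark-submit",
        # Spark addresses a Kubernetes API server with the k8s:// prefix.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod on the cluster.
        "--deploy-mode", "cluster",
        "--name", name,
        # The container image holding Spark plus the packaged workload.
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        # A local:// path refers to a file already inside the image.
        app_path,
    ]


if __name__ == "__main__":
    cmd = build_spark_submit(
        api_server="https://203.0.113.10:443",          # placeholder address
        image="gcr.io/example-project/spark-job:v1",    # placeholder image
        app_path="local:///opt/spark/app/job.py",       # placeholder path
    )
    print(shlex.join(cmd))
```

The point of the sketch is that the Spark version travels inside the image: swapping in a newer Spark means changing the image tag, not waiting for a whole distribution to be updated.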
Today's release of Spark on K8s is alpha, so for testing and experimentation only. "We're starting with Spark. We will offer eventually two versions of Dataproc. One will be YARN based, which will exist for the foreseeable future. We will also have a Kubernetes flavour and we expect a lot of new build development will be on K8s because it offers a better development experience," Malone told us.
Another implication is that integrating components from third-party vendors may be simplified. According to Malone: "It's doable to install customised vendor components on the YARN ecosystem, but it’s not a great experience. Kubernetes makes that much easier." He further remarked that "this is also where we see the business model for a lot of open-source components heading. Instead of trying to have these companies own the end-to-end experience, we're trying to remove the need for them to go and develop things like cluster-management tools and deployment tools."
In other words, Malone foresees third-party component vendors relying on public cloud platforms like Google's, something that sounds good for Google but not necessarily good for the third parties.
Spark is only one of the projects for which Google is creating K8s operators. Others include Apache Flink, Presto and Apache Druid.
What if you want to convert an existing Dataproc project to use K8s? Malone said it should be straightforward. "The programming models don't change. Spark is Spark. You may have to package up and containerize your workload but generally that's not difficult, and once you do that your code is a lot more portable."
On the management side, using Google's Dataproc API or cloud console will be easier than having to deal directly with K8s. "We also will support using the K8s command line tools if you want, but most customers end up using our client tools because they are fully supported, they have a lot of extra error checking, they are designed to support most of what customers want to do," said Malone.
A limitation in the initial release is that "Dataproc will be creating Spark resources on an existing K8s cluster," Malone told us. "But considering Dataproc can create and destroy cluster resources on its own today, it's just a matter of time before we support the creation and destruction of K8s clusters."
In the case of hybrid or multi-cloud scenarios, Google's Anthos project gives you flexibility about where to run your code. "Our vision is that Dataproc is that single control plane whether you're running on Google Cloud or on-prem in Anthos or in another cloud in Anthos," said Malone.
The momentum behind K8s makes this kind of transition inevitable. Google Cloud Platform trails behind AWS and Microsoft Azure in market share, so it is also natural for it to take advantage of its position as the inventor of K8s to push its further use. ®