Hey, Sparky: Confused by data science governance and security in the cloud? Databricks promises to ease machine learning pipelines

You know the one: that pothole-ridden journey from on-prem to the fluffy white stuff

Databricks, the company behind analytics tool Apache Spark, is introducing new features to ease the management of security, governance and administration of its machine learning platform.

Security and data access rights have been fragmented between on-premises data, cloud instances and data platforms, Databricks told us. And the new approach allows tech teams to manage policies from a single environment and have them replicated in the cloud, it added.

David Meyer, senior veep of product management at Databricks, said:

"Cloud companies have inherent native security controls, but it can be a very confusing journey for these customers moving from an on-premise[s] world where they have their own governance in place, controlling who has access to what, and then they move this up to the cloud and suddenly all the rules are different."

The idea behind the new features is to let users employ controls they are already familiar with, such as Active Directory, to manage data policies in Databricks. The firm then pushes those controls out into the cloud, he said.

The new features include user-owned revocable data encryption keys and customised private networks run in cloud clusters, allowing companies to tailor the security services to their enterprise and compliance requirements.

To ease administration, users can audit and analyse all the activity in their account, and set policies to administer users, control budget and manage infrastructure.

Meanwhile, the new features are meant to help customers deploy analytics and machine learning by offering APIs for everything from user management, workspace provisioning and cluster policies to application and infrastructure monitoring. That lets data ops teams automate the whole data and machine learning lifecycle, according to Databricks.

Meyer added: "All the rules of the workspaces have to be done programmatically because that's the only way you can run things at scale in an organisation."

Databricks, whose co-founder and CTO Matei Zaharia was the original Spark author, is currently available on AWS and Azure. Although plans are in place to launch on Google Cloud Platform, "it was a question of timing," the exec added.

Dutch ecommerce and banking group Wehkamp has been using Databricks since 2016. In the last two years it has introduced a training programme to help users from across the business - from IT operations to marketing - do their own machine learning projects on Spark.

The new security and governance features will help support such a large volume of users without creating a commensurate administration burden, said Tom Mulder, lead data scientist at Wehkamp. "We introduced a new strategy which was about teaching data science to everybody in the company which actually means we have about 400 active users and 600 jobs running in Databricks," Mulder said.

Use cases include onboarding products for resale: natural language processing helps the retailer parse data from suppliers into its own product management system, avoiding onerous re-keying and saving time.

Mulder said he was looking forward to the new security and governance features to help manage such a wide pool of users. "The way Databricks is working to introduce the enterprise features and all the management tools, that will help a lot."

Managing data and users in a secure way, which complies with company policy and regulations, is a challenge as data science scales up from a back-room activity led by a handful of data scientists to something in which a broader community of users can participate. Databricks is hoping its new features addressing data governance and security will ease punters along that path. ®