Informatica hopes to unclog your data pipelines with help from Nvidia in accelerating Spark-based ML operations

Any significant improvement in processing times will be a boon to productivity, say analysts

Informatica has announced a serverless, Spark-based data integration engine intended to accelerate data engineering for machine learning in the cloud using Nvidia GPUs.

Within the vendor's integration platform-as-a-service, CDI Elastic, microservices support different data management activities. The new capability is targeted at extract, transform, load (ETL)-type workloads.

"You may have data sitting in a data lake, and you want to do parallel processing of that data at high performance. That's what CDI Elastic services are designed for," said Rik Tamm-Daniels, VP strategic ecosystems.

Informatica is now giving customers the option of running that processing on GPUs where their cloud provider offers Nvidia processors. The company claimed this could offer five times the processing speed for data preparation used in analytics, machine learning, and data science projects.

Also new to the service is support for Nvidia's Rapids suite of software libraries for data science.

"CDI Elastic takes your visual, no-code design mappings and translates them into Spark-native code that takes advantage of the parallel processing power of Spark," Tamm-Daniels said. "This new capability takes advantage of what's called the Rapids open-source library that Nvidia makes available to be able to take those Spark-based jobs and optimise them for running on GPUs."
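For readers wondering what pointing a Spark job at a GPU looks like in practice, here is a minimal sketch. It assumes Nvidia's open-source RAPIDS Accelerator plugin for Apache Spark and a Spark 3 cluster with GPUs attached. The job name `etl_job.py` and the helper `spark_submit_args` are hypothetical, and nothing here is taken from Informatica's product, which generates and submits its Spark code on the customer's behalf.

```python
# Illustrative only: Spark settings commonly used to switch on Nvidia's
# RAPIDS Accelerator plugin. This is an assumption about a typical setup,
# not a description of how CDI Elastic configures its jobs.
RAPIDS_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",  # load the accelerator plugin
    "spark.rapids.sql.enabled": "true",             # let SQL/DataFrame ops run on the GPU
    "spark.executor.resource.gpu.amount": "1",      # request one GPU per executor
}


def spark_submit_args(app, conf):
    """Build a spark-submit command line (hypothetical helper)."""
    args = ["spark-submit"]
    # Sort the keys so the generated command is deterministic.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# "etl_job.py" is a placeholder name for an existing Spark job.
command = spark_submit_args("etl_job.py", RAPIDS_CONF)
print(" ".join(command))
```

The point of the sketch is that the job itself is left alone: acceleration is switched on through configuration, which fits the pitch of taking existing Spark-based jobs and optimising them for GPUs.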

Crucially, Rapids supports Python, one of the preferred programming languages for data science and engineering, and Apache Arrow, which defines a language-independent columnar memory format for flat and hierarchical data, said Mark Beyer, distinguished vice president and analyst with Gartner.
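To illustrate what "columnar" means here, the following is a rough sketch in plain Python. It does not use the Arrow library, and the sample records and the helper `to_columns` are made up for the example. The real Arrow format goes further, with features such as validity bitmaps for missing values, but the basic idea of storing each field as one contiguous, typed buffer is the same.

```python
from array import array

# Row-oriented data: one record per entry, fields interleaved.
# The values are invented for illustration.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]


def to_columns(records):
    """Pivot records into one contiguous, typed buffer per field."""
    return {
        "id": array("q", (r["id"] for r in records)),        # signed 64-bit integers
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
    }


columns = to_columns(rows)

# An aggregate now scans a single buffer instead of touching every record.
total = sum(columns["price"])
print(total)
```

Layouts like this suit GPUs and other parallel hardware because the same operation is applied across a long run of identically typed values.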

"The fact that Nvidia is putting together the acceleration on the GPUs, with what is basically Arrow, supporting the Python, is interesting," he said.

"One of the areas right now that's difficult in terms of building data pipelines: taking stuff from data science and putting it in production. So, when you formalize those libraries, that's going to accelerate production.

"Hardware acceleration to a data science team is not specifically interesting, but to data engineers and the infrastructure designers, it's like, 'Thank god somebody put some discipline in here'."

Beyer said he had not been able to verify Informatica's performance claims, though machine learning projects spend 90 per cent of their time or more on data engineering, so any significant improvement in data processing could have a large impact on the productivity of data scientists.

Kevin Petrie, vice president of research at analyst firm Eckerson Group, said the efforts of machine learning operations (MLOps) were focused on streamlining the software lifecycle of creating, training, deploying, and monitoring ML models.

The Nvidia/Informatica partnership reaches down the stack to ease a different, but related, bottleneck of performance delays due to massive data volumes, he added. "These delays are choking machine learning and other AI initiatives that suck up lots of data as they train and retrain models for accuracy."

"We should expect more innovation like this as enterprises and vendors take more of a full-stack approach to the operationalisation of machine learning." ®
