Despite 'key' partnership with AWS, Meta taps up Microsoft Azure for AI work
Someone got Zuck'd
Meta’s AI business unit set up shop in Microsoft Azure this week and announced a strategic partnership it says will advance PyTorch development on the public cloud.
The deal [PDF] will see Mark Zuckerberg’s umbrella company deploy machine-learning workloads on thousands of Nvidia GPUs running in Azure. While a win for Microsoft, the partnership calls in to question just how strong Meta’s commitment to Amazon Web Services (AWS) really is.
Back in those long-gone days of December, Meta named AWS as its “key long-term strategic cloud provider." As part of that, Meta promised that if it bought any companies that used AWS, it would continue to support their use of Amazon's cloud, rather than force them off into its own private datacenters. The pact also included a vow to expand Meta’s consumption of Amazon’s cloud-based compute, storage, database, and security services.
The AWS-Meta team up also included a collaboration to optimize workloads using the PyTorch machine learning framework — which Meta, then Facebook, released in 2016 — for deployment in the cloud provider’s Elastic Compute Cloud and SageMaker services.
- Meta releases code for massive language model to AI researchers
- Facebook opens political ad data vaults to researchers
- Meta won't migrate future acquisitions out of AWS
- Zuckerberg sued for alleged role in Cambridge Analytica data-slurp scandal
It appears Meta is more than happy to play the field, though, deploying workloads wherever it pleases. Guess that's what they call multi-cloud; it also demonstrates the difference between "key" and "exclusive."
The announcement this week revealed Meta began deploying workloads on Azure’s Nvidia A100-accelerated instances to train AI models in 2021. Meta now plans to expand deployments on Azure to a dedicated cluster consisting of 5,400 of Nvidia’s 80GB A100 GPUs to accelerate AI research and development on “cutting-edge ML training workloads” for its AI business unit.
In fact, the social media giant says it trained its 175 billion parameter OPT-175B natural-language processing transformer model, released in March, in Azure.
“With Azure’s compute power and 1.6TB/s of interconnect bandwidth per VM, we are able to accelerate our ever-growing training demands to better accommodate larger and more innovative AI models,” Meta’s VP of AI Jerome Pesenti said in a statement.
While neither Microsoft nor Meta provided specifics as to how exactly the massive GPU cluster will be used moving forward, there’s a fair chance that, much like the social media giant’s earlier AWS collab, it’ll involve PyTorch.
In addition to the infrastructure deal, Meta said it would collaborate with Microsoft to “scale PyTorch adoption on Azure.”
“We’re happy to work with Microsoft in extending our experience to their customers using PyTorch in their journey from research to production,” Pesenti said.
Later this year, Microsoft plans to roll out PyTorch development accelerators it says will make it easier to deploy the framework on Azure.
The Register reached out to Meta for further comment; we’ll let you know if we hear back. ®