Google Cloud has warned that some of its servers will soon have their firmware upgraded, an event that will likely disrupt workloads.
An email sent to Google Compute Engine (GCE) customers on Tuesday, and an accompanying incident report, state: “We are experiencing an issue with Google Compute Engine beginning in 2020-08. A firmware rollout is being created that should address the issue.”
“The rollout is currently expected to complete next week, but mitigation efforts are still ongoing,” Google’s advisory added. “Affected customers will experience elevated frequency of Host Maintenance events.”
A what? In this support document Google explains that GPU Host Maintenance events translate into downtime while the underlying cloud platform is updated and tweaked.
“GPU instances cannot be live migrated,” the document reads. “You must set your GPU instances to stop for host maintenance events. If needed, you can set your stopped instances to automatically restart after the maintenance event completes.”
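For readers wondering what that setting looks like in practice, here is a minimal Python sketch, not Google's own tooling. It assumes the Compute Engine scheduling fields `onHostMaintenance` and `automaticRestart`, and the `gcloud compute instances set-scheduling` flags as we understand them. The function names, instance name and zone are hypothetical, so check Google's current documentation before relying on any of it.

```python
def gpu_scheduling(auto_restart=True):
    """Build a scheduling body for a GPU instance.

    Assumption: field names follow the Compute Engine API's scheduling
    resource. GPU instances cannot live migrate, so the host
    maintenance policy is TERMINATE (stop) rather than MIGRATE.
    """
    return {
        "onHostMaintenance": "TERMINATE",
        "automaticRestart": auto_restart,
    }


def set_scheduling_command(instance, zone, auto_restart=True):
    """Return the equivalent gcloud invocation as an argument list.

    Assumption: flag names are as documented for
    `gcloud compute instances set-scheduling`; nothing is executed here.
    """
    restart_flag = "--restart-on-failure" if auto_restart else "--no-restart-on-failure"
    return [
        "gcloud", "compute", "instances", "set-scheduling", instance,
        "--zone=" + zone,
        "--maintenance-policy=TERMINATE",
        restart_flag,
    ]


if __name__ == "__main__":
    # Hypothetical instance name and zone, for illustration only.
    print(gpu_scheduling())
    print(" ".join(set_scheduling_command("my-gpu-vm", "us-central1-a")))
```

The sketch only builds the request body and the command line. Actually applying either requires credentials and a real instance.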
Host Maintenance events are not uncommon. Google warns its cloud subscribers that they should expect one every two weeks, and sometimes more frequently. The idea is that you either use the gear for batch jobs such as AI training, or take the restarts into account if you need the systems available all the time.
Google says customers using its cloud servers powered by Nvidia's V100 GPU are unaffected, which tells us that this specific problem affects servers in the G-fleet that feature other GPU accelerators, such as Nvidia's Tesla P4, T4, K80 and P100.
While GCE customers have some work to do ahead of this event, The Register cannot find evidence that whatever issue the firmware upgrade addresses has created noticeable problems. If the upgrade delivers significant performance improvements, it will be a little embarrassing, given that GPUs and SSDs attract premium prices on the basis of their superior specs. ®