Microsoft’s new H200 v5 series VMs for Azure aim to improve GPU performance
Microsoft has announced the launch of new Azure virtual machines (VMs) specifically aimed at increasing cloud-based AI supercomputing capabilities.
The new H200 v5 series VMs are now generally available to Azure customers and will enable companies to tackle increasingly challenging AI workload requirements.
By leveraging the new VM series, users can boost foundation model training and inference capabilities, the tech giant revealed.
Scale, efficiency and performance
In a blog post, Microsoft said the new VM series is already being used by a large number of customers and partners to drive AI capabilities.
“The scale, efficiency and improved performance of our ND H200 v5 VMs are already driving customer adoption and Microsoft AI services such as Azure Machine Learning and Azure OpenAI Service,” the company said.
These include OpenAI, which is using the new VM series to drive research and development and refine ChatGPT for users, according to Trevor Cai, the firm’s head of infrastructure.
“We’re excited to adopt Azure’s new H200 VMs,” he said. “We have seen H200 provide improved performance with minimal porting effort. We look forward to using these VMs to accelerate our research, improve the ChatGPT experience, and further our mission.”
Under the hood of the H200 v5 series
The Azure ND H200 v5 VMs are designed with Microsoft’s systems approach to “improve efficiency and performance,” according to the company, and each includes eight Nvidia H200 Tensor Core GPUs.
Microsoft said this addresses a growing computing power “gap” for business users.
GPUs’ raw compute capabilities have grown faster than their attached memory and memory bandwidth, creating a bottleneck for AI inference and model training, the tech giant said.
“Azure ND H200 v5 Series VMs deliver a 76% increase in High Bandwidth Memory (HBM) to 141 GB and a 43% increase in HBM bandwidth to 4.8 TB/s over the previous generation Azure ND H100 v5 VMs,” Microsoft said in its announcement.
“This increase in HBM bandwidth allows GPUs to access model parameters more quickly, reducing overall application latency, which is a critical metric for real-time applications such as interactive agents.”
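To see why bandwidth matters so directly, consider a rough back-of-envelope calculation: if decoding is memory-bandwidth bound, every weight must be streamed from HBM once per generated token, so the announced figures put a hard floor under per-token latency. The sketch below assumes a 405B-parameter model in FP16 sharded across a VM’s eight GPUs; the H200 bandwidth is from Microsoft’s announcement, while the H100 figure is Nvidia’s published spec, not a number from the post.

```python
# Back-of-envelope lower bound on per-token decode latency when
# inference is memory-bandwidth bound: every FP16 weight has to be
# streamed from HBM once per generated token.
# H200 bandwidth is from Microsoft's announcement; the H100 figure
# is Nvidia's published spec, not from the post.
HBM_BANDWIDTH_TB_S = {"ND H100 v5": 3.35, "ND H200 v5": 4.8}

def min_decode_latency_ms(params_b: float, bytes_per_param: int,
                          num_gpus: int, bw_tb_s: float) -> float:
    """Time to stream one GPU's shard of the weights from HBM, in ms."""
    per_gpu_bytes = params_b * 1e9 * bytes_per_param / num_gpus
    return per_gpu_bytes / (bw_tb_s * 1e12) * 1e3

# A 405B-parameter model in FP16, tensor-parallel across 8 GPUs:
for vm, bw in HBM_BANDWIDTH_TB_S.items():
    print(f"{vm}: at least {min_decode_latency_ms(405, 2, 8, bw):.1f} ms per token")
```

On those assumptions the latency floor drops from roughly 30 ms to 21 ms per token, which is the mechanism behind the latency claim.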
Additionally, the new VM series can fit larger, more complex large language models (LLMs) within the memory of a single machine, the company said. This improves performance and lets users avoid the costly overhead of running distributed applications across multiple VMs.
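The single-machine claim is easy to sanity-check with simple arithmetic: eight H200s provide 8 × 141 GB = 1,128 GB of HBM, versus 640 GB on the previous ND H100 v5 generation, so a model the size of Llama 3.1 405B (roughly 810 GB of weights in FP16) fits on one VM. A minimal sketch, assuming FP16 weights and ignoring KV-cache and activation overhead:

```python
# Rough check: do a model's FP16 weights fit in a single 8-GPU VM's
# HBM? Ignores KV cache and activations, so it understates what a
# real deployment needs; per-GPU capacities are from the announcement.
HBM_PER_GPU_GB = {"ND H100 v5": 80, "ND H200 v5": 141}

def fits_on_one_vm(params_b: float, vm: str, num_gpus: int = 8) -> bool:
    weights_gb = params_b * 2  # 2 bytes per FP16 parameter
    return weights_gb <= HBM_PER_GPU_GB[vm] * num_gpus

for vm in HBM_PER_GPU_GB:
    print(vm, "holds 405B FP16 weights:", fits_on_one_vm(405, vm))
# 810 GB vs 640 GB -> False on H100; 810 GB vs 1,128 GB -> True on H200
```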
Better management of GPU memory for model weights and batch sizes is also a key differentiator for the new VM series, according to Microsoft.
Current GPU memory limitations have a direct impact on throughput and latency for LLM-based inference workloads, and create additional costs for enterprises.
By leveraging increased HBM capacity, the H200 v5 VMs can support larger batch sizes, which Microsoft says dramatically improves GPU utilization and throughput compared to previous iterations.
“In early testing, we observed up to a 35% increase in throughput with ND H200 v5 VMs compared to the ND H100 v5 series for inference workloads running the LLAMA 3.1 405B model (with world size 8, input length 128, output length 8, and maximum batch sizes – 32 for H100 and 96 for H200),” the company said.
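Those batch-size figures line up with simple headroom arithmetic. Assuming the 405B model is served in FP8, one byte per parameter (an assumption on our part; Microsoft’s post does not state the precision used), each GPU holds roughly 50 GB of weights, leaving about 29 GB free on an H100 versus 90 GB on an H200:

```python
# Illustrative headroom arithmetic for the quoted test. Assumes the
# 405B model is served in FP8 (1 byte per parameter) sharded across
# 8 GPUs; Microsoft's post does not state the precision used.
PARAMS_B = 405
weights_per_gpu_gb = PARAMS_B * 1 / 8  # ~50.6 GB of FP8 weights per GPU

for vm, hbm_gb in (("ND H100 v5", 80), ("ND H200 v5", 141)):
    spare_gb = hbm_gb - weights_per_gpu_gb
    print(f"{vm}: ~{spare_gb:.0f} GB per GPU left for KV cache and batching")
# ~29 GB vs ~90 GB of headroom: roughly 3x, in line with the 3x larger
# maximum batch size (32 -> 96) Microsoft reports for this workload.
```

On this sketch the H200 offers roughly three times the free memory per GPU, matching the threefold jump in maximum batch size that underpins the reported throughput gain.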