xAI’s Colossus supercomputing cluster uses 100,000 Nvidia Hopper GPUs – and it was all made possible using Nvidia’s Spectrum-X Ethernet networking platform
- Nvidia and xAI are working together on the development of Colossus
- xAI has significantly reduced ‘flow collisions’ during AI model training
- Spectrum-X has been crucial in training the Grok AI model family
Nvidia has shed light on how xAI’s ‘Colossus’ supercomputer cluster can handle 100,000 Hopper GPUs – and it’s all thanks to the use of the chipmaker’s Spectrum-X Ethernet networking platform.
Spectrum-X, the company revealed, is designed to deliver massive performance capabilities to multi-tenant, hyperscale AI factories using a Remote Direct Memory Access (RDMA) network.
The platform has been used at Colossus, the world’s largest AI supercomputer, since its inception. The Elon Musk-owned company has used the cluster to train its Grok series of large language models (LLMs), which power the chatbots served to X users.
The facility was built in just 122 days in partnership with Nvidia, and xAI is currently expanding it, with plans to deploy a total of 200,000 Nvidia Hopper GPUs.
Training Grok takes some serious firepower
The Grok AI models are extremely large. Grok-1 has 314 billion parameters, and Grok-2 outperformed Claude 3.5 Sonnet and GPT-4 Turbo at the time of its launch in August.
Training these models requires significant network performance. Using Nvidia’s Spectrum-X platform, xAI recorded no application latency degradation or packet loss due to ‘flow collisions’ or bottlenecks within AI network paths.
xAI revealed that it was able to maintain 95% data throughput, powered by Spectrum-X’s congestion control capabilities. The company added that this level of performance cannot be delivered at this scale over standard Ethernet.
According to Nvidia, traditional Ethernet at this scale usually produces thousands of flow collisions while delivering only 60% data throughput.
An xAI spokesperson said the combination of Hopper GPUs and Spectrum-X has allowed the company to “push the boundaries of AI model training” and create a “super-accelerated and optimized AI factory.”
“AI is becoming mission critical and requires better performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at Nvidia.
“The Nvidia Spectrum-X Ethernet networking platform is designed to provide innovators like xAI with faster processing, analysis and execution of AI workloads, in turn accelerating the development, deployment and time to market of AI solutions.”
The Spectrum-X platform includes the Spectrum SN5600 Ethernet switch, which supports port speeds up to 800 Gb/s and is based on the Spectrum-4 switch ASIC, according to Nvidia.
xAI has chosen to combine the Spectrum-X SN5600 switch with Nvidia BlueField-3 SuperNICs for better performance.